code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0

[BUG] - UNKNOWN label appears for certain manually uploaded #226

Closed xavialex closed 1 year ago

xavialex commented 1 year ago

In certain cases, the label UNKNOWN appears. While working on a dataset of approx. 2k records, I realized that one of the manually uploaded labels wasn't showing up. When searching for it, I found the UNKNOWN label instead (image attached). How can I detect or prevent this? Unfortunately, I couldn't reproduce the issue when isolating the problem into a smaller set; it only occurs in my actual dataset, so I cannot share more details. The worst part is that the affected record becomes unusable: I can't select another label, erase the existing one, etc.

Extra info:

See in Discord

JWittmeyer commented 1 year ago

Hi @xavialex,

usually, this shouldn't happen so it's somewhat hard to analyze 😀

First, the UNKNOWN label/task combination in the data browser indicates that the label id used in the record couldn't be matched to the project.

Since this doesn't seem to be a connection issue (where, e.g., the label data couldn't be collected from the backend), it seems to be related to the actual record_label_association (how we store label data per record).

With a little bit of database manipulation, I could reproduce a similar state (image attached).

For this to happen, I "simply" switched out the label id from my actual project for one of a different project.

Since label ids are collected from the project during labeling in refinery, I'd assume that this was caused by some upload issue where a label name was provided and matched to an existing label without taking the project relation into account. (I'll look into this later to see if I can find something in the source code.)

To identify the records with the issue, this query can be run on the database:

SELECT r.project_id, r.data
FROM record_label_association rla
INNER JOIN labeling_task_label ltl
    ON rla.project_id <> ltl.project_id AND rla.labeling_task_label_id = ltl.id
INNER JOIN record r
    ON rla.record_id = r.id AND rla.project_id = r.project_id
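
To make the query above easier to try out, here is a minimal sketch that runs it against an in-memory SQLite database. The schema is a reduced, hypothetical version of the three tables (only the columns the query touches), and the ids `p1`, `p2`, `r1`, `l1`, etc. are made-up sample values; refinery's real database has more columns, so treat this as an illustration of the check, not of the actual setup.

```python
import sqlite3

# Reduced, hypothetical schema: only the columns used by the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (id TEXT, project_id TEXT, data TEXT);
CREATE TABLE labeling_task_label (id TEXT, project_id TEXT, name TEXT);
CREATE TABLE record_label_association (
    id TEXT, project_id TEXT, record_id TEXT, labeling_task_label_id TEXT
);
INSERT INTO record VALUES
    ('r1', 'p1', '{"text": "ok"}'),
    ('r2', 'p1', '{"text": "broken"}');
INSERT INTO labeling_task_label VALUES
    ('l1', 'p1', 'positive'),
    ('l2', 'p2', 'positive');
-- r1 uses a label of its own project, r2 uses a label that belongs to p2
INSERT INTO record_label_association VALUES
    ('a1', 'p1', 'r1', 'l1'),
    ('a2', 'p1', 'r2', 'l2');
""")

# Same query as in the comment above: associations whose label
# belongs to a different project than the association itself.
MISMATCH_QUERY = """
SELECT r.project_id, r.data
FROM record_label_association rla
INNER JOIN labeling_task_label ltl
    ON rla.project_id <> ltl.project_id AND rla.labeling_task_label_id = ltl.id
INNER JOIN record r
    ON rla.record_id = r.id AND rla.project_id = r.project_id
"""

mismatched = conn.execute(MISMATCH_QUERY).fetchall()
print(mismatched)
```

With this sample data, only the record `r2` is returned, since its association points at a label owned by project `p2`.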

Depending on the number of affected records, we can try different solutions:

  1. < 10 records -> remove the record label associations and label by hand
  2. > 10 records -> identify the wrong label id and its name -> find the label with the same name in the actual project -> replace the wrong id with the correct one. (I can provide some queries for that as well.)
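
The second option above could look roughly like the following sketch. It is not a query from the refinery team: the function `repair_by_name`, the reduced schema and the sample ids are all hypothetical, and it assumes that a label name identifies exactly one label within a project (in a real project, names are likely scoped per labeling task, so the lookup would need the task as well). Associations without exactly one match are left untouched and reported.

```python
import sqlite3

# Reduced, hypothetical schema with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labeling_task_label (id TEXT, project_id TEXT, name TEXT);
CREATE TABLE record_label_association (
    id TEXT, project_id TEXT, record_id TEXT, labeling_task_label_id TEXT
);
INSERT INTO labeling_task_label VALUES
    ('l1', 'p1', 'positive'),
    ('l2', 'p2', 'positive'),
    ('l3', 'p2', 'neutral');
INSERT INTO record_label_association VALUES
    ('a1', 'p1', 'r1', 'l1'),   -- fine
    ('a2', 'p1', 'r2', 'l2'),   -- foreign label, same name exists in p1
    ('a3', 'p1', 'r3', 'l3');   -- foreign label, no counterpart in p1
""")


def repair_by_name(conn, project_id):
    """Replace foreign label ids by the same-named label of the project."""
    # Find associations whose label belongs to another project, with its name.
    rows = conn.execute(
        """
        SELECT rla.id, wrong.name
        FROM record_label_association rla
        INNER JOIN labeling_task_label wrong
            ON rla.labeling_task_label_id = wrong.id
           AND rla.project_id <> wrong.project_id
        WHERE rla.project_id = ?
        """,
        (project_id,),
    ).fetchall()

    repaired, unresolved = [], []
    for association_id, label_name in rows:
        # Look up the label with the same name inside the correct project.
        matches = conn.execute(
            "SELECT id FROM labeling_task_label WHERE project_id = ? AND name = ?",
            (project_id, label_name),
        ).fetchall()
        if len(matches) != 1:
            # No match or an ambiguous one: leave it for manual labeling.
            unresolved.append(association_id)
            continue
        conn.execute(
            "UPDATE record_label_association SET labeling_task_label_id = ? WHERE id = ?",
            (matches[0][0], association_id),
        )
        repaired.append(association_id)
    conn.commit()
    return repaired, unresolved


repaired, unresolved = repair_by_name(conn, "p1")
print(repaired, unresolved)
```

Running something like this on a copy of the database first seems advisable, since the update overwrites the original (wrong) label ids.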

Note that all of this is under my previous assumption so it might not be true for your case.

Since there is a lot of guesswork involved on my side, it might be best to look into this together. Would you be willing to schedule a short call where we can look into it (e.g. via Discord)?

xavialex commented 1 year ago

Thanks @JWittmeyer. Absolutely, we can have a short call to discuss this, since I haven't been able to give you a proper reproduction set. We can do it in the next Office Hours on Wednesday or at another time. Let's discuss it through Discord; my nickname is the same, so ping me any time.

JWittmeyer commented 1 year ago

After we looked into the issue, it seems to be related to a specific record set and not, as previously assumed, caused by setting a label id from a different project. The actual effect is a NULL value in the database.

Since we currently have no way to reproduce it we are working on a failsafe.

The failsafe includes the following query, run with the respective project id:

SELECT * FROM record_label_association rla 
WHERE rla.project_id = ' '
 AND rla.labeling_task_label_id IS NULL

The failsafe will also be run by refinery after every import.

Failsafe steps:

  1. Run the query to check whether NULL entries exist
  2. If > 0 -> create labeling task "import_issues" with label "reference_error"
  3. Set the id of that label on all faulty entries
  4. Notify the user
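
The failsafe steps above can be sketched as follows. This is not the code shipped in refinery (the thread does not show it); the function `run_import_failsafe`, the reduced SQLite schema and the sample ids are assumptions for illustration, and the user notification is reduced to a `print`.

```python
import sqlite3
import uuid

# Reduced, hypothetical schema with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labeling_task (id TEXT, project_id TEXT, name TEXT);
CREATE TABLE labeling_task_label (
    id TEXT, project_id TEXT, labeling_task_id TEXT, name TEXT
);
CREATE TABLE record_label_association (
    id TEXT, project_id TEXT, record_id TEXT, labeling_task_label_id TEXT
);
INSERT INTO record_label_association VALUES
    ('a1', 'p1', 'r1', 'l1'),
    ('a2', 'p1', 'r2', NULL),
    ('a3', 'p2', 'r3', NULL);
""")


def run_import_failsafe(conn, project_id):
    """Flag associations with a NULL label id; return how many were flagged."""
    # Step 1: check whether NULL entries exist for this project.
    (faulty,) = conn.execute(
        "SELECT COUNT(*) FROM record_label_association "
        "WHERE project_id = ? AND labeling_task_label_id IS NULL",
        (project_id,),
    ).fetchone()
    if faulty == 0:
        return 0

    # Step 2: create the task "import_issues" with the label "reference_error".
    task_id, label_id = str(uuid.uuid4()), str(uuid.uuid4())
    conn.execute(
        "INSERT INTO labeling_task VALUES (?, ?, 'import_issues')",
        (task_id, project_id),
    )
    conn.execute(
        "INSERT INTO labeling_task_label VALUES (?, ?, ?, 'reference_error')",
        (label_id, project_id, task_id),
    )

    # Step 3: set the new label id on all faulty entries of this project.
    conn.execute(
        "UPDATE record_label_association SET labeling_task_label_id = ? "
        "WHERE project_id = ? AND labeling_task_label_id IS NULL",
        (label_id, project_id),
    )
    conn.commit()

    # Step 4: notify the user (stand-in for a real notification).
    print(f"{faulty} faulty label association(s) marked as reference_error")
    return faulty


flagged = run_import_failsafe(conn, "p1")
```

Because the faulty entries end up with a regular label, they can then be found through the usual label filter and relabeled by hand. Entries of other projects (here `a3` in `p2`) are not touched.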

With this approach, we should be able to identify faulty records so the user can handle them accordingly.

Note that this is only a workaround until we can safely reproduce (and of course eliminate) the error.

JWittmeyer commented 1 year ago

After some testing and comparison with previous versions, this issue is fixed with version 1.8.0. The code lines relevant to the issue were already changed in PR https://github.com/code-kern-ai/refinery-gateway/pull/112.

The failsafe logic will stay included for now. If the error reappears, please reopen the issue.

@xavialex thanks again for your help :)