code-kern-ai / refinery

The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
https://www.kern.ai
Apache License 2.0

Allow the user to upload Information Extraction labels #257

Open DerKernigeFeuerpfeil opened 1 year ago

DerKernigeFeuerpfeil commented 1 year ago

Is your feature request related to a problem? Please describe.

I want to use refinery to label data for information extraction, but I cannot upload my existing labels, which sets my project back considerably.

Describe the solution you'd like

I want to tokenize my data in a notebook with the same tokenizer that refinery uses and then match the labels to the respective tokens. Technically, this would be realised through a JSON attribute, e.g. label__headline__MANUAL, whose value is a list with one label per token, e.g. ["0", "0", "PERSON", "0"] (the "0" could also be null or any other placeholder that is specified in the docs). I want to upload this data to refinery. During the tokenization process, I want refinery to tell me if the output of its internal tokenizer and my pre-tokenized data do not match. For that case, there are two levels of complexity I can imagine:

  1. simple: it should stop the tokenization process and throw an error that the tokenization did not match my pre-provided tokens (in length)
  2. medium: it should additionally tell me what record caused this and what the tokenization lengths were (e.g. refinery produced 200 tokens while I only provided a list of 193 tokens)
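
As a rough illustration of the "medium" variant, the check could look something like the sketch below. Everything in it is an assumption made for the example: `validate_token_labels`, `TokenLabelMismatchError` and `whitespace_tokenize` are hypothetical names, and the whitespace tokenizer is only a stand-in for whatever tokenizer refinery uses internally. The record follows the label__headline__MANUAL convention proposed above.

```python
from typing import Callable


class TokenLabelMismatchError(ValueError):
    """Raised when a record's label list length differs from its token count."""


def whitespace_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer for illustration only; NOT refinery's tokenizer.
    return text.split()


def validate_token_labels(
    records: list[dict],
    text_attribute: str,
    label_attribute: str,
    tokenize: Callable[[str], list[str]] = whitespace_tokenize,
) -> None:
    """Stop at the first record whose label count does not match its token count."""
    for index, record in enumerate(records):
        tokens = tokenize(record[text_attribute])
        labels = record[label_attribute]
        if len(tokens) != len(labels):
            # "medium" level: report the offending record and both lengths.
            raise TokenLabelMismatchError(
                f"record {index}: tokenizer produced {len(tokens)} tokens, "
                f"but {len(labels)} labels were provided"
            )


# Hypothetical upload payload with one label per token.
records = [
    {
        "headline": "Yesterday evening Merkel spoke",
        "label__headline__MANUAL": ["0", "0", "PERSON", "0"],
    },
]

# Passes silently because 4 tokens match 4 labels.
validate_token_labels(records, "headline", "label__headline__MANUAL")
```

The "simple" variant would be the same check with a generic error message that omits the record index and the two lengths.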

Describe alternatives you've considered

Hacking the project import/export functionality, which is rather complicated.

Additional context