Is your feature request related to a problem? Please describe.
I want to use refinery to label data for information extraction, but I cannot upload my existing labels, which sets my project back significantly.
Describe the solution you'd like
I want to tokenize my data in a notebook with the same tokenizer that refinery uses, and then match my labels to the respective tokens. Technically, this would be realised through a JSON attribute, e.g. `label__headline__MANUAL`, whose value is a list with one label per token, e.g. `["0", "0", "PERSON", "0"]` (the `"0"` could also be `null` or whatever else is specified in the docs). I want to upload this data to refinery. During the tokenization process, refinery should tell me if the internal tokenizer and my pre-tokenized data do not match. If so, there are two levels of complexity I can imagine:
simple: it should stop the tokenization process and throw an error that the tokenization did not match my pre-provided tokens (in length)
medium: it should additionally tell me which record caused the mismatch and what the tokenization lengths were (e.g. refinery produced 200 tokens while I only provided a list of 193 labels)
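To make the request concrete, here is a minimal sketch of the workflow from the notebook side. It uses a plain whitespace split as a stand-in tokenizer (refinery's actual internal tokenizer, e.g. a spaCy pipeline, may split differently), and all function and record names are illustrative assumptions, not existing refinery APIs. The `check_token_count` helper shows the "medium" error behaviour described above:

```python
import json

def tokenize(text):
    # Stand-in tokenizer (assumption): replace with the same tokenizer
    # that refinery uses internally so token counts line up.
    return text.split()

def build_label_list(text, labels_by_token_index):
    """Build one label per token; unlabeled tokens get "0" (could also be null)."""
    tokens = tokenize(text)
    return [labels_by_token_index.get(i, "0") for i in range(len(tokens))]

def check_token_count(record_id, provided_labels, produced_tokens):
    # "medium" complexity: report which record mismatched and both lengths.
    if len(provided_labels) != len(produced_tokens):
        raise ValueError(
            f"record {record_id}: tokenizer produced {len(produced_tokens)} "
            f"tokens but {len(provided_labels)} labels were provided"
        )

# Hypothetical record with a pre-labeled PERSON span on the first two tokens.
record = {"headline": "Angela Merkel visits Paris"}
record["label__headline__MANUAL"] = build_label_list(
    record["headline"], {0: "PERSON", 1: "PERSON"}
)
print(json.dumps(record))  # what would be uploaded to refinery
```

On upload, refinery would re-tokenize `headline` and run the equivalent of `check_token_count` per record before accepting the labels.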
Describe alternatives you've considered
hacking the project import/export functionality, which is rather complicated.
Additional context