@frascuchon how do you see this working with potential alignment issues when logging predictions from either the spaCy or the Hugging Face tokenizer?
Argilla span definitions are at the character level. The only thing we must ensure is that the character spans are aligned with the tokenization. So, selecting the right dataset tokenization before logging the record should resolve this problem.
We can also relax/adapt the span definition for any misalignment found (something similar to the `alignment_mode="expand"` argument of the `Doc.char_span` method).
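Something along these lines (the text, offsets and label below are just made up for illustration):

```python
import spacy

# A blank pipeline is enough here: we only need spaCy's tokenizer.
nlp = spacy.blank("en")
doc = nlp("Argilla stores spans at the character level.")

# Hypothetical character span that does not fall exactly on token boundaries.
start, end, label = 1, 6, "TOOL"  # covers "rgill", cutting into the token "Argilla"

# alignment_mode="expand" snaps the span outwards to the nearest token boundaries
# instead of returning None for a misaligned span.
span = doc.char_span(start, end, label=label, alignment_mode="expand")

if span is not None:
    # Recover the relaxed character offsets after expansion.
    print(span.text, span.start_char, span.end_char, span.label_)
    # -> Argilla 0 7 TOOL
```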
Will it be removed or made optional? I can still see the value of providing pretokenized text.
Related to this, will the tokenization (especially if we let users configure a tokenizer) happen on the server or the client side of Argilla?
The API will still accept the `tokens` field as an optional value.
The client can set up the dataset tokenization when logging data, or fill the token list directly, so records will be sent to the server with the selected tokenization.
The main idea of this feature is to create training datasets for training different models. In those cases, the stored tokens may be discarded depending on the model to be trained.
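Roughly, something like this from the client side with the current Python client (dataset name, text, tokens and offsets are illustrative):

```python
import argilla as rg

text = "Argilla is developed in Madrid."
# Tokenization selected by the client, e.g. produced by spaCy or a
# Hugging Face tokenizer; it only needs to be consistent with the spans below.
tokens = ["Argilla", "is", "developed", "in", "Madrid", "."]

record = rg.TokenClassificationRecord(
    text=text,
    tokens=tokens,
    # Character-level spans: (label, start, end), end exclusive.
    prediction=[("LOC", 24, 30)],
)

# Illustrative dataset name.
rg.log(record, name="token-classification-demo")
```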
Let's discuss different use cases for token classification data ingestion.
Nice!
I would really love to see this in Argilla! My use case is exactly that: using the same dataset to train different models that may or may not use the same tokenizer. Additionally, I'd like to be able to log predictions coming from any model and at any granularity (characters if need be). One use case I need this for is logging spans that, for example, only contain two whitespace characters. The application is marking potential errors in a writing-aid app, where accidentally typing more than one space is a common mistake.
I experimented with replacing whitespace with ▁, as SentencePiece tokenizers do, to get it to work with Argilla for now, but I'm not sure yet about any long-term repercussions that might have... I mainly had to do that because of these lines here, where whitespace is treated differently when tokens and text are aligned.
From my experience so far, I'd vote for only storing character spans and the raw text, and only caring about token boundaries on dataset export. But I'm also curious about any downsides that approach might have.
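For what it's worth, something like this is what I have in mind for the export step (the text, span and model name are just examples; it assumes a fast tokenizer that returns offset mappings):

```python
from transformers import AutoTokenizer

text = "Argilla spans are stored at the character level."
# Character-level span over "Argilla" (end exclusive), annotated in the UI.
char_spans = [("TOOL", 0, 7)]

# Any fast tokenizer exposes offset mappings; the model name is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

def char_span_to_token_span(start, end, offsets):
    """Return the (first, last) token indices overlapping the range [start, end)."""
    token_ids = [
        i for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < end and tok_end > start
    ]
    return (token_ids[0], token_ids[-1]) if token_ids else None

for label, start, end in char_spans:
    print(label, char_span_to_token_span(start, end, encoding["offset_mapping"]))
```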
@mnschmit @anakin87 we are hard at work on this issue at the moment. Would any of you be interested in providing some feedback and pointers w.r.t. what you would expect from the implementation? If so, could you ping me on Slack or send me an email at david@argilla.io?
Closing this issue, as this improvement has now been incorporated into the `SpanQuestion` in Feedback datasets.
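For reference, setting up a span question in a Feedback dataset looks roughly like the sketch below (field, question and label names are illustrative; please check the current docs for the exact API):

```python
import argilla as rg

# Illustrative Feedback dataset with a span question over a text field.
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.SpanQuestion(
            name="entities",
            field="text",                 # the field the spans are drawn on
            labels=["PER", "ORG", "LOC"], # example label set
        )
    ],
)
```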