argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[Records] Remove `tokens` field for token classification, allow to configure a dataset tokenizer, and set one by default #1947

Closed · frascuchon closed this 7 months ago

frascuchon commented 1 year ago

Targets:

davidberenstein1957 commented 1 year ago

@frascuchon how do you see this handling potential alignment issues when logging predictions from either the spaCy or the Hugging Face tokenizer?

frascuchon commented 1 year ago

Argilla span definitions are at the character level. The only thing we must ensure is that the character spans are aligned with the tokenization. So, selecting the right dataset tokenization before logging the record should resolve this problem.

We can also relax/adapt the span definition for any misalignment found (something similar to `alignment_mode="expand"` in the `Doc.char_span` method).
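
As a point of reference, a minimal sketch of that spaCy behavior (using a blank pipeline; the example text is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Argilla is a collaboration tool")

# Chars 0-4 cut the token "Argilla" in half, so strict alignment (the default) fails
assert doc.char_span(0, 4, label="ORG") is None

# alignment_mode="expand" snaps the span outward to the nearest token boundaries
span = doc.char_span(0, 4, label="ORG", alignment_mode="expand")
print(span.text)  # "Argilla"
```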

dvsrepo commented 1 year ago

Will it be removed or made optional? I can still see the value of providing pretokenized text.

Related to this, will the tokenization (especially if we let users configure a tokenizer) happen on the server or the client side of Argilla?

frascuchon commented 1 year ago

The API will still accept the `tokens` field as an optional value.

The client can set up the dataset tokenization when logging data, or fill in the token list directly; either way, records are sent to the server with the selected tokenization.
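
A sketch of that client-side flow as it stood with the v1 API, assuming `rg.TokenClassificationRecord` and `rg.log`; the dataset name is illustrative and the dataset-level tokenizer hook is hypothetical:

```python
import argilla as rg

# Today the client fills the token list itself; the proposal would let a
# configured dataset tokenizer do this at logging time (hypothetical, not yet an API).
record = rg.TokenClassificationRecord(
    text="Argilla is a collaboration tool",
    tokens=["Argilla", "is", "a", "collaboration", "tool"],
    prediction=[("ORG", 0, 7)],  # character-level span: (label, start, end)
)
rg.log(record, name="token-classification-demo")
```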

The main idea of this feature is to create datasets for training different models. In those cases, the stored tokens may be discarded depending on the model being trained.

Let's discuss different use cases for token classification data ingestion.

dvsrepo commented 1 year ago

Nice!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity.

mnschmit commented 1 year ago

I would really love to see this in Argilla! My use case is exactly that: use the same dataset for training different models that may or may not use the same tokenizer. Additionally, I'd like to be able to log predictions coming from any model and at any granularity (characters if need be). One use case I need this for is logging spans that, for example, contain only two whitespace characters. The application is marking potential errors in a writing-aid app, where accidentally typing more than one space is a common mistake.

To get this to work with Argilla at the moment, I experimented with replacing whitespace with ▁, as sentencepiece tokenizers do, but I'm not sure yet about any long-term repercussions that might have... I mainly had to do that because of these lines, where whitespace is treated differently when tokens and text are aligned.
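
A rough sketch of that workaround, purely illustrative (the helper name and the example offsets are made up):

```python
def make_spaces_visible(text: str) -> str:
    # str.replace preserves string length, so character offsets are unchanged
    return text.replace(" ", "\u2581")  # U+2581 LOWER ONE EIGHTH BLOCK (▁)

text = "accidentally  doubled"       # the double space at chars 12-14 is the error
visible = make_spaces_visible(text)  # "accidentally▁▁doubled"
print(visible[12:14])                # "▁▁" - the span survives the rewrite
```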

From my experience so far, I'd vote for only storing character spans and the raw text and only caring about token boundaries on dataset export. But I'm also curious about any downsides that might have.
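
That split (character spans at rest, token boundaries only on export) could look something like the following hedged sketch; the helper and the token offsets are illustrative, not an Argilla API:

```python
from typing import List, Tuple

def char_spans_to_token_labels(
    text: str,
    spans: List[Tuple[str, int, int]],     # (label, start, end) in characters
    token_offsets: List[Tuple[int, int]],  # (start, end) per token, from any tokenizer
) -> List[str]:
    """Derive BIO tags for the target model's tokenizer at export time."""
    labels = ["O"] * len(token_offsets)
    for label, s_start, s_end in spans:
        inside = False
        for i, (t_start, t_end) in enumerate(token_offsets):
            if t_start < s_end and t_end > s_start:  # token overlaps the span
                labels[i] = ("I-" if inside else "B-") + label
                inside = True
    return labels

# e.g. with whitespace token offsets over "Argilla is a collaboration tool"
offsets = [(0, 7), (8, 10), (11, 12), (13, 26), (27, 31)]
print(char_spans_to_token_labels("Argilla is a collaboration tool",
                                 [("ORG", 0, 7)], offsets))
# ['B-ORG', 'O', 'O', 'O', 'O']
```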

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 90 days with no activity.

davidberenstein1957 commented 9 months ago

@mnschmit @anakin87 we are hard at work tackling this issue at the moment. Would either of you be interested in providing some feedback and pointers regarding what you would expect from the implementation? If so, could you ping me on Slack or send me an email at david@argilla.io?

nataliaElv commented 7 months ago

Closing this issue, as this improvement has now been incorporated into the `SpanQuestion` in Feedback datasets.
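
For reference, a minimal sketch of span annotation with `SpanQuestion`, assuming the v1 `FeedbackDataset` API (the field and label names are illustrative):

```python
import argilla as rg

# Spans are annotated directly at the character level over a text field,
# with no tokens stored on the record.
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.SpanQuestion(
            name="entities",
            field="text",               # the field the spans are drawn on
            labels=["PER", "ORG", "LOC"],
        )
    ],
)
```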