argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0
3.79k stars 354 forks

[FEATURE] Auto-annotation of Repeated Tokens #5302

Open bikash119 opened 1 month ago

bikash119 commented 1 month ago

Is your feature request related to a problem? Please describe.

The same bigrams, trigrams, and other n-grams often appear multiple times in a text being annotated. The annotator has to label every occurrence by hand; otherwise, the tokens of any occurrence left unlabelled default to "O" under the IOB scheme.

Describe the solution you'd like

The Argilla UI already provides an easy-to-use interface for annotating tokens in a text. However, I've identified a use case where an additional feature could improve efficiency.

Sample claim text:

The method of claim 1 that includes the step of locating said pillow directly between a tympanic membrane and a round window membrane, but without contacting the round window membrane to block the approach of the tympanic membrane into close proximity to said round window membrane.

Assume we have labels like: ["method of use", "product", "machine", "system"]. Here, the annotator labels the first occurrence of "tympanic membrane" as product. Because "tympanic membrane" appears more than once, the annotator must label every occurrence; otherwise, each token of an unlabelled occurrence is implicitly tagged 'O' under the IOB scheme. This inconsistency makes it harder for the model to learn that "tympanic membrane" is a single product entity rather than two separate tokens, "tympanic" and "membrane".
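To make the problem concrete, here is a minimal sketch of how the sample claim ends up tagged when only the first occurrence is annotated. It assumes simple whitespace tokenization and character-offset spans; the helper `iob_tags` is hypothetical and is not part of Argilla's API.

```python
import re

def iob_tags(text, spans):
    """Tag whitespace-separated tokens with IOB labels.

    `spans` is a list of (start, end, label) character offsets.
    Tokens not covered by any span get the default tag "O".
    """
    tags = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tags.append((match.group(), tag))
    return tags

text = (
    "The method of claim 1 that includes the step of locating said pillow "
    "directly between a tympanic membrane and a round window membrane, but "
    "without contacting the round window membrane to block the approach of "
    "the tympanic membrane into close proximity to said round window membrane."
)

# The annotator labels only the first occurrence of the bigram.
phrase = "tympanic membrane"
first = text.find(phrase)
annotated = [(first, first + len(phrase), "product")]

tagged = iob_tags(text, annotated)
tympanic_tags = [tag for token, tag in tagged if token == "tympanic"]
print(tympanic_tags)  # the second occurrence falls back to "O"
```

The first "tympanic" is tagged `B-product`, while the second, unlabelled one is tagged `O`, which is the conflicting signal described above.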

Proposed Improvement

When an annotator labels a token (e.g., "tympanic membrane" as "product"), the system could automatically identify and suggest the same label for all exact matches of that token in the text. This would:

- Reduce repetitive labeling actions
- Save significant time, especially in longer texts with recurring terms
- Ensure consistency in labeling across the document
- Prevent accidental omissions that could lead to incorrect 'O' labels in the IOB scheme
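A rough sketch of the matching step behind this proposal, assuming spans are stored as (start, end, label) character offsets and that "exact match" means a case-sensitive, whole-word match. The function name `suggest_repeated_spans` is hypothetical and does not refer to any existing Argilla function.

```python
import re

def suggest_repeated_spans(text, span):
    """Return suggested spans for every other exact match of an annotated span.

    `span` is a (start, end, label) tuple of character offsets.
    Assumes the annotated phrase begins and ends with a word character,
    so the word-boundary anchors behave as expected.
    """
    start, end, label = span
    phrase = text[start:end]
    pattern = r"\b" + re.escape(phrase) + r"\b"
    suggestions = []
    for match in re.finditer(pattern, text):
        if match.start() != start:  # skip the span the annotator already labelled
            suggestions.append((match.start(), match.end(), label))
    return suggestions

text = (
    "The method of claim 1 that includes the step of locating said pillow "
    "directly between a tympanic membrane and a round window membrane, but "
    "without contacting the round window membrane to block the approach of "
    "the tympanic membrane into close proximity to said round window membrane."
)

# The annotator labels the first "tympanic membrane" as "product".
phrase = "tympanic membrane"
first = text.find(phrase)
suggestions = suggest_repeated_spans(text, (first, first + len(phrase), "product"))

for s, e, label in suggestions:
    print(text[s:e], label)
```

Returning the matches as suggestions rather than applying them directly keeps the annotator in control: the same surface form may deserve a different label in another context, so each match can still be accepted or rejected.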

davidberenstein1957 commented 1 month ago

@nataliaElv, seems like an interesting feature that has some overlap with bulk for span questions.

davidberenstein1957 commented 1 month ago

@frascuchon, would this be easier now that advanced queries are included?