argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

Token classification weak labelling #1749

Open dvsrepo opened 2 years ago

dvsrepo commented 2 years ago

As in the text classification task, the "Weak labeling" mode in token classification must allow tagging entities by defining a query and an entity label (the rule).

Given a rule, the Weak labeling mode for token classification will tag entities based on the matched tokens/words in the search results returned by the API.

How an entity is tagged from the matched tokens will be determined by a labeling function provided as an attribute of the rule. For now, only one labeling function will be supported, exact_match, where every matched token/word is tagged with the rule's label.

For example, given a labeling rule with the query Par*, the label PLACE, and the matched record Paris is the city of light, the labeling function will tag the token Paris as a PLACE.
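The Par* example above can be sketched in a few lines of Python. This is only an illustration, not the Argilla implementation: the function signature, whitespace tokenization, case-insensitive wildcard matching, and the start/end/label entity shape are all assumptions made for the sketch.

```python
from fnmatch import fnmatchcase


def exact_match(text, query, label):
    """Tag every whitespace-separated token that matches `query` with `label`."""
    entities = []
    cursor = 0
    for token in text.split():
        start = text.index(token, cursor)  # character offset of this token
        end = start + len(token)
        cursor = end
        # Assumption: case-insensitive wildcard match, where `*` means "any characters".
        if fnmatchcase(token.lower(), query.lower()):
            entities.append({"start": start, "end": end, "label": label})
    return entities


print(exact_match("Paris is the city of light", "Par*", "PLACE"))
# → [{'start': 0, 'end': 5, 'label': 'PLACE'}]
```

In the proposal the matching itself would come from the search results returned by the API; the local wildcard match here only stands in for that step.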

An important part of this feature is that the UI visualizes the tagged entities in the records it currently displays.

frascuchon commented 1 year ago

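Things to be analyzed:

  • How to display entities generated by the rule labeling function for records that already have predictions/annotations

  • The rule labeling function processing must be computed by the server API; otherwise, there will be duplicated logic in the UI app and the Python client.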

The request:

GET http://.../labeling/rules/dd:search?label=ALF&labeling_function=lab_D

will return:

{
  "total": 1,
  "records": [
    {
      "id": "522a4e28-57fd-4f58-8da8-0b5117f716de",
      "status": "Default",
      "annotation": {
        "agent": "lab_D",
        "entities": [
          {
            "start": 0,
            "end": 4,
            "label": "ALF",
            "score": 1
          }
        ]
      },
      "annotations": {
        "lab_D": {
          "entities": [
            {
              "start": 0,
              "end": 4,
              "label": "ALF",
              "score": 1
            }
          ]
        }
      },
      "metrics": {},
      "text": "what do you think?",
      "tokens": [
        "what",
        "do",
        "you",
        "think?"
      ]
    }
  ]
}

By using the new "annotations" field, which is keyed by agent, the API can combine the labeling function matches with the original annotations.
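A client could read such a response roughly as follows. This is a sketch only: the helper name collect_entities is hypothetical, and it assumes that every key under "annotations" is an agent (a labeling function such as lab_D, or a human annotator) holding its own "entities" list, as the sample response suggests.

```python
def collect_entities(record):
    """Return (agent, matched_text, label) triples for every agent in `annotations`."""
    text = record["text"]
    spans = []
    for agent, annotation in record.get("annotations", {}).items():
        for entity in annotation.get("entities", []):
            matched_text = text[entity["start"]:entity["end"]]
            spans.append((agent, matched_text, entity["label"]))
    return spans


# Record trimmed from the sample response above.
record = {
    "text": "what do you think?",
    "annotations": {
        "lab_D": {"entities": [{"start": 0, "end": 4, "label": "ALF", "score": 1}]}
    },
}

print(collect_entities(record))
# → [('lab_D', 'what', 'ALF')]
```

Keeping the entities grouped per agent is what would let the UI show rule matches next to existing annotations without overwriting them.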

dvsrepo commented 1 year ago

Thanks! Yes, let's take a look on Friday together.

Things to be analyzed:

  • How to display entities generated by the rule labeling function for records that already have predictions/annotations

Do you mean "materialized" entities by the labeling rule or just the matching/selected tokens defined in the weak labeling mode? Currently, I understand we are using the "pink" color font highlighter to indicate the matched tokens, right?

  • The rule labeling function processing must be computed by the server API; otherwise, there will be duplicated logic in the UI app and the Python client.

By using the new "annotations" field, the API can combine the labeling function matches with the original annotations.

This can be really great indeed. Also agree on unifying server and client computation of labeling functions.

Amelie-V commented 1 year ago

Decision Notes

Record list

Module to set rules

issam9 commented 1 year ago

This is a great feature. I would like to ask how conflicts between weak labels will be resolved in this case. If we have one rule that labels Par* as ANIMAL and another that labels Paris as LOCATION, which one will be applied to this record: Paris is the city of light? I suggest something like a majority vote, picking randomly if the votes are equal.
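The majority-vote suggestion could look something like the sketch below. It is an illustration under stated assumptions, not a proposed implementation: weak labels are represented as hypothetical (start, end, label) tuples, and only rules that tag the exact same span are treated as conflicting (partially overlapping spans are not handled).

```python
import random
from collections import Counter, defaultdict


def resolve_conflicts(weak_labels, rng=None):
    """Pick one label per span: majority vote, random choice on ties."""
    rng = rng or random.Random()
    votes = defaultdict(Counter)
    for start, end, label in weak_labels:
        votes[(start, end)][label] += 1

    resolved = {}
    for span, counter in votes.items():
        top = max(counter.values())
        winners = sorted(label for label, n in counter.items() if n == top)
        # A single winner is taken directly; ties are broken at random.
        resolved[span] = winners[0] if len(winners) == 1 else rng.choice(winners)
    return resolved


# Two rules vote LOCATION and one votes ANIMAL for the span covering "Paris".
weak_labels = [(0, 5, "LOCATION"), (0, 5, "ANIMAL"), (0, 5, "LOCATION")]
print(resolve_conflicts(weak_labels))
# → {(0, 5): 'LOCATION'}
```

With exactly one ANIMAL vote and one LOCATION vote, as in the example in the comment, the result depends on the random tie-break; passing a seeded random.Random makes it reproducible.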

A question also arises when validating records: will we be able to validate both predictions and weak labels at the same time (their union), or will we have to choose between validating predictions and validating weak labels?

davidberenstein1957 commented 1 year ago

@issam9 Thank you for the feedback! These are two issues that we hadn't thought about yet.

We will still need to define a roadmap and priorities, but all ideas and input will help. What is your view on the usage of multi-token matching? Similarly, do you feel something like POS matching could help?

@Amelie-V did you have any specific questions? Maybe we could share a mock-up of the UI to get some input as well?

davidberenstein1957 commented 1 year ago

Also, think about including docs and reference TextClassification usecases too #1986

cceyda commented 1 year ago

I would also like to propose a semi-automatic tagging approach from the UI. Kind of like a "Bulk Mode" for token tagging per token.

Annotator searches -> the matching records are shown (tokens already highlighted in red) -> picks a tag to apply -> selects which records to apply it to -> bulk applies. Think of it like the "find and replace next" flow. The queries used in the process might also be saved somewhere.

davidberenstein1957 commented 1 year ago

Hi @cceyda, great suggestion. So the idea is to directly apply weak labels as annotations? I have made something like that as a background plugin, which might be helpful for your use case. We are still fine-tuning this plugin effort, but let me know what you think.

https://github.com/argilla-io/argilla-plugins/blob/main/argilla_plugins/programmatic_labelling/token_copycat.py

cceyda commented 1 year ago

@davidberenstein1957 Since there can be wrong matches, I want the annotator to be able to inspect the weak labels (the search filter results) from the UI and select which ones are safe to apply as annotations. So instead of doing "find" -> "replace (annotate) all", you do "find" -> "annotate | skip" and move on to the next record.
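The "find" -> "annotate | skip" flow described above can be sketched as a simple review loop. Everything here is hypothetical and for illustration only: review_matches, the decide callback (which stands in for the annotator's choice in the UI), and the match dictionaries are not part of Argilla.

```python
def review_matches(matches, decide):
    """Walk through matches one by one; keep only those the annotator accepts."""
    accepted, skipped = [], []
    for match in matches:
        # `decide` returns "annotate" or "skip" for each proposed weak label.
        if decide(match) == "annotate":
            accepted.append(match)
        else:
            skipped.append(match)
    return accepted, skipped


# Matches for the query Par* with the proposed label PLACE.
matches = [
    {"record_id": 1, "token": "Paris", "label": "PLACE"},
    {"record_id": 2, "token": "Parrot", "label": "PLACE"},
]

# Stand-in for the annotator: accept "Paris", skip everything else.
accepted, skipped = review_matches(
    matches, lambda m: "annotate" if m["token"] == "Paris" else "skip"
)
print([m["token"] for m in accepted], [m["token"] for m in skipped])
# → ['Paris'] ['Parrot']
```

Only the accepted matches would then be written back as annotations, which keeps wrong matches out of the dataset while still saving the annotator from tagging each token by hand.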