HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

NER Model Predictions Not Visible in Label Studio UI #5325

Closed fernando080 closed 8 months ago

fernando080 commented 9 months ago

Environment:

Label Studio Version: 1.10.1
Operating System: Debian (Docker base image for Python 3.10)
Browser: Brave

Description

I am developing a system for annotating text extracted from document images to train a Named Entity Recognition (NER) model. The workflow includes the following steps:

  1. Extracting text from images using Tesseract.
  2. Converting the extracted TSV files to HTML, using the positioning fields to mimic the layout of the original documents. This aids annotators in their work (a rough sketch of this step is shown after this list).
  3. The system is set up to display model predictions in the appropriate UI field.
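
For context, here is a rough sketch of how step 2 can look (not my production code; it assumes the standard Tesseract TSV columns such as left, top and text, and uses absolutely positioned <span> elements to mimic the page layout):

import csv
from html import escape

def tsv_to_html(tsv_path: str) -> str:
    """Convert a Tesseract TSV file (output of `tesseract image.png out tsv`) into
    HTML with absolutely positioned <span> elements mimicking the page layout."""
    spans = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            word = (row.get("text") or "").strip()
            if not word:
                continue  # skip page/block/paragraph/line rows that carry no text
            spans.append(
                f'<span style="position:absolute;'
                f'left:{row["left"]}px;top:{row["top"]}px;">{escape(word)}</span>'
            )
    return '<div style="position:relative;">' + "\n".join(spans) + "</div>"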

Issue: Although the system functions correctly and the model generates predictions, these predictions are not visible in the Label Studio User Interface (UI).

To Reproduce

Steps to reproduce the behavior: I defined my own prediction method and commented out the fit method (I don't know whether that is good practice). Below is the code for the prediction function, along with some images of the HTML and the prediction results. This code is still under development, so forgive me if it doesn't follow all coding best practices:

def predict(self, tasks: List[Dict], context: Optional[Dict] = None, **kwargs):
    # We use deed_extract_python.toolkits.xlnet_toolkit.XLNetToolkit to predict the
    # labels in the text of the task, calling extract_entities_from_page() to get
    # the labels of each task.
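    # Note: model_manager, config and ids_to_labels are defined elsewhere in this
    # project; XLNetTokenizerFast is imported from the transformers library.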

    model = model_manager[config["ner_toolkit"]](config["model"])._model
    tokenizer = XLNetTokenizerFast.from_pretrained('xlnet-base-cased', do_lower_case=False)

    print(f'''\
    Run prediction on {tasks}
    Received context: {context}
    Project ID: {self.project_id}
    Label config: {self.label_config}
    Parsed JSON Label config: {self.parsed_label_config}''')

    predictions = []
    results = []

    for task in tasks:
        task_id = task['id']
        html_url = task['data']['text']
        html_content = self.load_html_from_gs(html_url)
        text = self.extract_text_from_html(html_content)

        # Here we make the predictions
        raw_text, pred_labels, pred_scores, entities = self.evaluate_one_text(text, ids_to_labels, tokenizer, model)

        print("Results from prediction:")
        print(f"Raw text: {raw_text}")
        print("\n")
        print(f"Predicted labels: {pred_labels}")
        print("\n")
        print(f"Predicted scores: {pred_scores}")
        print("\n")
        print(f"Entities: {entities}")
        print("\n")

        results.append({
            'result': self.map_predictions_to_html(entities, html_content), # Here we build the predictions JSON; check it below
            'score': float(np.mean(pred_scores)),
            'cluster': None
        })

    return results

# def fit(self, event, data, **kwargs):
#     """
#     This method is called each time an annotation is created or updated
#     You can run your logic here to update the model and persist it to the cache
#     It is not recommended to perform long-running operations here, as it will block the main thread
#     Instead, consider running a separate process or a thread (like RQ worker) to perform the training
#     :param event: event type can be ('ANNOTATION_CREATED', 'ANNOTATION_UPDATED')
#     :param data: the payload received from the event (check [Webhook event reference](https://labelstud.io/guide/webhook_reference.html))
#     """

#     # use cache to retrieve the data from the previous fit() runs
#     old_data = self.get('my_data')
#     old_model_version = self.get('model_version')
#     print(f'Old data: {old_data}')
#     print(f'Old model version: {old_model_version}')

#     # store new data to the cache
#     self.set('my_data', 'my_new_data_value')
#     self.set('model_version', 'my_new_model_version')
#     print(f'New data: {self.get("my_data")}')
#     print(f'New model version: {self.get("model_version")}')

#     print('fit() completed successfully.')

Prediction JSON:

results.append({
                        'from_name': 'label', # Assume this is a config name from Label Studio
                        'to_name': 'text', # Assume this is a config name from Label Studio
                        'type': 'labels',
                        'value': {
                            'labels': [label],
                            'start': self.transform_xpath(start_xpath),
                            'end': self.transform_xpath(end_xpath),
                        }
                    })

The XPaths correspond to the location of each word in the HTML; check out test.txt below.

Print results:

ml-backend | Results from prediction:
ml-backend | Raw text: MINUTES  OF  INITIAL  MEETING  OF  TRUSTEES  OF  KOCH  TRUST  ESTABLISHMENT  OF  TRUST  It  was  noted  that  Nitish  Kocchar  and  Latisha  Kocchar  had  accepted  the  trusteeship  of  a  Trust  to  be  known  as  the  Koch  Trust  and  signed  the  deed  establishing  the  Trust.
ml-backend | Predicted labels: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-TRUST_NAME', 'I-TRUST_NAME', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-TRUSTEE', 'I-TRUSTEE', 'O', 'B-TRUSTEE', 'I-TRUSTEE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-TRUST_NAME', 'I-TRUST_NAME', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
ml-backend | Predicted scores: [0.999915599822998, 0.9999840259552002, 0.9999932050704956, 0.9999850988388062, 0.9999867677688599, 0.9999879598617554, 0.9999464750289917, 0.998420238494873, 0.9999555349349976, 0.9999643564224243, 0.999983549118042, 0.9999808073043823, 0.9999972581863403, 0.999998927116394, 0.9999990463256836, 0.9999992847442627, 0.9673433303833008, 0.9404548406600952, 0.9999908208847046, 0.9350007772445679, 0.7964836955070496, 0.9999990463256836, 0.9999979734420776, 0.9999967813491821, 0.9999977350234985, 0.9999927282333374, 0.999994158744812, 0.9999958276748657, 0.9999971389770508, 0.999996542930603, 0.9999955892562866, 0.9999860525131226, 0.9987388253211975, 0.9764958620071411, 0.9928774237632751, 0.9999988079071045, 0.9999983310699463, 0.9999979734420776, 0.9999985694885254, 0.9999966621398926, 0.9999934434890747, 0.9999920129776001]
ml-backend | Entities: {'TRUST_NAME': [{'text': [], 'start': 7, 'end': 8}, {'text': [], 'start': 33, 'end': 34}], 'TRUSTEE': [{'text': [], 'start': 16, 'end': 17}, {'text': [], 'start': 19, 'end': 20}]}

Debug ml-backend response to Label Studio

ml-backend | [2024-01-22 17:26:46,069] [DEBUG] [label_studio_ml.api::log_response_info::143] Response body: b'{"results":[{"cluster":null,"result":[{"from_name":"label","to_name":"text","type":"labels","value":{"end":"/div[1]/div[2]/span[2]/text()[1]","labels":["TRUST_NAME"],"start":"/div[1]/div[2]/span[1]/text()[1]"}},{"from_name":"label","to_name":"text","type":"labels","value":{"end":"/div[1]/div[4]/span[23]/text()[1]","labels":["TRUST_NAME"],"start":"/div[1]/div[4]/span[22]/text()[1]"}},{"from_name":"label","to_name":"text","type":"labels","value":{"end":"/div[1]/div[4]/span[6]/text()[1]","labels":["TRUSTEE"],"start":"/div[1]/div[4]/span[5]/text()[1]"}},{"from_name":"label","to_name":"text","type":"labels","value":{"end":"/div[1]/div[4]/span[9]/text()[1]","labels":["TRUSTEE"],"start":"/div[1]/div[4]/span[8]/text()[1]"}}],"score":0.9906049782321567}]}\n'

Expected behavior

The predictions made by the NER model should be clearly visible and highlighted in the Label Studio UI.

Screenshots

Input text, Predictions, Model Settings (attached as images)

I've attached a text file that contains the HTML content; just change the extension from .txt to .html to get the original: test.txt

makseq commented 9 months ago

Could you show your labeling config?

fernando080 commented 9 months ago

Of course, I'll attach images with the code and the visual interface: image

image

makseq commented 9 months ago

Also, let's check for errors in the browser console. What do you see there?

fernando080 commented 9 months ago

OK, it was a bit hard for me to find, but I think it's the following image: image

makseq commented 9 months ago

Can you try to do this:

  1. create an annotation manually in LS and submit it
  2. get the task code with this annotation (in the data manager, </> button)
  3. get the result field from the annotation and copy it as is into your ML backend code where you generate predictions. Just rewrite your output with this annotation.result field.
  4. run your ML backend and check whether this prediction works
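
For example, something like this as a rough sketch (assuming your model class extends label_studio_ml.model.LabelStudioMLBase; paste your own annotation.result in place of the placeholder):

from typing import Dict, List, Optional

from label_studio_ml.model import LabelStudioMLBase

# Paste here the "result" list copied from your manual annotation (data manager, </> button).
HARDCODED_RESULT: List[Dict] = [
    # ... annotation.result items ...
]

class DebugModel(LabelStudioMLBase):
    def predict(self, tasks: List[Dict], context: Optional[Dict] = None, **kwargs):
        # Return the same hard-coded result for every task, just to check that
        # Label Studio can render it as a prediction.
        return [
            {"result": HARDCODED_RESULT, "score": 1.0, "model_version": "debug"}
            for _ in tasks
        ]
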
fernando080 commented 9 months ago

OK, thank you so much. With the following:

result = [
            {
                "id": "iLe_ghP23A",
                "type": "labels",
                "value": {
                    "end": "/div[1]/div[2]/span[2]/text()[1]",
                    "start": "/div[1]/div[2]/span[1]/text()[1]",
                    "labels": ["TRUST_NAME"],
                    "endOffset": 5,
                    "startOffset": 0,
                    "globalOffsets": {
                        "end": 665,
                        "start": 642
                    }
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "CUeF6I-wWV",
                "type": "labels",
                "value": {
                    "end": "/div[1]/div[4]/span[6]/text()[1]",
                    "start": "/div[1]/div[4]/span[5]/text()[1]",
                    "labels": ["TRUSTEE"],
                    "endOffset": 7,
                    "startOffset": 0,
                    "globalOffsets": {
                        "end": 874,
                        "start": 847
                    }
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "SW_Sztf2SD",
                "type": "labels",
                "value": {
                    "labels": ["TRUSTEE"]
                  },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "oc_Gi8irt1",
                "type": "labels",
                "value": {
                    "end": "/div[1]/div[4]/span[9]/text()[1]",
                    "start": "/div[1]/div[4]/span[8]/text()[1]",
                    "labels": ["TRUSTEE"],
                    "endOffset": 7,
                    "startOffset": 0,
                    "globalOffsets": {
                        "end": 933,
                        "start": 905
                    }
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "mmJZPe8wsc",
                "type": "labels",
                "value": {
                    "labels": ["TRUSTEE"]
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "Qkihaz1GYG",
                "type": "labels",
                "value": {
                    "labels": ["TRUSTEE"]
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "zvNtY18B-k",
                "type": "labels",
                "value": {
                    "labels": ["TRUSTEE"]
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            },
            {
                "id": "d1OrqsQQz2",
                "type": "labels",
                "value": {
                    "labels": ["TRUSTEE"]
                },
                "origin": "manual",
                "to_name": "text",
                "from_name": "label"
            }
        ]

        results1.append({
            'result': result,
            'score': 0.5,
            'cluster': None
        })

The ML backend is working with the results from the annotations, but what did I miss? An id, or defining the offset keys? Maybe the empty labels at the bottom?

makseq commented 9 months ago

Please provide your prediction.result, NOT as an image but as text.

fernando080 commented 9 months ago

OK, sorry @makseq, I was not clear enough. The JSON I sent you before is the content of prediction.result (I copied it from the annotation content, as you suggested). Below is the JSON again (but copied from the Label Studio UI):

"predictions": [
    {
      "id": 41,
      "model_version": "INITIAL",
      "created_ago": "2 hours, 22 minutes",
      "result": [
        {
          "id": "iLe_ghP23A",
          "type": "labels",
          "value": {
            "end": "/div[1]/div[2]/span[2]/text()[1]",
            "start": "/div[1]/div[2]/span[1]/text()[1]",
            "labels": [
              "TRUST_NAME"
            ],
            "endOffset": 5,
            "startOffset": 0,
            "globalOffsets": {
              "end": 665,
              "start": 642
            }
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "CUeF6I-wWV",
          "type": "labels",
          "value": {
            "end": "/div[1]/div[4]/span[6]/text()[1]",
            "start": "/div[1]/div[4]/span[5]/text()[1]",
            "labels": [
              "TRUSTEE"
            ],
            "endOffset": 7,
            "startOffset": 0,
            "globalOffsets": {
              "end": 874,
              "start": 847
            }
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "SW_Sztf2SD",
          "type": "labels",
          "value": {
            "labels": [
              "TRUSTEE"
            ]
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "oc_Gi8irt1",
          "type": "labels",
          "value": {
            "end": "/div[1]/div[4]/span[9]/text()[1]",
            "start": "/div[1]/div[4]/span[8]/text()[1]",
            "labels": [
              "TRUSTEE"
            ],
            "endOffset": 7,
            "startOffset": 0,
            "globalOffsets": {
              "end": 933,
              "start": 905
            }
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "mmJZPe8wsc",
          "type": "labels",
          "value": {
            "labels": [
              "TRUSTEE"
            ]
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "Qkihaz1GYG",
          "type": "labels",
          "value": {
            "labels": [
              "TRUSTEE"
            ]
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "zvNtY18B-k",
          "type": "labels",
          "value": {
            "labels": [
              "TRUSTEE"
            ]
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        },
        {
          "id": "d1OrqsQQz2",
          "type": "labels",
          "value": {
            "labels": [
              "TRUSTEE"
            ]
          },
          "origin": "manual",
          "to_name": "text",
          "from_name": "label"
        }
      ],
      "score": 0.5,
      "cluster": null,
      "neighbors": null,
      "mislabeling": 0,
      "created_at": "2024-01-23T18:48:10.216634Z",
      "updated_at": "2024-01-23T18:48:10.216663Z",
      "task": 159671,
      "project": 3
    }
  ]

Is this what you are asking me for?

You can see in the next image that what you suggested is working:

image

So my questions are: what did I miss? An id, or defining the offset keys? Maybe the empty labels at the bottom?

makseq commented 9 months ago

Sorry, I messed up.

What is the prediction.result from your ML backend, and what is the annotation.result? Can you post a message with these two JSONs?

Your last message contains very similar result JSONs, and I don't understand where each one comes from.

fernando080 commented 9 months ago

In my ML backend I made prediction.result = annotation.result, following the third bullet of your previous message:

Can you try to do this:

  1. create an annotation manually in LS and submit it
  2. get the task code with this annotation (in the data manager, </> button)
  3. get the result field from the annotation and copy it as is into your ML backend code where you generate predictions. Just rewrite your output with this annotation.result field.
  4. run your ML backend and check whether this prediction works

So the JSON you see in my last message is exactly the same for the prediction and the annotation.

makseq commented 9 months ago

But where is the JSON produced by your ML backend now? I need it in text format.

fernando080 commented 8 months ago

Hi @makseq, sorry for my late response; I was on holiday and then busy, so I had no time. The problem here was that you need to complete the offset keys (start and end) in the returned dictionary; that is what I was missing in my code. Thank you so much for your help: you gave me the idea of filling the predictions in the ML backend with the content of the annotations, and by removing keys one by one from the ML backend dictionary I noticed that the keys decisive for the predictions were the offset keys :)
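
For anyone who finds this later, this is roughly the shape of a single result item that renders correctly for me (a sketch based on my case; the XPaths and numbers are just illustrative, and "label"/"text" are the from_name/to_name from my labeling config):

# Sketch of one prediction result item for HTML NER. The offset keys were the
# missing piece in my code: startOffset/endOffset are offsets inside the text
# nodes pointed to by the start/end XPaths, and globalOffsets covers the whole
# document text. All values below are illustrative.
result_item = {
    "from_name": "label",
    "to_name": "text",
    "type": "labels",
    "value": {
        "start": "/div[1]/div[2]/span[1]/text()[1]",  # XPath of the first word
        "end": "/div[1]/div[2]/span[2]/text()[1]",    # XPath of the last word
        "startOffset": 0,
        "endOffset": 5,
        "globalOffsets": {"start": 642, "end": 665},
        "labels": ["TRUST_NAME"],
    },
}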

makseq commented 8 months ago

It seems you've got the solution!