HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Incorrect start and end values after exporting NER annotation from label studio #4929

Open aimlnerd opened 11 months ago

aimlnerd commented 11 months ago

Describe the bug: There are two related bugs.

  1. Incorrect start and end values after exporting NER annotations from Label Studio.
  2. Annotated text disappears in the UI after clicking the update button.

To Reproduce: Steps to reproduce the behavior.

BUG 1.

  1. Use the labeling setup "Natural language processing" and "Named entity recognition" with this template:

    <View>
    <Labels name="label" toName="text">
    
    <Label value="BROKER" background="red"/>
    </Labels>
    
    <Text name="text" value="$text"/>
    </View>
  2. Import the JSON file below:
[
    {
        "data": {
            "text": " Internal\n \n \n \n \n From: xebastixxn boutderh <xebastixxn.boutderh@raetsheren.ie>"
        },
        "predictions": [
            {
                "model_version": "1",
                "result": [
                    {
                        "id": "0",
                        "from_name": "label",
                        "to_name": "text",
                        "type": "labels",
                        "value": {
                            "start": 66,
                            "end": 76,
                            "text": "raetsheren",
                            "labels": [
                                "BROKER"
                            ]
                        }
                    }
                ]
            }
        ]
    }
]
  3. Annotate "boutderh" and "m: xeba", then click the update button.
  4. Export the annotation. I get the JSON below as the export:
    [
    {
        "id": 2,
        "annotations": [
            {
                "id": 10,
                "completed_by": 2,
                "result": [
                    {
                        "value": {
                            "start": 66,
                            "end": 76,
                            "text": "raetsheren",
                            "labels": [
                                "BROKER"
                            ]
                        },
                        "id": "dkC21nnb4j",
                        "from_name": "label",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    },
                    {
                        "value": {
                            "start": 22,
                            "end": 29,
                            "text": "",
                            "labels": [
                                "BROKER"
                            ]
                        },
                        "id": "aSt8Gf72Uo",
                        "from_name": "label",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    },
                    {
                        "value": {
                            "start": 36,
                            "end": 44,
                            "text": "boutderh",
                            "labels": [
                                "BROKER"
                            ]
                        },
                        "id": "fCvW71VqNq",
                        "from_name": "label",
                        "to_name": "text",
                        "type": "labels",
                        "origin": "manual"
                    }
                ],
                "was_cancelled": false,
                "ground_truth": false,
                "created_at": "2023-10-19T12:32:17.421365Z",
                "updated_at": "2023-10-19T12:40:20.229661Z",
                "draft_created_at": null,
                "lead_time": 299.172,
                "prediction": {
                    "id": 7,
                    "model_version": "1",
                    "created_ago": "0 minutes",
                    "result": [
                        {
                            "id": "0",
                            "from_name": "label",
                            "to_name": "text",
                            "type": "labels",
                            "value": {
                                "start": 66,
                                "end": 76,
                                "text": "raetsheren",
                                "labels": [
                                    "BROKER"
                                ]
                            }
                        }
                    ],
                    "score": null,
                    "cluster": null,
                    "neighbors": null,
                    "mislabeling": 0.0,
                    "created_at": "2023-10-19T12:31:58.596681Z",
                    "updated_at": "2023-10-19T12:31:58.596681Z",
                    "task": 2
                },
                "result_count": 0,
                "unique_id": "4e212892-694e-4bef-94a0-b777f2344de2",
                "import_id": null,
                "last_action": null,
                "task": 2,
                "project": 2,
                "updated_by": 2,
                "parent_prediction": 7,
                "parent_annotation": null,
                "last_created_by": null
            }
        ],
        "file_upload": "edaa6c9d-641b1e0b94a5239ce5b8b740_deemoji.json",
        "drafts": [],
        "predictions": [
            7
        ],
        "data": {
            "text": " Internal\n \n \n \n \n From: xebastixxn boutderh <xebastixxn.boutderh@raetsheren.ie>"
        },
        "meta": {},
        "created_at": "2023-10-19T12:31:58.592674Z",
        "updated_at": "2023-10-19T12:40:20.296930Z",
        "inner_id": 2,
        "total_annotations": 1,
        "cancelled_annotations": 0,
        "total_predictions": 1,
        "comment_count": 0,
        "unresolved_comment_count": 0,
        "last_comment_updated_at": null,
        "project": 2,
        "updated_by": 2,
        "comment_authors": []
    }
    ]

    When I slice the text with the exported start and end, I get the incorrect 'boutde' instead of 'boutderh':

text[36:42] 'boutde'

Also, for the annotation with id "aSt8Gf72Uo", the exported text is empty instead of "m: xeba".

BUG 2. After annotating "boutderh" in the text, Label Studio correctly shows the annotated text as "boutderh", but the moment I click the update button the text value is no longer shown. Perhaps it is not saved in the database?

Before hitting the update button: [image]

After hitting the update button: [image]

Expected behavior: For Bug 1, start and end should be 36 and 44 instead of 36 and 42 in the JSON, since text[36:42] gives 'boutde'.
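A quick way to check an export for this kind of problem is to slice each task's text with the exported start and end and compare the result with the exported "text" field. The sketch below assumes the task dictionary has the same shape as the export above and that offsets are meant to be Python string indexes; the helper name `find_span_mismatches` is hypothetical and not part of Label Studio.

```python
def find_span_mismatches(task):
    """Return (region_id, exported_text, sliced_text) for every span whose
    exported text differs from text[start:end]."""
    text = task["data"]["text"]
    mismatches = []
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            value = region["value"]
            # Python slices by Unicode code point, end-exclusive.
            sliced = text[value["start"]:value["end"]]
            if sliced != value.get("text"):
                mismatches.append((region["id"], value.get("text"), sliced))
    return mismatches


# Trimmed-down task in the shape of the export above (two of the regions).
task = {
    "data": {
        "text": " Internal\n \n \n \n \n From: xebastixxn boutderh "
                "<xebastixxn.boutderh@raetsheren.ie>"
    },
    "annotations": [
        {
            "result": [
                {"id": "aSt8Gf72Uo",
                 "value": {"start": 22, "end": 29, "text": "", "labels": ["BROKER"]}},
                {"id": "fCvW71VqNq",
                 "value": {"start": 36, "end": 44, "text": "boutderh", "labels": ["BROKER"]}},
            ]
        }
    ],
}

for region_id, exported, sliced in find_span_mismatches(task):
    print(f"{region_id}: exported text {exported!r}, slice gives {sliced!r}")
```

For a full export file, the same function can be applied to each element of the list returned by `json.load()`.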

Additional context: It would be great to resolve this bug so that I can continue with the POC.

hogepodge commented 11 months ago

Thanks @deepak-george. This is a fairly deep issue with how Label Studio manages text, including the handling of escaped characters, whitespace, and breaks. I'm bringing it up with the engineering team to explore options for moving forward (either improved documentation for how text is handled, and expectations for post-processing of annotated text; or improvements to text handling to remove that element of surprise of start/end values not matching inputted text). Hopefully I'll have something to report back relatively soon.

aimlnerd commented 11 months ago

Is there an older version of Label Studio without this bug, so that I can continue the POC?

aimlnerd commented 11 months ago

Users would expect Label Studio to process text the same way Python does, including special characters, non-ASCII alphabets, emojis, and so on. When we import a JSON file produced by a Python process and the same text is later exported by Label Studio, the length of the text in the output should remain the same. Label Studio should not add extra characters or change any characters.

I think this would make Label Studio more stable. Is that possible?
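To make "the same way Python processes text" concrete: Python indexes strings by Unicode code point, while JavaScript strings are indexed by UTF-16 code unit, so a character such as an emoji counts as one position in Python and two in JavaScript. Whether this is what causes the offsets in this issue is an assumption and is not confirmed in this thread; the sketch below only illustrates how the two counting schemes can drift apart. The helper names are hypothetical.

```python
def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units (the unit JavaScript uses for string indexes)."""
    return len(s.encode("utf-16-le")) // 2


def codepoint_to_utf16_offset(text: str, index: int) -> int:
    """Convert a Python (code point) index into a UTF-16 code unit index."""
    return utf16_len(text[:index])


sample = "hi 😀 boutderh"
start = sample.index("boutderh")  # Python index, counted in code points
end = start + len("boutderh")

print("code point offsets:", start, end)
print("UTF-16 offsets:    ",
      codepoint_to_utf16_offset(sample, start),
      codepoint_to_utf16_offset(sample, end))
```

Text containing only characters from the Basic Multilingual Plane (for example ä) gives identical offsets under both schemes, so a difference like this would only show up after characters such as emojis.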

hogepodge commented 11 months ago

Yes, I agree with that sentiment. We're discussing the best path forward right now. We don't have an immediate fix, but I'm hoping we'll have something soon.

MattHag commented 9 months ago

I did some research regarding the spans and can replicate span issues with these characters.

\/ becomes /, and umlauts such as ä become \u00e4 (I wrote everything as readable ASCII characters here, so don't copy and paste to replicate something).

With the input text (the left-hand forms), the start and end positions don't match the text. After a round trip through json.load() and json.dump(), the text uses the right-hand forms, and the start and end indexes exported from Label Studio as JSON match that round-tripped text.
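The round trip described above can be reproduced with Python's standard `json` module. This is a minimal sketch with made-up sample text, not data from the issue: it shows that the raw file contents and the decoded string have different lengths when the file contains an escaped slash, and that `json.dumps()` with its default `ensure_ascii=True` writes ä back as \u00e4.

```python
import json

# Raw JSON as it might sit in a file: an escaped slash (\/) and a literal umlaut.
raw = '{"text": "a\\/b ä"}'

# Reading decodes the escape, so \/ (two characters in the file) becomes / (one).
decoded = json.loads(raw)["text"]

# Writing with the default ensure_ascii=True escapes non-ASCII characters.
rewritten = json.dumps({"text": decoded})

raw_value = "a\\/b ä"  # the string value exactly as it appears in the raw file
print("raw file value: ", raw_value, "-> index of 'b':", raw_value.index("b"))
print("decoded value:  ", decoded, "-> index of 'b':", decoded.index("b"))
print("rewritten JSON: ", rewritten)
```

Offsets computed on the undecoded file contents are therefore shifted relative to offsets computed on the decoded string. Passing `ensure_ascii=False` to `json.dumps()` keeps ä as a literal character in the output.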

lior-airis commented 10 hours ago

Any progress on this? I'm having similar issues.