HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Named Entity Recognition - Incorrect spans shown after labelling #4988

Open pdhall99 opened 9 months ago

pdhall99 commented 9 months ago

Describe the bug In a Named Entity Recognition project the incorrect span is shown after labelling in some cases.

To Reproduce Steps to reproduce the behavior:

  1. Create a "Named Entity Recognition" project and import the following as a .txt ("Treat as list of tasks"):
    👨🏻‍🚒 firemen drive firetrucks at work
  2. Click "Label All Tasks"
  3. Select "firetrucks" to be labelled
  4. Note "ve firetru" is selected as the label and the end of the text is cut off (see screenshot), but "firetrucks" is correctly marked in the exported JSON.

Expected behavior The selected word "firetrucks" is highlighted as the labelled span.

Screenshots

Screenshot 2023-10-31 at 05 55 32 Screenshot 2023-10-31 at 05 55 46

Environment (please complete the following information):

Additional context I assume this is related to the label indices (start and end) being interpreted either as positions in a sequence of 16-bit Unicode (UTF-16) code units, as they are in TypeScript/JavaScript, or as positions in a sequence of Unicode code points, as they are in Python.

Take text = "👨🏻‍🚒 firemen drive firetrucks at work" as an example. Suppose we label the word "firetrucks":

See this Better Programming article for further explanation.

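I note that the correct code point span for "firetrucks" is (19, 29). If those same numbers are read as a code unit span, (19, 29) corresponds to the code point span (16, 26), for which text[16:26] == "ve firetru", as is displayed. The shift of 3 comes from the emoji, which is 4 code points but 7 UTF-16 code units.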
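To make the arithmetic above easy to check, here is a minimal Python sketch. It only illustrates the suspected mismatch and is not taken from the Label Studio code base. The helper names `utf16_len`, `code_point_to_utf16_offset` and `utf16_slice` are hypothetical, and `utf16_slice` is only a rough stand-in for JavaScript's `slice`.

```python
# The emoji is written with escapes so that each code point is visible:
# U+1F468 (man), U+1F3FB (skin tone), U+200D (zero-width joiner), U+1F692 (fire engine).
EMOJI = "\U0001F468\U0001F3FB\u200D\U0001F692"
text = EMOJI + " firemen drive firetrucks at work"


def utf16_len(s: str) -> int:
    """Number of UTF-16 code units in s (what JavaScript's .length reports)."""
    return len(s.encode("utf-16-le")) // 2


def code_point_to_utf16_offset(s: str, offset: int) -> int:
    """Convert a code point offset into s to the matching UTF-16 code unit offset."""
    return utf16_len(s[:offset])


def utf16_slice(s: str, start: int, end: int) -> str:
    """Slice s by UTF-16 code unit offsets (hypothetical helper).

    Decoding fails if an offset falls inside a surrogate pair.
    """
    units = s.encode("utf-16-le")
    return units[2 * start : 2 * end].decode("utf-16-le")


cp_span = (19, 29)  # code point span of "firetrucks", as Python sees it
cu_span = tuple(code_point_to_utf16_offset(text, i) for i in cp_span)

print(len(EMOJI), utf16_len(EMOJI))  # code points vs. code units in the emoji
print(text[cp_span[0]:cp_span[1]])   # slice by code points
print(cu_span)                       # the same span in code units
print(utf16_slice(text, *cu_span))   # slice by the converted code unit span
print(utf16_slice(text, *cp_span))   # code point span misread as code units
```

Under these assumptions the emoji counts as 4 code points and 7 code units, the code unit span for "firetrucks" comes out as (22, 32), and the last line yields "ve firetru", which matches what is highlighted in the UI. Converting offsets at the boundary between the Python side and the browser side, along the lines of `code_point_to_utf16_offset`, would be one possible way to keep the two in sync.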

Possibly related issues:

jombooth commented 9 months ago

Hi @pdhall99 - this bug may be related to the # of bytes in the emoji. We'll reproduce it on our side + file a ticket to get this addressed. Thank you for the detailed report!

mlumingu-ugent commented 1 month ago

Some extra info:

Screenshot from 2024-06-25 10-43-53

After selecting the first hand: Screenshot from 2024-06-25 10-44-01

After selecting the 1: Screenshot from 2024-06-25 10-44-14

After selecting the 1 and the hand: Screenshot from 2024-06-25 10-44-28

I thought this was related and probably will be fixed together, but let me know and I will file a separate bug!