Open aimlnerd opened 1 year ago
Thanks @deepak-george. This is a fairly deep issue with how Label Studio manages text, including the handling of escaped characters, whitespace, and breaks. I'm bringing it up with the engineering team to explore options for moving forward (either improved documentation for how text is handled, and expectations for post-processing of annotated text; or improvements to text handling to remove that element of surprise of start/end values not matching inputted text). Hopefully I'll have something to report back relatively soon.
Is there an older version of Label Studio without this bug, so that I can continue the POC?
Users would expect Label Studio to process text the same way Python does, including special characters, non-Latin alphabets, emojis, etc. When we import a JSON file produced by a Python process and Label Studio later exports the same text, the length of the text in the output should stay the same. Label Studio should not add extra characters or change existing ones.
I think this would make Label Studio more stable. Is that possible?
Yes, I agree with that sentiment. We're discussing the best path forward right now. We don't have an immediate fix, but I'm hoping we'll have something soon.
I did some research regarding the spans and can replicate span issues with these characters.
\/ becomes /
and umlauts, e.g. ä becomes \u00e4
(I wrote everything as readable ASCII characters here, so don't copy and paste to replicate something)
With the input text (left), the start and end positions don't match the text. After running the text once through json.load() and json.dump(), it uses the characters on the right, and the start and end indexes exported from Label Studio as JSON match that round-tripped text.
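The round trip described above can be reproduced with Python's standard json module alone. This is only a sketch under assumptions: the sample string is made up, and it shows how the same word sits at different character offsets in the raw JSON, the decoded text, and the re-serialized JSON. It does not claim to show which of these Label Studio uses internally.

```python
import json

# Hypothetical raw JSON as it might sit on disk: it contains an
# escaped slash (\/) and a literal umlaut.
raw = '{"text": "see a\\/b in März: boutderh"}'

# Parsing turns \/ into / and keeps the umlaut as one character.
decoded = json.loads(raw)["text"]

# json.dumps() defaults to ensure_ascii=True, so ä is written as \u00e4.
round_tripped = json.dumps({"text": decoded})


def span_of(haystack, needle):
    """Return (start, end) character offsets of needle in haystack."""
    start = haystack.index(needle)
    return start, start + len(needle)


raw_span = span_of(raw, "boutderh")
decoded_span = span_of(decoded, "boutderh")
round_tripped_span = span_of(round_tripped, "boutderh")

print(raw)
print(round_tripped)
print(raw_span, decoded_span, round_tripped_span)
```

Offsets computed against one of these representations will not slice correctly out of another, which is consistent with the mismatch reported in this thread. Passing ensure_ascii=False to json.dumps() keeps ä as a single character in the output.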
Any progress on this? I'm having similar issues.
Describe the bug
There are two related bugs.
To Reproduce
Steps to reproduce the behavior:
BUG 1.
Use the labeling setup "Natural language processing" & "Named entity recognition". Used template
When I try to recover the annotated text from the start and end offsets, I get the incorrect 'boutde' instead of 'boutderh':
text[36:42]
'boutde'
Also, for the annotation with id "aSt8Gf72Uo", the text is empty instead of "m: xeba".
BUG 2.
After annotating "boutderh" in the text, Label Studio correctly shows the annotated text as "boutderh". The moment I click the Update button, the text value is no longer shown. It looks like it is not saved in the database.
See below: before hitting the Update button, and after hitting the Update button.
Expected behavior
Bug 1: start and end should be 36, 44 instead of 36, 42 in the JSON.
text[36:42]
'boutde'
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
It would be great to resolve this bug so I can continue with the POC.
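To check an export for the Bug 1 symptom, each span can be sliced out of the task text and compared with the text stored on the annotation. The sketch below is not Label Studio code. It assumes each result entry carries "id" and a "value" object with "start", "end", and "text", which is how the JSON export is commonly laid out, and the sample text and entries are hypothetical. check_spans is a helper name introduced here for illustration.

```python
def check_spans(text, results):
    """Return (id, labeled_text, sliced_text) for every span whose
    start/end offsets do not reproduce the labeled text."""
    mismatches = []
    for result in results:
        value = result["value"]
        sliced = text[value["start"]:value["end"]]
        if sliced != value.get("text"):
            mismatches.append((result.get("id"), value.get("text"), sliced))
    return mismatches


# Hypothetical task text: 36 filler characters, then the annotated word.
text = "x" * 36 + "boutderh and more"

# Hypothetical export entries: one with the expected offsets (36, 44),
# one with the truncated offsets reported in this issue (36, 42).
results = [
    {"id": "ok", "value": {"start": 36, "end": 44, "text": "boutderh"}},
    {"id": "short", "value": {"start": 36, "end": 42, "text": "boutderh"}},
]

mismatches = check_spans(text, results)
for span_id, labeled, sliced in mismatches:
    print(f"{span_id}: labeled {labeled!r} but offsets give {sliced!r}")
```

An entry whose "text" is missing or empty, as with the second annotation described in Bug 1, is also reported, because the slice will not equal the stored value.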