HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

globalOffsets is wrong for HTML Entity Recognition #4843

Open sinchir0 opened 11 months ago

sinchir0 commented 11 months ago

Describe the bug When labeling an HTML file using the HTML Entity Recognition template, the globalOffsets produced when exporting the result appear to be wrong.

To Reproduce

  1. Take the following HTML and save it to a .html file
<!DOCTYPE html>
<html lang="ja">
<head>
  <meta charset="UTF-8">
  <title>サンプル</title>
</head>
<body>

  <h1>こんにちは!</h1>
  <p>これはサンプルです。</p>

</body>
</html>
  2. Import it into an HTML Entity Recognition project

  3. Label some entities

  4. Export the data in the "default" JSON format

  5. Run the following Python code using the exported file

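import json

# Read the original HTML as UTF-8, matching its <meta charset="UTF-8">.
with open("sample.html", encoding="utf-8") as f:
    sample = f.read()

# Load the exported JSON file.
with open("exported_file.json", encoding="utf-8") as f:
    data = json.load(f)

# Offsets of the first labeled region in the first annotation of the first task.
global_offsets = data[0]["annotations"][0]["result"][0]["value"]["globalOffsets"]
globalOffsets_start = global_offsets["start"]
globalOffsets_end = global_offsets["end"]

# Slice the raw HTML with the exported offsets.
print(sample[globalOffsets_start:globalOffsets_end])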
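The same check can be applied to every labeled region rather than only the first one. The sketch below is illustrative only: the `export` literal models just the nesting that the script above relies on, its offsets (6, 12) are inferred from the reported `YPE ht` output rather than copied from a real export file, and `iter_global_offsets` is a hypothetical helper name.

```python
from typing import Iterator

# Stand-in for the parsed export; only the keys used by the script above are modeled.
# The offsets are inferred from the reported output, not taken from a real export.
export = [
    {"annotations": [{"result": [{"value": {"globalOffsets": {"start": 6, "end": 12}}}]}]}
]

# Beginning of the sample file, enough to reproduce the reported slice.
sample = '<!DOCTYPE html>\n<html lang="ja">\n'


def iter_global_offsets(tasks: list) -> Iterator[tuple[int, int]]:
    """Yield (start, end) for every result that carries globalOffsets."""
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                offsets = result.get("value", {}).get("globalOffsets")
                if offsets is not None:
                    yield offsets["start"], offsets["end"]


# Print each offset pair next to the raw-HTML slice it selects.
for start, end in iter_global_offsets(export):
    print(start, end, repr(sample[start:end]))
```

Using `.get()` with defaults keeps the loop from failing on tasks that have no annotations or on results without a `globalOffsets` entry.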

Expected behavior What I'd expect to see:

こんにちは

What I actually get:

YPE ht
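To quantify the mismatch, the position of the expected text in the raw file can be located directly and compared with the exported offsets. This is a minimal sketch: the HTML is inlined (assuming the file matches the listing above exactly, with no BOM or extra leading whitespace), and `find_span` is a hypothetical helper, not part of Label Studio.

```python
# The sample HTML from the reproduction steps, inlined so the snippet is self-contained.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="ja">
<head>
  <meta charset="UTF-8">
  <title>サンプル</title>
</head>
<body>

  <h1>こんにちは!</h1>
  <p>これはサンプルです。</p>

</body>
</html>
"""


def find_span(source: str, text: str) -> tuple[int, int]:
    """Return (start, end) character offsets of the first occurrence of text."""
    start = source.find(text)
    if start == -1:
        raise ValueError(f"{text!r} not found in source")
    return start, start + len(text)


# Offsets that would select the expected text when slicing the raw HTML.
start, end = find_span(SAMPLE_HTML, "こんにちは")
print(start, end, SAMPLE_HTML[start:end])
```

If the exported `globalOffsets` differ from these values, they are not character offsets into the raw HTML source, which is consistent with the `YPE ht` slice reported above.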


Environment (please complete the following information):

Additional context Minifying the HTML before slicing did not change the result.

import htmlmin

minified_sample = htmlmin.minify(sample, remove_empty_space=True)

print(minified_sample[globalOffsets_start: globalOffsets_end])


reference issue: https://github.com/HumanSignal/label-studio/issues/2777

hogepodge commented 11 months ago

@sinchir0 Thanks for your bug report. I'm able to confirm this behavior and will file an issue with the engineering team.

yanwenjie1 commented 11 months ago

same issue +1