HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Offsets in HTML annotation #6446

Open rggdmonk opened 1 month ago

rggdmonk commented 1 month ago

**Describe the bug**
Hi! I don't understand how I should use `HyperText` and `HyperTextLabels` for span annotation in .html files. It seems that only the `text` field works correctly.

1. The length of the text does not match the offset range:

    # for globalOffsets
    text length: 83
    globaloffsets: 27 - 106
    globaloffset 'length': 79

    # for startOffset and endOffset
    text length: 83
    offsets: 0 - 79
    offset 'length': 79

But according to this, it should be the same?
https://github.com/HumanSignal/label-studio/blob/develop/web/libs/editor/src/tags/object/RichText/domManager.md#content-field

2. In my case, annotations can overlap, so without reliable offsets it's impossible to merge them accurately.
3. According to #4843, it's impossible to "map" `globalOffsets` to the text.
4. There are several related issues: https://github.com/HumanSignal/label-studio/issues?q=globalOffsets
5. I also tried to use the xpath `start` and `end` values, with the same result: I can't get the same text back. Do you have an example of how to use them?

**To Reproduce**
```python
from __future__ import annotations

import json
import logging
import pathlib

def debug_annotation(json_path: str, debug_level: int = logging.DEBUG) -> None:
    # create logger
    logger = logging.getLogger(__name__)
    logger.setLevel(debug_level)
    logger.addHandler(logging.StreamHandler())

    # read json file
    with pathlib.Path(json_path).open("r", encoding="utf-8") as json_file:
        json_data = json.load(json_file)

    for task in json_data:
        logger.info("Task ID: %d", task["id"])
        for annotation in task["annotations"]:
            logger.info("Annotation ID: %d", annotation["id"])
            for item in annotation["result"]:
                if len(item["value"]["text"]) != item["value"]["endOffset"] - item["value"]["startOffset"]:
                    logger.critical(
                        "LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: %d, Annotation ID: %d, Result ID: %s",
                        task["id"],
                        annotation["id"],
                        item["id"],
                    )
                    logger.critical("Text: %s", item["value"]["text"])
                    logger.critical("text length: %d", len(item["value"]["text"]))
                    logger.critical("offset 'length': %d", item["value"]["endOffset"] - item["value"]["startOffset"])
                    logger.critical("offsets: %d - %d", item["value"]["startOffset"], item["value"]["endOffset"])
                    logger.critical("")

                if (
                    len(item["value"]["text"])
                    != item["value"]["globalOffsets"]["end"] - item["value"]["globalOffsets"]["start"]
                ):
                    logger.critical(
                        "LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: %d, Annotation ID: %d, Result ID: %s",
                        task["id"],
                        annotation["id"],
                        item["id"],
                    )
                    logger.critical("Text: %s", item["value"]["text"])
                    logger.critical("text length: %d", len(item["value"]["text"]))
                    logger.critical(
                        "globaloffset 'length': %d",
                        item["value"]["globalOffsets"]["end"] - item["value"]["globalOffsets"]["start"],
                    )
                    logger.critical(
                        "globaloffsets: %d - %d",
                        item["value"]["globalOffsets"]["start"],
                        item["value"]["globalOffsets"]["end"],
                    )
                    logger.critical("")

                logger.info("Result ID: %s", item["id"])
                logger.info("Result text: %s", item["value"]["text"])
                logger.info("startOffset: %d", item["value"]["startOffset"])
                logger.info("endOffset: %d", item["value"]["endOffset"])
                logger.info("globalOffsets (start): %d", item["value"]["globalOffsets"]["start"])
                logger.info("globalOffsets (end): %d", item["value"]["globalOffsets"]["end"])
                logger.info("")

    return None

if __name__ == "__main__":
    # see Additional context
    path_to_json = "path/to/small.json"

    debug_annotation(path_to_json, debug_level=logging.CRITICAL)
```

OUTPUT

LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: 132570841, Annotation ID: 45365694, Result ID: oicRy3z29K
Text: こんにちは!\nこれはサンプルです。
text length: 18
offset 'length': 10
offsets: 0 - 10

LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: 132570841, Annotation ID: 45365694, Result ID: oicRy3z29K
Text: こんにちは!\nこれはサンプルです。
text length: 18
globaloffset 'length': 21
globaloffsets: 6 - 27

LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: 132570841, Annotation ID: 45365694, Result ID: 5Fsy0sv91L
Text: こんにちは!\nこれはサンプルです。
text length: 18
offset 'length': 10
offsets: 0 - 10

LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: 132570841, Annotation ID: 45365694, Result ID: 5Fsy0sv91L
Text: こんにちは!\nこれはサンプルです。
text length: 18
globaloffset 'length': 21
globaloffsets: 6 - 27

LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: iokeAq7ZtS
Text: This is an example paragraph using the Fira Code font which supports ligatures.\n\n
text length: 83
offset 'length': 79
offsets: 0 - 79

LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: iokeAq7ZtS
Text: This is an example paragraph using the Fira Code font which supports ligatures.\n\n
text length: 83
globaloffset 'length': 79
globaloffsets: 27 - 106

LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: XRLK45yuq-
Text: Ligature Example\nThis is an example paragraph using the Fira Code font which supports ligatures.\n\nCommon ligatures include symbols like: ==, !=, ===, <=,>=, -->, and others.
text length: 176
offset 'length': 75
offsets: 0 - 75

LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: XRLK45yuq-
Text: Ligature Example\nThis is an example paragraph using the Fira Code font which supports ligatures.\n\nCommon ligatures include symbols like: ==, !=, ===, <=,>=, -->, and others.
text length: 176
globaloffset 'length': 181
globaloffsets: 5 - 186

LEN OF TEXT NOT MATCHED (`end-start` offsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: RBB6g6SYJw
Text: Ligature Example\nThis is an example paragraph using the Fira Code font which supports ligatures.\n\nCommon ligatures include symbols like: ==, !=, ===, <=,>=, -->, and others.
text length: 176
offset 'length': 75
offsets: 0 - 75

LEN OF TEXT NOT MATCHED (`end-start` globalOffsets): Task ID: 132570842, Annotation ID: 45365741, Result ID: RBB6g6SYJw
Text: Ligature Example\nThis is an example paragraph using the Fira Code font which supports ligatures.\n\nCommon ligatures include symbols like: ==, !=, ===, <=,>=, -->, and others.
text length: 176
globaloffset 'length': 181
globaloffsets: 5 - 186

**Expected behavior**
I expect the offset ranges (`endOffset - startOffset` and `globalOffsets.end - globalOffsets.start`) to match the length of the extracted text.

**Screenshots**
None

**Environment (please complete the following information):**

**Additional context**
HTML files were uploaded via the GUI in Google Chrome.

1) Output JSON file: small.json

2) UI config:

<View>
    <!--  Main panel -->
    <View
        style="padding: 0 1em; margin: 1em 0; background: #f1f1f1; position: sticky; top: 0; border-radius: 3px; z-index: 100">
        <View>
            <Header value="#7 My task description:" />
            <HyperTextLabels name="spans" toName="text">
                <Label value="highlighter" background="green" hotkey="t" />
            </HyperTextLabels>
        </View>
    </View>

    <View
        style="border: 1px solid #CCC;
                   border-radius: 10px;
                   padding: 5px;">
        <HyperText name="text" value="$html" inline="false" clickableLinks="false" />
    </View>
</View>

3) HTML samples

№1

<!DOCTYPE html>
<html lang="ja">

<head>
    <meta charset="UTF-8">
    <title>サンプル</title>
</head>

<body>

    <h1>こんにちは!</h1>
    <p>これはサンプルです。</p>

</body>

</html>

№2

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HTML Page with Ligatures</title>
    <style>
        /* Using Google Fonts to import a ligature-friendly font */
        @import url('https://fonts.googleapis.com/css2?family=Fira+Code:wght@400&display=swap');

        body {
            font-family: 'Fira Code', monospace;
            margin: 20px;
        }

        p {
            font-size: 16px;
        }
    </style>
</head>

<body>
    <h1>Ligature Example</h1>

    <p>This is an example paragraph using the Fira Code font which supports ligatures.</p>
    <p>Common ligatures include symbols like: ==, !=, ===, <=,>=, -->, and others.</p>
</body>

</html>

makseq commented 1 month ago

Thank you for asking; HyperText offsets really are confusing. Suppose the user selected all of the text in this code:

      <h1>こんにちは!</h1>
      <p>これはサンプルです。</p>

  1. Offset explanations:

    a) `startOffset` and `endOffset`: These are offsets within the text of the element where the selection starts or ends (the in-tag length of the content). For example:
    • <h1>こんにちは!</h1> has startOffset === 0
    • <p>これはサンプルです。</p> has endOffset === 10

      The value 10 is correct for the in-tag length calculation:

      "これはサンプルです。".length === 10

      However, this doesn't account for the content of the <h1> tag or for any characters between the tags.

    b) `globalOffsets`: These are closer to capturing the full text, as they count all characters in the text nodes of the HTML.
    • globaloffsets: 6 - 27. This suggests the HTML is not minified and includes whitespace:
      • 4 spaces at the start of each line (before <h1> and <p>)
      • some newline characters

      We can break it down like this:
    • For globalOffsets.start === 6, the selection is preceded by these 6 characters:
      ["\n", "\n", " ", " ", " ", " "]
    • For globalOffsets.end === 27, those 6 characters are followed by these 21:
      ["こ", "ん", "に", "ち", "は", "!", "\n", " ", " ", " ", " ", "こ", "れ", "は", "サ", "ン", "プ", "ル", "で", "す", "。"]

  2. Text extraction: The actual extracted text differs from both offset calculations. It's equivalent to what you'd get by selecting the region in a browser and pressing Cmd+C:

    こんにちは!
    これはサンプルです。

    This extracted text has a length of 17 characters (including the newline), which matches neither the endOffset (10) nor the globalOffsets range (27 - 6 = 21).

  3. Cause of discrepancies:
    • `startOffset` and `endOffset` only consider text within specific tags, not the entire document structure.
    • `globalOffsets` count all characters in the text nodes of the HTML source, including whitespace and newlines that aren't visible in the rendered output.
    • The extracted text only includes the visible, rendered content and ignores source-specific formatting characters.

  4. Implications: These discrepancies make it challenging to map offsets accurately to the extracted text, especially for tasks like highlighting or annotating specific portions of the text.

To resolve this issue, you would need to implement a method that:

  1. Maps the `globalOffsets` to the extracted text.
  2. Accounts for the differences between the HTML source and the rendered output.
  3. Possibly creates a parallel index that only counts visible characters.
  4. Or implements a conversion function between the source-based offsets and the rendered text positions.

This solution would need to handle various HTML structures, nested tags, and different types of whitespace to be robust and generally applicable.
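
As a rough illustration of the breakdown above, the sketch below flattens the text nodes of the selected fragment with Python's standard `html.parser` and slices the result with the reported global offsets. The `BODY` string is an assumption reconstructed from sample №1 (the content of `<body>`, indentation preserved), and `visible_index` is a hypothetical helper that approximates the "parallel index" idea by treating every whitespace character as invisible. It is not Label Studio's own algorithm.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the raw content of every text node, whitespace included."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_text_nodes(html_fragment):
    """Concatenate all text nodes of the fragment in document order."""
    parser = TextNodeCollector()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.chunks)


def visible_index(flat, offset):
    """Hypothetical helper: count non-whitespace characters before `offset`."""
    return sum(1 for ch in flat[:offset] if not ch.isspace())


# Assumption: the <body> content of sample №1, indentation preserved.
BODY = "\n\n    <h1>こんにちは!</h1>\n    <p>これはサンプルです。</p>\n\n"

flat = flatten_text_nodes(BODY)
selected = flat[6:27]  # globalOffsets.start / globalOffsets.end from the report

print(repr(flat[:6]))  # '\n\n    '  (the six characters before <h1>)
print(len(selected))   # 21
print(visible_index(flat, 6), visible_index(flat, 27))  # 0 16
```

A whitespace-skipping index like this is only a crude approximation of what a browser renders: it drops the newline between the two lines as well, which is why it yields 16 rather than 17.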

hlomzik commented 1 month ago

There is also an important nuance: you calculate the length of こんにちは!\nこれはサンプルです。 as 18, but \n should be a single character; it should not be counted literally as a backslash followed by n. So the actual length here is 17, as stated in the answer from @makseq.
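
The difference is easy to reproduce in Python. This is only an illustration of the counting; the two string literals are assumptions standing in for the `text` value from the export, one holding a backslash followed by `n` and one holding a real newline.

```python
# A backslash followed by "n": two characters.
literal = "こんにちは!\\nこれはサンプルです。"
# A real newline: one character.
real = "こんにちは!\nこれはサンプルです。"

print(len(literal))  # 18
print(len(real))     # 17

# Convert the two-character sequence into a real newline before measuring.
normalized = literal.replace("\\n", "\n")
print(len(normalized))  # 17
```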

rggdmonk commented 1 month ago

Thanks! @makseq @hlomzik

As a workaround in my case, is it possible to restrict users from producing overlapping spans while annotating?

Example (minimal): One label (green) can't overlap with itself (green).

hlomzik commented 1 month ago

Unfortunately, we don't have this. But overlapping regions will have overlapping offsets, so at least you can detect them, and in most cases you can merge them if the overlap is subtle.
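
A minimal sketch of that detection and merging, assuming every region carries `globalOffsets` with `start` and `end` keys as in the export. `merge_regions` is a hypothetical helper, not part of Label Studio; it treats regions that overlap or touch as one span and returns only the merged offsets, because the merged text would have to be re-sliced from the flattened document.

```python
def merge_regions(regions):
    """Merge regions whose globalOffsets overlap or touch."""
    spans = sorted(
        (r["globalOffsets"]["start"], r["globalOffsets"]["end"]) for r in regions
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous span.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{"globalOffsets": {"start": s, "end": e}} for s, e in merged]


regions = [
    {"globalOffsets": {"start": 0, "end": 2}},
    {"globalOffsets": {"start": 2, "end": 4}},
    {"globalOffsets": {"start": 0, "end": 3}},
    {"globalOffsets": {"start": 10, "end": 12}},
]
print(merge_regions(regions))
# [{'globalOffsets': {'start': 0, 'end': 4}}, {'globalOffsets': {'start': 10, 'end': 12}}]
```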

Also, what problems do you have with xpath + start/end? That approach should be pretty stable.

Another idea: you can use minified HTML. Then all the invisible whitespace goes away, so global offsets will be much more intuitive.
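
To illustrate the effect of minification, the sketch below removes whitespace-only gaps between tags and compares the flattened text-node lengths before and after. The regular expression is a deliberately crude stand-in for a real minifier (it would also drop meaningful spaces between inline elements and must not be applied inside `<pre>`), and `BODY` is an assumption reconstructed from sample №1.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the raw content of every text node, whitespace included."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_text_nodes(html_fragment):
    parser = TextNodeCollector()
    parser.feed(html_fragment)
    parser.close()
    return "".join(parser.chunks)


def crude_minify(html_fragment):
    """Drop whitespace-only runs between tags (illustration only)."""
    return re.sub(r">\s+<", "><", html_fragment.strip())


# Assumption: the <body> content of sample №1, indentation preserved.
BODY = "\n\n    <h1>こんにちは!</h1>\n    <p>これはサンプルです。</p>\n\n"

print(len(flatten_text_nodes(BODY)))                # 29
print(len(flatten_text_nodes(crude_minify(BODY))))  # 16
```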

rggdmonk commented 1 month ago

@hlomzik

I tried mapping overlapping spans using only global offsets.

Something like:

input_intervals = [
    {"text": "he", "globalOffsets": {"start": 0, "end": 2}},
    {"text": "ll", "globalOffsets": {"start": 2, "end": 4}},
    {"text": "hel", "globalOffsets": {"start": 0, "end": 3}},
]
expected_output = [
    {"text": "hell", "globalOffsets": {"start": 0, "end": 4}},
]

But it doesn't work in cases like this:

# all of these have different, wildly overlapping highlights
Hello word!
Hello word!
Hello word!

> Also what problems do you have with xpath+start/end? That way should be pretty stable.

If you have a sample of how to do it in Python, please share :) I tried, and the text I got did not match.
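
One possible way to reuse the xpath `start`/`end` values in Python, using only the standard library. This is a sketch under several assumptions: the paths are taken to look like `/h1[1]/text()[1]` and to be relative to the root of the rendered fragment, the offsets count characters inside the addressed text node, and the parsed tree matches the DOM the browser built (which is not guaranteed for malformed HTML). `build_tree`, `resolve`, and `extract` are hypothetical helpers, and the xpath values in the usage lines are made up for sample №1 rather than taken from the export.

```python
import re
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag        # element name, or "#text" for a text node
        self.parent = parent
        self.children = []
        self.text = ""        # used by "#text" nodes only


class TreeBuilder(HTMLParser):
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        siblings = self.current.children
        if siblings and siblings[-1].tag == "#text":
            siblings[-1].text += data  # keep adjacent text as one node
        else:
            node = Node("#text", self.current)
            node.text = data
            siblings.append(node)


def build_tree(html_fragment):
    builder = TreeBuilder()
    builder.feed(html_fragment)
    builder.close()
    return builder.root


STEP = re.compile(r"(text\(\)|[\w-]+)\[(\d+)\]")


def resolve(root, xpath):
    """Follow steps such as h1[1] or text()[1] down from the root."""
    node = root
    for name, index in STEP.findall(xpath):
        wanted = "#text" if name == "text()" else name.lower()
        matches = [c for c in node.children if c.tag == wanted]
        node = matches[int(index) - 1]
    return node


def iter_text(node):
    """Yield text nodes in document order."""
    for child in node.children:
        if child.tag == "#text":
            yield child
        else:
            yield from iter_text(child)


def extract(root, start, start_offset, end, end_offset):
    first, last = resolve(root, start), resolve(root, end)
    parts, inside = [], False
    for node in iter_text(root):
        if node is first:
            inside = True
        if inside:
            lo = start_offset if node is first else 0
            hi = end_offset if node is last else len(node.text)
            parts.append(node.text[lo:hi])
        if node is last:
            break
    return "".join(parts)


# Assumption: the <body> content of sample №1 and made-up xpath values.
BODY = "\n\n    <h1>こんにちは!</h1>\n    <p>これはサンプルです。</p>\n\n"
root = build_tree(BODY)
text = extract(root, "/h1[1]/text()[1]", 0, "/p[1]/text()[1]", 10)
print(repr(text))  # 'こんにちは!\n    これはサンプルです。'
print(len(text))   # 21
```

Under these assumptions the result includes the whitespace text node between the two tags, so its length equals the global offset range (21) rather than the length of the exported `text`.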

Context of my task: We want to annotate .html pages that contain useful text content. The aim is to evaluate different text extraction pipelines against human annotation.

hlomzik commented 1 month ago

> when you have something like this

Could you please share more details? I see just three lines of identical text, with no offsets and no HTML.

rggdmonk commented 1 month ago

@hlomzik thanks for your help!

I merge it like this: https://gist.github.com/rggdmonk/ba3429e25abaaf6ca7eeea841ce8e9a8 (It works. The mystery of the unexpected newlines and tabs is solved.)

I reread the message from https://github.com/HumanSignal/label-studio/issues/6446#issuecomment-2403622829, and I'm trying to understand how I can calculate agreement with non-strict matching. In my task, I think it is acceptable if an annotator's highlight differs by up to ±5 characters on the left and right sides of a single span.

Can this be done off-the-shelf?

Document:
12345Awesome54321

User A highlighted:
Awesome

User B highlighted:
12345Awesome

User C highlighted:
Awesome54321
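
One way to express that non-strict matching, sketched under the assumption that each highlight is reduced to a `(start, end)` pair of character offsets in the same document. `spans_agree`, `locate`, and the `tolerance` parameter are hypothetical names; this says nothing about what Label Studio offers off-the-shelf.

```python
def spans_agree(a, b, tolerance=5):
    """True if both boundaries differ by at most `tolerance` characters."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


document = "12345Awesome54321"


def locate(highlight):
    """Offsets of the first occurrence of `highlight` in the document."""
    start = document.index(highlight)
    return start, start + len(highlight)


user_a = locate("Awesome")       # (5, 12)
user_b = locate("12345Awesome")  # (0, 12)
user_c = locate("Awesome54321")  # (5, 17)

print(spans_agree(user_a, user_b))               # True
print(spans_agree(user_a, user_c))               # True
print(spans_agree(user_b, user_c))               # True
print(spans_agree(user_b, user_c, tolerance=4))  # False
```

Pairwise results like these could then be fed into whatever agreement statistic the evaluation uses.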

hlomzik commented 1 month ago

@makseq I think this question is for you