Open rggdmonk opened 1 month ago
Thank you for asking, HyperText is really confusing. Suppose the user selected all the text in this HTML:
<h1>こんにちは!</h1>
<p>これはサンプルです。</p>
<h1>こんにちは!</h1>
has startOffset === 0
<p>これはサンプルです。</p>
has endOffset === 10
The value 10 is correct for the in-tag length calculation: "これはサンプルです。".length === 10
However, this doesn't account for the content in the <h1> tag or any characters between the tags.
b) globalOffsets:
These are closer to capturing the full text, as they consider all characters in the text nodes of the HTML.
globaloffsets: 6 - 27
This suggests the HTML is not minified and includes whitespace before <h1> and <p>.
globalOffsets.start === 6:
["\n", "\n", " ", " ", " ", " "]
globalOffsets.end === 27:
["こ", "ん", "に", "ち", "は", "!", "\n", " ", " ", " ", " ", "こ", "れ", "は", "サ", "ン", "プ", "ル", "で", "す", "。"]
cmd+c:
こんにちは!
これはサンプルです。
This extracted text has a length of 17 characters (including the newline), which doesn't match either the endOffset (10) or the globalOffset range (27 - 6 = 21).
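The three numbers can be reproduced with a short Python sketch. It assumes the source HTML looks like the string below (reconstructed from the character lists above, so the exact indentation is a guess) and that global offsets count every character of every text node, as described above. `TextNodeCollector` is a hypothetical helper, not part of Label Studio.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the raw content of every text node, whitespace included."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed source: two newlines and four spaces before <h1>,
# one newline and four spaces before <p>.
html = "\n\n    <h1>こんにちは!</h1>\n    <p>これはサンプルです。</p>"

collector = TextNodeCollector()
collector.feed(html)
collector.close()
all_text = "".join(collector.chunks)

# Slice with the reported global offsets (6 - 27).
selected = all_text[6:27]

# What cmd+c puts on the clipboard.
copied = "こんにちは!\nこれはサンプルです。"

print(len(all_text))   # 27
print(len(selected))   # 21
print(len(copied))     # 17
```

Under these assumptions, the 4-character gap between 21 and 17 is the indentation before <p>, which sits in a text node but is not part of the copied text.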
To resolve this issue, you would need to implement a method that:
There is also an important nuance: you calculated the length of こんにちは!\nこれはサンプルです。 as 18, but \n should count as one character; it should not be taken literally as a backslash followed by n. So the actual length here is 17, as stated in the answer from @makseq.
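A quick way to see the difference in Python (a minimal illustration; the variable names are only for this example):

```python
# "\n" in a normal string literal is a single newline character.
copied = "こんにちは!\nこれはサンプルです。"

# In a raw string the backslash and the "n" stay as two separate characters.
literal = r"こんにちは!\nこれはサンプルです。"

print(len(copied))   # 17
print(len(literal))  # 18
```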
Thanks! @makseq @hlomzik
As a workaround in my case, is it possible to restrict users from producing overlapping spans while annotating?
Example (minimal): One label (green) can't overlap with itself (green).
Unfortunately, we don't have this. But overlapping regions will have overlapping offsets, so at least you can detect them and you can merge them in most cases if overlap is subtle.
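A minimal Python sketch of that detect-and-merge idea, working only on global offsets. `merge_spans` is a hypothetical helper, not a Label Studio API. It assumes every span's text has exactly end - start characters, which may not hold when the HTML contains invisible whitespace.

```python
def merge_spans(spans):
    """Merge spans whose global offsets overlap or touch.

    Each span is a dict like
    {"text": "he", "globalOffsets": {"start": 0, "end": 2}}.
    Assumes len(span["text"]) == end - start for every span.
    """
    ordered = sorted(
        spans,
        key=lambda s: (s["globalOffsets"]["start"], s["globalOffsets"]["end"]),
    )
    merged = []
    for span in ordered:
        start = span["globalOffsets"]["start"]
        end = span["globalOffsets"]["end"]
        if merged and start <= merged[-1]["globalOffsets"]["end"]:
            last = merged[-1]
            last_end = last["globalOffsets"]["end"]
            if end > last_end:
                # Append only the part that extends past the merged span.
                last["text"] += span["text"][last_end - start:]
                last["globalOffsets"]["end"] = end
        else:
            # No overlap with the previous span: start a new merged span.
            merged.append(
                {"text": span["text"], "globalOffsets": {"start": start, "end": end}}
            )
    return merged


sample = [
    {"text": "he", "globalOffsets": {"start": 0, "end": 2}},
    {"text": "ll", "globalOffsets": {"start": 2, "end": 4}},
    {"text": "hel", "globalOffsets": {"start": 0, "end": 3}},
]
print(merge_spans(sample))
# [{'text': 'hell', 'globalOffsets': {'start': 0, 'end': 4}}]
```

Touching spans (one ends where the next starts) are merged as well; change `<=` to `<` to keep them separate.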
Also what problems do you have with xpath+start/end? That way should be pretty stable.
Another idea: you can use minified HTML, then all the invisible whitespace will go away, so global offsets will be much more intuitive.
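A very naive Python sketch of that minification step, assuming the pages contain no <pre> blocks or inline elements where whitespace between tags matters. `strip_inter_tag_whitespace` is a hypothetical helper; a real HTML minifier would be safer for production pages.

```python
import re


def strip_inter_tag_whitespace(html):
    """Naively remove whitespace-only runs between tags and at both ends."""
    return re.sub(r">\s+<", "><", html).strip()


# Assumed unminified source, with indentation before each tag.
html = "\n\n    <h1>こんにちは!</h1>\n    <p>これはサンプルです。</p>"
print(strip_inter_tag_whitespace(html))
# <h1>こんにちは!</h1><p>これはサンプルです。</p>
```

After this step the text nodes of the sample hold 16 characters with no whitespace between them.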
@hlomzik
I tried mapping overlapping spans using only global offsets.
Something like:
input_intervals = [
{"text": "he", "globalOffsets": {"start": 0, "end": 2}},
{"text": "ll", "globalOffsets": {"start": 2, "end": 4}},
{"text": "hel", "globalOffsets": {"start": 0, "end": 3}},
]
expected_output = [
{"text": "hell", "globalOffsets": {"start": 0, "end": 4}},
]
But it doesn't work for cases when you have something like this:
# all of these have different, wildly overlapping spans
Hello word!
Hello word!
Hello word!
Also what problems do you have with xpath+start/end? That way should be pretty stable.
If you have a sample of how to do it in Python, please share :) I tried, but the text I got did not match.
Context of my task: we want to annotate .html pages with useful text content. The aim is to evaluate different text extraction pipelines against human annotation.
when you have something like this
Could you please share more details? I see just 3 lines of identical text, no offsets, no HTML.
@hlomzik thanks for your help!
I merged it like this: https://gist.github.com/rggdmonk/ba3429e25abaaf6ca7eeea841ce8e9a8 (It works. The mystery with unexpected newlines and tabs is solved.)
I reread the message from https://github.com/HumanSignal/label-studio/issues/6446#issuecomment-2403622829 and I'm trying to understand how I can calculate agreement with non-strict matching. I suppose that in my task it is okay if an annotator's span differs by up to ±5 characters on the left and right sides (for a single span).
Can this be done off-the-shelf?
Document:
12345Awesome54321
User A highlighted:
Awesome
User B highlighted:
12345Awesome
User C highlighted:
Awesome54321
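A minimal Python sketch of such non-strict matching for the example above. It is not a built-in Label Studio metric; `find_span` and `spans_agree` are hypothetical helpers. Two spans count as agreeing when both their start and end offsets differ by at most `tolerance` characters.

```python
def find_span(document, highlighted):
    """Return (start, end) offsets of the first occurrence of the highlight."""
    start = document.index(highlighted)
    return start, start + len(highlighted)


def spans_agree(a, b, tolerance=5):
    """True if both boundaries of the two spans are within `tolerance` characters."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


document = "12345Awesome54321"
user_a = find_span(document, "Awesome")        # (5, 12)
user_b = find_span(document, "12345Awesome")   # (0, 12)
user_c = find_span(document, "Awesome54321")   # (5, 17)

print(spans_agree(user_a, user_b))  # True
print(spans_agree(user_a, user_c))  # True
print(spans_agree(user_b, user_c))  # True
print(spans_agree(user_a, user_b, tolerance=4))  # False
```

With real annotations the (start, end) pairs would come from the exported offsets rather than from a string search.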
@makseq I think this question is for you
Describe the bug
Hi! I don't understand how I should use HyperText and HyperTextLabels for span annotation in .html files. It seems only the text field works correctly. The text is not the same for endOffset and startOffset:
text length: 83
globaloffset 'length': 79
globaloffsets: 27 - 106
OUTPUT
Expected behavior
I expect the offsets and global offsets to match the length of the extracted text.
Screenshots None
Environment (please complete the following information):
Additional context
HTML files were uploaded via GUI in Google Chrome.
1) Output json file small.json
2) UI config:
3) HTML samples
№1
№2