Open jerinphilip opened 2 years ago
Right now annotations are stored in a `0..a..b..N` kind of way, where `0..a` is the first token, `a..b` the second, etc. For HTML tags it would work if each of those tokens could have a prefix, e.g. `0..A..a..B..b..C..N`, where `0..A` and `a..B` (and `C..N`!) would be token prefixes, in a way. That's how I already treat them in `HTML.cpp` (specifically `TokenFormatter`). These prefixes could of course be empty if there is no HTML.
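
To make the prefix idea concrete, here is a minimal sketch of how a flat boundary list could be read as (prefix, token) pairs. The `ByteRange` and `PrefixedToken` structs and the `splitPrefixed` and `view` helpers are hypothetical names for illustration only, not the existing `Annotation` API. The sketch also assumes boundaries strictly alternate between "end of prefix" and "end of token", and it does not model markup that trails the last token.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical types for illustration; not the actual Annotation API.
struct ByteRange {
  std::size_t begin;
  std::size_t end;  // exclusive
};

struct PrefixedToken {
  ByteRange prefix;  // markup preceding the token; empty when there is no HTML
  ByteRange token;   // the token text itself
};

// Reads boundaries laid out as: start, prefix end, token end, prefix end, token end, ...
// Assumption: the list has 1 + 2k entries for k tokens.
std::vector<PrefixedToken> splitPrefixed(const std::vector<std::size_t>& b) {
  std::vector<PrefixedToken> out;
  for (std::size_t i = 0; i + 2 < b.size(); i += 2) {
    out.push_back({{b[i], b[i + 1]}, {b[i + 1], b[i + 2]}});
  }
  return out;
}

// Returns the bytes of `text` covered by `r`.
std::string_view view(std::string_view text, ByteRange r) {
  return text.substr(r.begin, r.end - r.begin);
}
```

For the string `<b>Hello</b> world` and the boundaries `0, 3, 8, 12, 18`, this yields two tokens, `Hello` with prefix `<b>` and ` world` with prefix `</b>`. A token without markup simply gets a prefix where `begin == end`.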
However, there are also cases where HTML replaces text, e.g. `Crime & Punishment` becomes `Crime &amp; Punishment`. Those cases could not be covered by this. So the goal that the alignment scores stay with the actual tokens can sometimes only be achieved if the token itself is exchanged for its HTML counterpart.
It appears that HTML is inserting itself into tokens, modifying the `ByteRange`s in `Annotation` (it is expected to adjust offsets, but ideally not to add more characters).

I think @jelmervdl was faced with modifying `Annotation` as a whole to remove the constraint "ByteRanges should be contiguous, ie first.end == second.first".

To be more specific, the alignment scores should stay with the actual tokens, not with the tokens appended or prepended with HTML tags. Going from the former to the latter is possible at a client, while the inverse operation is not. We would thus be providing richer and more authentic information, which is not possible using `Annotation` while the contiguity constraint holds.
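
As a rough illustration of why one direction is recoverable at the client and the other is not, the sketch below takes token-only ranges (with gaps where markup sits) and widens each one to absorb the markup before it. `ByteRange`, `isContiguous` and `absorbGaps` are hypothetical names for this example, not part of the current `Annotation` interface, and markup trailing the last token is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not the actual Annotation API.
struct ByteRange {
  std::size_t begin;
  std::size_t end;  // exclusive
};

// True when every range starts exactly where the previous one ended.
bool isContiguous(const std::vector<ByteRange>& ranges) {
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    if (ranges[i - 1].end != ranges[i].begin) return false;
  }
  return true;
}

// Widens each token range so it also covers the bytes (e.g. HTML tags)
// between the previous token's end and its own begin.
std::vector<ByteRange> absorbGaps(const std::vector<ByteRange>& tokens) {
  std::vector<ByteRange> out;
  std::size_t previousEnd = 0;
  for (const ByteRange& token : tokens) {
    out.push_back({previousEnd, token.end});
    previousEnd = token.end;
  }
  return out;
}
```

With `<b>Hello</b> world`, the token-only ranges `3..8` and `12..18` are not contiguous, and `absorbGaps` turns them into `0..8` and `8..18`. Starting from `0..8` and `8..18` alone, there is no way to tell where the markup ends and the token begins without parsing the HTML again.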