browsermt / bergamot-translator

Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.
http://browser.mt
Mozilla Public License 2.0

HTML should not be annotated as tokens with alignment-scores #298

Open jerinphilip opened 2 years ago

jerinphilip commented 2 years ago

It appears that HTML is inserting itself into tokens, modifying the ByteRanges in Annotation (the offsets are expected to shift, but ideally the ranges should not grow to include extra characters).

I think @jelmervdl was faced with modifying Annotation as a whole to remove the constraint that "ByteRanges should be contiguous, ie first.end == second.first".


To be more specific, the alignment scores should stay with the actual tokens, not with tokens that have HTML tags appended or prepended. Going from the former to the latter is possible at a client, while the inverse operation is not. We would thus be providing richer and more authentic information, which is not possible using Annotation while the contiguity constraint holds.
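To illustrate why only one direction is recoverable, here is a minimal sketch. The `ByteRange` struct and the `absorbMarkup` function are hypothetical names made up for this example, not the bergamot-translator API. It assumes token ranges that leave gaps where the markup sits, and shows how a client could widen them into contiguous ranges:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for illustration only; end is exclusive.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

// Widen each token range so that it also covers the markup in front of it.
// The result is contiguous (previous.end == next.begin), but the positions
// where markup stops and token text starts are no longer recorded.
std::vector<ByteRange> absorbMarkup(const std::vector<ByteRange>& tokens,
                                    std::size_t textSize) {
  std::vector<ByteRange> merged;
  merged.reserve(tokens.size());
  std::size_t cursor = 0;
  for (const ByteRange& token : tokens) {
    merged.push_back({cursor, token.end});
    cursor = token.end;
  }
  // Assumption: any trailing markup is attached to the last token.
  if (!merged.empty()) {
    merged.back().end = textSize;
  }
  return merged;
}
```

For `<b>Hello world</b>` with tokens at 3..8 and 8..14, this yields 0..8 and 8..18. Nothing in the merged output says that the first three bytes were a tag, which is why the inverse cannot be done without extra information.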

jelmervdl commented 2 years ago

Right now annotations are stored in a 0..a..b..N kind of way, where 0..a is the first token, a..b the second, etc. For HTML tags it would work if each of those tokens could have a prefix, e.g. 0..A..a..B..b..C..N where 0..A and a..B (and C..N!) would be token prefixes in a way? That's how I treat them in HTML.cpp (specifically TokenFormatter) already. These prefixes could be empty of course if there is no HTML.
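A rough sketch of that layout, under stated assumptions: the boundaries are kept in one flat list `0, A, a, B, b, ..., N` with `2n + 2` entries for `n` tokens, and the last pair is trailing markup. The names `PrefixedToken`, `PrefixedAnnotation` and `splitBoundaries` are invented for this example and do not exist in the library or in HTML.cpp.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical types for illustration only; end is exclusive.
struct ByteRange {
  std::size_t begin;
  std::size_t end;
};

struct PrefixedToken {
  ByteRange prefix;  // markup in front of the token; empty when begin == end
  ByteRange token;   // the actual token text the alignment scores refer to
};

struct PrefixedAnnotation {
  std::vector<PrefixedToken> tokens;
  ByteRange trailing;  // markup after the last token (the C..N part)
};

// Interpret a flat boundary list 0, A, a, B, b, ..., C, N as
// prefix/token pairs followed by one trailing markup range.
PrefixedAnnotation splitBoundaries(const std::vector<std::size_t>& b) {
  if (b.size() < 2 || b.size() % 2 != 0) {
    throw std::invalid_argument("expected 2n + 2 boundaries");
  }
  PrefixedAnnotation out;
  const std::size_t n = (b.size() - 2) / 2;
  for (std::size_t i = 0; i < n; ++i) {
    out.tokens.push_back({{b[2 * i], b[2 * i + 1]},
                          {b[2 * i + 1], b[2 * i + 2]}});
  }
  out.trailing = {b[2 * n], b[2 * n + 1]};
  return out;
}
```

For `<b>Hello</b> world` the boundaries would be `0, 3, 8, 12, 18, 18`: prefix `<b>` at 0..3, token `Hello` at 3..8, prefix `</b>` at 8..12, token ` world` at 12..18, and an empty trailing range. The whole text is still covered contiguously, yet the token ranges themselves stay free of markup.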

However, there are also cases where HTML replaces text, e.g. `Crime & Punishment` becomes `Crime &amp; Punishment`. Those cases could not be covered by this. So the goal that "the alignment scores should stay with the actual tokens" can sometimes only be achieved if the token itself is changed for its HTML counterpart.
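A small sketch of the replacement case, assuming a simple escaper for `&`, `<` and `>`. The function `escapeHtml` is written for this example and is not the code in HTML.cpp. It shows that the token text itself changes length, so no prefix range can describe the difference:

```cpp
#include <string>
#include <string_view>

// Illustrative escaper: replaces the three characters that must be
// escaped in HTML text content. Not the library's implementation.
std::string escapeHtml(std::string_view text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default:  out += c; break;
    }
  }
  return out;
}
```

Here the one-byte token `&` turns into the five-byte `&amp;`, so every offset after it shifts by four and the token's own range has to grow. The extra bytes sit inside the token rather than in front of it.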