HTML should not be annotated as tokens with alignment-scores

browsermt / bergamot-translator

Cross platform C++ library focusing on optimized machine translation on the consumer-grade device.

Mozilla Public License 2.0

Right now annotations are stored in a 0..a..b..N kind of way, where 0..a is the first token, a..b the second, etc. For HTML tags this could work if each of those tokens had a prefix, e.g. 0..A..a..B..b..C..N, where 0..A and a..B (and C..N!) would be the token prefixes. That's how I already treat them in HTML.cpp (specifically TokenFormatter). These prefixes would simply be empty if there is no HTML.
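To make the proposed layout concrete, here is a minimal sketch of one way to store a prefix per token in a single flat boundary list. It is not the existing Annotation class: `PrefixedAnnotation`, its members, and `makeExample` are hypothetical names, and the exact boundary order (prefix start, token start, and so on, with one extra range for markup after the last token) is my assumption about what the scheme above could look like.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch, NOT bergamot-translator's Annotation class.
// Assumed layout for n tokens (2n + 2 boundaries):
//   prefix i   = [boundaries[2i],     boundaries[2i + 1])
//   token i    = [boundaries[2i + 1], boundaries[2i + 2])
//   trailing   = [boundaries[2n],     boundaries[2n + 1])
// A prefix is empty when there is no markup in front of its token.
struct PrefixedAnnotation {
  std::string text;
  std::vector<std::size_t> boundaries;

  std::size_t numTokens() const {
    return boundaries.size() < 2 ? 0 : (boundaries.size() - 2) / 2;
  }

  // View into `text`; only valid while this object is alive.
  std::string_view slice(std::size_t begin, std::size_t end) const {
    return std::string_view(text).substr(begin, end - begin);
  }

  std::string_view prefix(std::size_t i) const {
    return slice(boundaries[2 * i], boundaries[2 * i + 1]);
  }

  std::string_view token(std::size_t i) const {
    return slice(boundaries[2 * i + 1], boundaries[2 * i + 2]);
  }

  // Markup after the last token, e.g. closing tags.
  std::string_view trailing() const {
    std::size_t n = numTokens();
    return slice(boundaries[2 * n], boundaries[2 * n + 1]);
  }

  // Concatenate only the tokens, skipping all markup. Alignment scores
  // would be indexed by token, so markup never receives a score.
  std::string plainText() const {
    std::string out;
    for (std::size_t i = 0; i < numTokens(); ++i) out += token(i);
    return out;
  }
};

// Example: two tokens, "Hello " and "world", with markup around them.
PrefixedAnnotation makeExample() {
  return PrefixedAnnotation{"<p>Hello <b>world</b></p>",
                            {0, 3, 9, 12, 17, 25}};
}
```

With this layout, `makeExample().prefix(1)` is `<b>`, `token(1)` is `world`, and `trailing()` is `</b></p>`, while the number of tokens (and therefore the shape of the alignment matrix) is the same as for the plain text.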

However, there are also cases where HTML replaces text, e.g. Crime & Punishment becomes Crime &amp; Punishment. Those cases could not be covered by prefixes. So keeping the alignment scores with the actual tokens can sometimes only be achieved if the token itself is exchanged for its HTML counterpart.
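A rough sketch of the "exchange the token itself" case, assuming tokens are available as separate strings: each token is escaped on its own, so token i before and after escaping is the same unit and row i of the alignment scores still applies. `escapeHtml` and `escapeTokens` are hypothetical helpers for illustration, not functions from HTML.cpp.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: escape the characters that are special in HTML text.
std::string escapeHtml(std::string_view in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default:  out += c; break;
    }
  }
  return out;
}

// Escape token by token. The number and order of tokens is unchanged,
// only their text (and therefore their byte length) may change.
std::vector<std::string> escapeTokens(const std::vector<std::string>& tokens) {
  std::vector<std::string> out;
  out.reserve(tokens.size());
  for (const std::string& t : tokens) out.push_back(escapeHtml(t));
  return out;
}
```

For the tokens `"Crime"`, `" &"`, `" Punishment"` this yields `"Crime"`, `" &amp;"`, `" Punishment"`: still three tokens, but the byte ranges of the second and third have shifted, so the boundaries would need to be recomputed from the escaped tokens.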

HTML should not be annotated as tokens with alignment-scores #298