IRT-Open-Source / scf

Subtitling Conversion Framework
Apache License 2.0

STLXML-SplitBlocks: composite sequence spreads over different TTI blocks #59

Closed spoeschel closed 4 years ago

spoeschel commented 5 years ago

When a character with a diacritical mark is represented in STLXML by two Unicode codepoints instead of a single one (for example, a character that has no precomposed single-codepoint form), the pair can fall at a position where STLXML-SplitBlocks moves the two codepoints into different TTI blocks.

This breaks the later conversion from STLXML to STL. In Unicode the combining diacritical mark is a suffix: it follows its base character. In STL, in contrast, the corresponding diacritical mark is a prefix. STLXML2STL swaps the order of the two codepoints, but the swap only operates within a single TTI block and does not reach across TTI block borders, so it cannot handle the case described above.

So STLXML-SplitBlocks must not store the two codepoints of such a pair in different TTI blocks.
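
To illustrate the requirement, here is a minimal Python sketch of a split that never separates a base character from the combining marks that follow it. It is not the module's actual code: the function names `safe_split_index` and `split_blocks` and the `limit` parameter (a maximum number of codepoints per block) are hypothetical, and detecting marks with `unicodedata.combining` is a simplification that only covers marks with a non-zero canonical combining class.

```python
import unicodedata


def safe_split_index(text: str, limit: int) -> int:
    """Return the largest index <= limit at which text can be cut
    without separating a base character from a following combining mark."""
    index = min(limit, len(text))
    # If the codepoint right after the cut is a combining mark, the cut
    # would orphan it from its base character, so move the cut backwards.
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return index


def split_blocks(text: str, limit: int) -> list[str]:
    """Split text into blocks of at most `limit` codepoints, keeping
    every base character together with its combining marks."""
    blocks = []
    while text:
        index = safe_split_index(text, limit)
        if index == 0:
            # Assumption: a composite sequence longer than `limit` is kept
            # whole (the block exceeds the limit) rather than being broken.
            index = 1
            while index < len(text) and unicodedata.combining(text[index]) != 0:
                index += 1
        blocks.append(text[:index])
        text = text[index:]
    return blocks


# "e" + U+0301 (combining acute accent) sits exactly on the block border.
sample = "abce\u0301f"
blocks = split_blocks(sample, 4)
```

With a limit of 4, a naive cut would put `e` at the end of the first block and U+0301 at the start of the second. The sketch cuts one codepoint earlier, so the pair stays in the same block and a later per-block swap can still reorder it.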