strange lt-tmxproc number copying

apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium

http://wiki.apertium.org/wiki/Lttoolbox

GNU General Public License v2.0

Issue #191

Closed (unhammer closed this issue 2 weeks ago)

unhammer commented 2 weeks ago

$ cat test.tmx
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header
    creationtool="foo"
    creationtoolversion="1.0"
    segtype="phrase"
    o-tmf="tmx"
    adminlang="nb-NO"
    srclang="nb-NO"
    datatype="plaintext"
  />
  <body>
    <tu>
      <tuv xml:lang="nob">
        <seg>foo 1</seg>
      </tuv>
      <tuv xml:lang="nno">
        <seg>foo 1</seg>
      </tuv>
    </tu>
  </body>
</tmx>

$ lt-tmxcomp nob-nno test.tmx test.tmx.bin
nob->nno 9 8
$ echo '4 foo 1'| lt-tmxproc -s test.tmx.bin
4 [foo 4]
$ echo 'foo 4'| lt-tmxproc -s test.tmx.bin
[foo 4]
$ echo '5 foo 4'| lt-tmxproc -s test.tmx.bin
5 [foo 5]
$ echo '5 foo 1'| lt-tmxproc -s test.tmx.bin
5 [foo 5]

Why does it match numbers other than 1, and why does it "copy" previously seen numbers? 😵‍💫

unhammer commented 2 weeks ago

So apparently the TMX handling has a fancy feature for aligning translations even when the numbers in them differ:

$ lt-print test.tmx.bin
0       1       f       f       0.000000
1       2       o       o       0.000000
2       3       o       o       0.000000
3       4                       0.000000
4       5       <n>     @       0.000000
5       6       ε       (       0.000000
6       7       ε       1       0.000000
7       8       ε       )       0.000000
8       0.000000

https://github.com/apertium/lttoolbox/blob/39db772f42e2c7599571a48285bfc6da163ecf0c/lttoolbox/fst_processor.cc#L281-L303

Maybe handy if you have big TMX files with segments like "Stock market things went up by 999 % in Q5" and you want to reuse them even if things went up by just 234 %.

But it probably shouldn't copy the previous number – the alignment shouldn't look at stuff outside the matched segment. (And we may want to turn it off completely too?)

unhammer commented 2 weeks ago

https://github.com/apertium/lttoolbox/blob/39db772f42e2c7599571a48285bfc6da163ecf0c/lttoolbox/state.cc#L714-L749

😱

unhammer commented 2 weeks ago

This line: fragment[i] = fragment[i].substr(0, j) + numbers[num - 1]; assumes that the numbers vector holds only the numbers matched inside the fragment, but the processor also adds numbers that precede the match. The numbers vector is cleared only after a match; it should also be cleared before a match starts.
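To make the diagnosis concrete, here is a toy model of the behaviour. It is not the lttoolbox code: process, substitute, isNumber and the clearBeforeMatch flag are hypothetical names, and the single hard-coded entry "foo <n>" -> "foo @(1)" is an assumption modelled on the lt-print output above. It only illustrates how a numbers vector that is cleared after a match, but not before one, lets a preceding number leak into the substitution.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true if the token consists only of ASCII digits.
static bool isNumber(const std::string& tok) {
  if (tok.empty()) return false;
  for (unsigned char c : tok) {
    if (!std::isdigit(c)) return false;
  }
  return true;
}

// Replace every "@(k)" placeholder in a target string with numbers[k-1].
static std::string substitute(const std::string& target,
                              const std::vector<std::string>& numbers) {
  std::string out;
  for (std::size_t i = 0; i < target.size();) {
    if (target.compare(i, 2, "@(") == 0) {
      const std::size_t close = target.find(')', i);
      assert(close != std::string::npos);
      const int k = std::stoi(target.substr(i + 2, close - (i + 2)));
      assert(k >= 1 && static_cast<std::size_t>(k) <= numbers.size());
      out += numbers[k - 1];
      i = close + 1;
    } else {
      out += target[i++];
    }
  }
  return out;
}

// Toy processor with one assumed TM entry: "foo <n>" -> "foo @(1)".
// clearBeforeMatch = false models the reported behaviour,
// clearBeforeMatch = true models the fix suggested in the thread.
std::string process(const std::string& input, bool clearBeforeMatch) {
  std::istringstream in(input);
  std::vector<std::string> toks;
  for (std::string t; in >> t;) toks.push_back(t);

  std::vector<std::string> numbers;
  std::string out;
  for (std::size_t i = 0; i < toks.size();) {
    if (!out.empty()) out += ' ';
    const bool startsMatch =
        toks[i] == "foo" && i + 1 < toks.size() && isNumber(toks[i + 1]);
    if (startsMatch) {
      if (clearBeforeMatch) numbers.clear();  // drop numbers seen before the match
      numbers.push_back(toks[i + 1]);         // the number inside the segment
      out += '[' + substitute("foo @(1)", numbers) + ']';
      numbers.clear();                        // cleared after a match
      i += 2;
    } else {
      // Numbers outside any match are still recorded, which is the leak.
      if (isNumber(toks[i])) numbers.push_back(toks[i]);
      out += toks[i];
      ++i;
    }
  }
  return out;
}
```

With clearBeforeMatch set to false, process("4 foo 1", false) gives "4 [foo 4]", the same shape as the output reported above; with it set to true the result is "4 [foo 1]", since @(1) then refers to the first number inside the matched segment.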