Closed jonorthwash closed 2 years ago
$ cat blah.att
0 1 ؟ ؟
0 1 ، ،
0 1 ؛ ؛
0 1 a a
1
$ lt-comp lr blah.att blah.bin
final@inconditional 2 3
main@standard 2 1
$ lt-print blah.bin
0 1 ؟ ؟ 0.000000
0 1 ، ، 0.000000
0 1 ؛ ؛ 0.000000
1 0.000000
--
0 1 a a 0.000000
1 0.000000
The current problem I'm having is that Arabic commas, semicolons, question marks, etc. (،, ؛ ,؟ — all in the U+0600 block) are not placed in the punctuation level of an lttoolbox transducer when converting from HFST transducers via att format.
This is probably due to the use of iswpunct() in this function: https://github.com/apertium/lttoolbox/blob/5e695022b26250e24f1235e33386a1e27e5c16e3/lttoolbox/att_compiler.cc#L375-L377
Resolving #81 would probably resolve this issue as well.
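To illustrate the kind of check involved, here is a minimal sketch, assuming the classification really does hinge on iswpunct(), whose result depends on the active locale. The helper names is_arabic_punct and is_punct_fallback are hypothetical and are not part of lttoolbox; this is one possible workaround for discussion, not the project's actual fix.

```cpp
#include <cwctype>

// Hypothetical helper (not from lttoolbox): true for the three Arabic
// punctuation marks used in the blah.att example.
bool is_arabic_punct(std::wint_t c) {
  return c == 0x060C   // ARABIC COMMA
      || c == 0x061B   // ARABIC SEMICOLON
      || c == 0x061F;  // ARABIC QUESTION MARK
}

// Hypothetical wrapper: trust iswpunct() when it says yes, otherwise fall
// back to the explicit list, so the result for these three code points does
// not depend on the locale.
bool is_punct_fallback(std::wint_t c) {
  return std::iswpunct(c) != 0 || is_arabic_punct(c);
}
```

A hard-coded list like this only covers the three characters from the report; a more general approach would need a locale-independent source of Unicode character properties.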