apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

non-ASCII punctuation not recognised as such #85

Closed jonorthwash closed 2 years ago

jonorthwash commented 4 years ago

The problem I'm currently having is that Arabic punctuation such as the comma, semicolon, and question mark (، U+060C, ؛ U+061B, ؟ U+061F, all in the U+0600 block) is not placed in the punctuation level of an lttoolbox transducer when converting from HFST transducers via the ATT format.

This is probably due to the use of iswpunct(), whose result depends on the current C locale, in this function: https://github.com/apertium/lttoolbox/blob/5e695022b26250e24f1235e33386a1e27e5c16e3/lttoolbox/att_compiler.cc#L375-L377

Resolving #81 would probably resolve this issue as well.
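
To illustrate the kind of check that would avoid the locale dependence of iswpunct(), here is a minimal sketch of a code-point-based test. The helper name `is_punct_codepoint` is hypothetical and is not part of lttoolbox, and the non-ASCII list only covers the three characters mentioned in this issue; a real fix would more likely consult the Unicode general category (for example through ICU's `u_ispunct`) rather than a hand-written list.

```cpp
// Hypothetical helper, NOT lttoolbox code: classifies a Unicode code point
// as punctuation without consulting the C locale.
bool is_punct_codepoint(char32_t c)
{
  // ASCII punctuation ranges, checked directly so the answer is the same
  // in every locale.
  if ((c >= 0x21 && c <= 0x2F) || (c >= 0x3A && c <= 0x40) ||
      (c >= 0x5B && c <= 0x60) || (c >= 0x7B && c <= 0x7E)) {
    return true;
  }

  // Illustrative (incomplete) list of non-ASCII punctuation: only the
  // Arabic characters reported in this issue.
  switch (c) {
    case 0x060C: // ARABIC COMMA
    case 0x061B: // ARABIC SEMICOLON
    case 0x061F: // ARABIC QUESTION MARK
      return true;
    default:
      return false;
  }
}
```

With a check like this, `is_punct_codepoint(0x061F)` is true regardless of which locale the process runs in, while ordinary letters such as `U'a'` are still treated as non-punctuation.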

mr-martian commented 2 years ago

```
$ cat blah.att
0   1   ؟   ؟
0   1   ،   ،
0   1   ؛   ؛
0   1   a   a
1
$ lt-comp lr blah.att blah.bin
final@inconditional 2 3
main@standard 2 1
$ lt-print blah.bin
0   1   ؟   ؟   0.000000
0   1   ،   ،   0.000000
0   1   ؛   ؛   0.000000
1   0.000000
--
0   1   a   a   0.000000
1   0.000000
```