hectoralos closed this issue 4 years ago
If lt-proc says the input stream is malformed, it could be that the step before lt-proc is producing bad output. It would be helpful to see exactly where this happens, though I know some of the testvoc scripts are kind of hairy …
Thanks @unhammer. This helped me see that (at least) this is causing the crash:
echo "[\^Chemin\# de Sent-Jaque<np><top><f><sg>$]Voie lactée ~." | lt-proc -p '/home/hector/apertium/apertium-fra-frp/frp-fra.autopgen.bin'
It seems that the # in the lemma (itself an error) is causing the crash.
hm, but that's in a superblank, should've been ignored
So the problem can be found in https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L40.
There is a predefined list of characters that may be escaped, and # is not one of them. Ideally a blank would simply be read from [ to ], but the code is written so that any escaped character must be on that list. See: https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L229 . The error is raised at: https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L204.
Just don't escape the # and the crash goes away. Or, if this # was escaped by the deformatter, we can add # to the list of escaped characters. Or, since the deformatter is being rewritten, we can just make it stop escaping #.
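To make the "read from [ to ]" idea concrete, here is a minimal sketch of a lenient superblank reader. It is not lttoolbox code: `read_superblank` is a hypothetical helper, and it assumes the only special handling needed inside a blank is that a backslash protects the character after it, whatever that character is.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of lttoolbox): copies one superblank
// verbatim, brackets included. `pos` must point at the opening '[' and is
// advanced to just past the matching unescaped ']'.
std::string read_superblank(const std::string& in, std::size_t& pos) {
    if (pos >= in.size() || in[pos] != '[') {
        throw std::invalid_argument("expected '[' at start of superblank");
    }
    std::string out;
    out += in[pos++];  // opening '['
    while (pos < in.size()) {
        char c = in[pos++];
        out += c;
        if (c == '\\') {
            if (pos >= in.size()) {
                throw std::runtime_error("dangling backslash in superblank");
            }
            // Accept ANY escaped character, including '#': the content of a
            // blank is opaque, so there is no list to check against.
            out += in[pos++];
        } else if (c == ']') {
            return out;  // unescaped ']' closes the blank
        }
    }
    throw std::runtime_error("unterminated superblank");
}
```

With a reader like this, the blank from the failing command, `[\^Chemin\# de Sent-Jaque<np><top><f><sg>$]`, would be passed through unchanged and processing would resume at `Voie lactée`.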
I would say that's a bug. That's an incorrect way to implement reading escaped input. Too much escaping must not be an error or even a warning.
I agree. I don't know why it has to be strict with escaping. It's not even consistent with most of the modules. Although with these things I just assume whoever coded it had a use case in their mind so maybe they can tell us.
> I would say that's a bug. That's an incorrect way to implement reading escaped input. Too much escaping must not be an error or even a warning.
+1
I checked the code and the variable escaped_chars is being used a lot throughout the code for parsing and a lot of other stuff. Might not be trivial to fix.
Using it for which characters to escape in output is fine. Using it for which to unescape in input is not. Can't be that hard to change all the input places.
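A rough sketch of that asymmetry, under stated assumptions: the character set below is illustrative only (the real list is the one in fst_processor.cc), and `escape_for_output` and `unescape_input` are hypothetical names, not lttoolbox functions. Output escaping consults the list; input unescaping never does.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Illustrative set only; the authoritative list lives in fst_processor.cc.
const std::set<char> kEscapeOnOutput = {
    '[', ']', '{', '}', '^', '$', '/', '\\', '@', '<', '>'};

// Output side: escape exactly the characters on the list.
std::string escape_for_output(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (kEscapeOnOutput.count(c) != 0) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Input side: a backslash protects whatever follows it, listed or not,
// so "too much escaping" (such as "\#") is silently accepted.
std::string unescape_input(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 1 < text.size()) {
            ++i;  // skip the backslash and keep the next character as-is
        }
        out += text[i];
    }
    return out;
}
```

Anything produced by `escape_for_output` still round-trips through `unescape_input`, so being lenient on input costs nothing for well-formed streams.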
Hmm so build up the list of escaped_chars as the input is being read?
Today, after several weeks, I ran testvoc.sh in apertium-fra-frp and I got this crash. I am not able to interpret its message:
The problem is surely related to the previous issues, although this is a different language pair and its transfer rules are quite different from those of apertium-fra-cat, which caused issue apertium/apertium#96 and maybe also apertium/apertium#95 (fra-cat has a one-step transfer while fra-frp has 3+ steps).