apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Allow escaping any character as opposed to only a predefined list of escaped characters #103

Closed by hectoralos 4 years ago

hectoralos commented 4 years ago

Today, for the first time in several weeks, I ran testvoc.sh in apertium-fra-frp and got this crash. I can't interpret its message:

./testvoc.sh -e
== Arpitan > Français =========================
Error: Malformed input stream./tmp/testvoc.h2cksYCTl7r: line 1:  1973 Broken pipe  apertium-pretransfer
      1974 Segmentation fault    (core dumped) | lt-proc -b '/home/hector/apertium/apertium-fra-frp/frp-fra.autobil.bin'

The problem is surely related to the previous issues, although this is a different language pair and its transfer rules are quite different from those of apertium-fra-cat, which caused issue apertium/apertium#96 and maybe also apertium/apertium#95 (fra-cat has one-step transfer, while fra-frp has 3+ steps).

unhammer commented 4 years ago

if lt-proc says the input stream is malformed, it could be that the step before lt-proc is giving something bad. It would be helpful to see exactly where this happens, though I know some of the testvoc scripts are kind of hairy …

hectoralos commented 4 years ago

Thanks @unhammer. This helped me see that (at least) this is causing the crash:

echo "[\^Chemin\# de Sent-Jaque<np><top><f><sg>$]Voie lactée ~." | lt-proc -p '/home/hector/apertium/apertium-fra-frp/frp-fra.autopgen.bin'

It seems that # in the lemma (an error) is causing the crash.

unhammer commented 4 years ago

hm, but that's in a superblank, should've been ignored

khannatanmai commented 4 years ago

So the problem can be found in https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L40.

There is a predefined list of characters that are allowed to be escaped, and # is not one of them. While ideally blanks should be read just from [ to ], the code is written such that any escaped character inside them has to be on this list. See: https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L229 . The error is raised here: https://github.com/apertium/lttoolbox/blob/cb9f86e3ad08498db8cf1dff254047928ac3ceb1/lttoolbox/fst_processor.cc#L204.

Just don't escape the # and the crash goes away. Or, if this # was escaped by the deformatter, we can add # to the escaped chars list. Or, since the deformatter is being remade, we can just stop escaping it there.
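
To make the strict behaviour described above concrete, here is a minimal sketch of a blank reader that rejects any escaped character not on a fixed list. It is not the actual lttoolbox code: `read_blank_strict` and `kEscapedChars` are hypothetical names, the contents of the set are an assumption for illustration, and the real logic is in the fst_processor.cc lines linked above.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Illustrative list only (an assumption); the real one is built in fst_processor.cc.
static const std::set<char> kEscapedChars = {
    '[', ']', '{', '}', '^', '$', '/', '\\', '@', '<', '>'};

// Hypothetical strict reader: copies the superblank that starts at in[pos]
// (which must be '[') up to and including its closing ']'.
// Throws if a backslash is followed by a character not in kEscapedChars.
std::string read_blank_strict(const std::string& in, std::size_t pos) {
    if (pos >= in.size() || in[pos] != '[') {
        throw std::runtime_error("Malformed input stream");
    }
    std::string out;
    out += in[pos++];  // copy the opening '['
    while (pos < in.size()) {
        char c = in[pos++];
        out += c;
        if (c == '\\') {
            // Strict check: the escaped character must be on the list.
            if (pos >= in.size() || kEscapedChars.count(in[pos]) == 0) {
                throw std::runtime_error("Malformed input stream");
            }
            out += in[pos++];
        } else if (c == ']') {
            return out;  // unescaped ']' closes the blank
        }
    }
    throw std::runtime_error("Malformed input stream");  // no closing ']'
}
```

Under these assumptions, a blank containing `\^` is accepted, while one containing `\#` throws, which mirrors the "Malformed input stream" error in the report.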

TinoDidriksen commented 4 years ago

I would say that's a bug. That's an incorrect way to implement reading escaped input. Too much escaping must not be an error or even a warning.

khannatanmai commented 4 years ago

I agree. I don't know why it has to be strict with escaping. It's not even consistent with most of the modules. Although with these things I just assume whoever coded it had a use case in their mind so maybe they can tell us.

unhammer commented 4 years ago

> I would say that's a bug. That's an incorrect way to implement reading escaped input. Too much escaping must not be an error or even a warning.

+1

khannatanmai commented 4 years ago

I checked the code and the variable escaped_chars is being used a lot throughout the code for parsing and a lot of other stuff. Might not be trivial to fix.

TinoDidriksen commented 4 years ago

Using it for which characters to escape in output is fine. Using it for which to unescape in input is not. Can't be that hard to change all the input places.
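
A rough sketch of that split, assuming a fixed list is kept for output only: on input, a backslash makes the next character literal whatever it is; on output, only characters on the list get a backslash. The names `unescape_any`, `escape_for_output` and `kEscapeOnOutput` are hypothetical and the list contents are illustrative, not taken from lttoolbox.

```cpp
#include <set>
#include <string>

// Illustrative output list only (an assumption, not the lttoolbox list).
static const std::set<char> kEscapeOnOutput = {
    '[', ']', '{', '}', '^', '$', '/', '\\', '@', '<', '>'};

// Input side: accept a backslash before ANY character.
// A lone trailing backslash is kept as-is.
std::string unescape_any(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            ++i;  // drop the backslash, keep the character after it
        }
        out += in[i];
    }
    return out;
}

// Output side: escape only the characters on the fixed list.
std::string escape_for_output(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (kEscapeOnOutput.count(c) != 0) {
            out += '\\';
        }
        out += c;
    }
    return out;
}
```

With this shape, over-escaped input such as `\#` is read as a plain `#` without an error, and is written back out unescaped because `#` is not on the output list.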

khannatanmai commented 4 years ago

Hmm so build up the list of escaped_chars as the input is being read?