apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Fix the out of alphabet token handling in analyses generation #52

Closed: AMR-KELEG closed this 5 years ago

AMR-KELEG commented 5 years ago

Solves #45. Consider alphanumeric characters to be part of the vocabulary.

unhammer commented 5 years ago

This looks like a major change to tokenisation. Has this been tested with several language pairs to ensure no regressions?

AMR-KELEG commented 5 years ago

This looks like a major change to tokenisation. Has this been tested with several language pairs to ensure no regressions?
On Sat 11 May 2019, at 09:02, Tino Didriksen wrote: Merged #52 into master.

I haven't tested it on large language-pairs. Do you have recommendations for doing so? What should the input and the expected output be?

TinoDidriksen commented 5 years ago

I just noted the build checks passed - that's good enough for me to merge it. If this breaks downstream pairs, revert and add a relevant build test.

unhammer commented 5 years ago

I haven't seen any regressions in nno-nob at least (240k lines passed without changes to output). I suppose it'll affect pairs with missing <alphabet> members more (in which case it's probably a change we want).
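For anyone wanting to repeat that kind of check on another pair, here is a minimal sketch of a before/after comparison. It assumes two lt-proc builds (one from before this commit, one from after) and a compiled analyser .bin; the helper names and the paths in the usage comment are hypothetical. The only thing taken from this thread is that lt-proc takes the .bin as an argument and reads text on stdin.

```python
import subprocess
from itertools import zip_longest


def run_analyser(command, corpus_text):
    """Feed corpus_text to `command` on stdin and return its stdout as lines."""
    result = subprocess.run(
        command, input=corpus_text, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def changed_lines(old_output, new_output):
    """Return (line_number, old, new) for every line that differs.

    A missing line on either side is compared against the empty string.
    """
    pairs = zip_longest(old_output, new_output, fillvalue="")
    return [
        (number, old, new)
        for number, (old, new) in enumerate(pairs, start=1)
        if old != new
    ]


# Hypothetical usage (paths are placeholders, not real locations):
#   corpus = open("corpus.txt", encoding="utf-8").read()
#   old = run_analyser(["/path/to/old/lt-proc", "analyser.bin"], corpus)
#   new = run_analyser(["/path/to/new/lt-proc", "analyser.bin"], corpus)
#   for number, before, after in changed_lines(old, new):
#       print(number, before, after, sep="\n  ")
```

An empty result from changed_lines means the corpus analysed identically with both builds, which is what "passed without changes to output" amounts to here.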


The reason I asked is that it takes away some freedom in defining tokenisation: e.g. with an empty alphabet, you could define a very stupid tokeniser for languages written without spaces (such as Thai):

$ echo nullein | ~/src/ap/lttoolbox/lttoolbox/lt-proc /tmp/foo.bin # before this commit
^null/null<det><qnt><un><pl>$^ein/ein<det><qnt><m><sg>$

$ echo nullein | /usr/bin/lt-proc /tmp/foo.bin # after this commit
^nullein/*nullein$

(i.e. it could analyse input with no spaces between LUs even though they're in a type="standard" section), but I don't think anyone's seriously doing that, since LRLM (left-to-right longest match) fails on anything non-trivial. I guess if there actually are breakages people will complain :)
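The before/after difference above can be reproduced with a toy model of left-to-right longest match. This is only a sketch: the real lt-proc walks a compiled transducer, not a Python dict, and the names analyse, LEXICON and is_word_char are made up for illustration. The assumed rule is that a dictionary match is accepted only if the character after it is not a word character, so an empty alphabet lets matches end anywhere, while treating alphanumerics as word characters forces them to end at a boundary.

```python
# Toy lexicon using the two entries from the example above.
LEXICON = {
    "null": "null<det><qnt><un><pl>",
    "ein": "ein<det><qnt><m><sg>",
}


def analyse(text, lexicon, is_word_char):
    """Toy left-to-right longest-match analyser (not lttoolbox's algorithm)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            out.append(text[i])  # pass blanks through unchanged
            i += 1
            continue
        # Try the longest candidate first; accept it only at a boundary.
        end = None
        for j in range(n, i, -1):
            at_boundary = j == n or not is_word_char(text[j])
            if text[i:j] in lexicon and at_boundary:
                end = j
                break
        if end is not None:
            surface = text[i:end]
            out.append(f"^{surface}/{lexicon[surface]}$")
        else:
            # Unknown token: a whole run of word characters, else one character.
            end = i + 1
            if is_word_char(text[i]):
                while end < n and is_word_char(text[end]):
                    end += 1
            out.append(f"^{text[i:end]}/*{text[i:end]}$")
        i = end
    return "".join(out)


# Empty alphabet: no character counts as a word character.
print(analyse("nullein", LEXICON, lambda c: False))
# Alphanumerics count as word characters.
print(analyse("nullein", LEXICON, str.isalnum))
```

With the empty alphabet the toy splits the input into null and ein; with alphanumerics as word characters it emits a single unknown token, mirroring the two lt-proc outputs shown above.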

unhammer commented 5 years ago

I guess if there actually are breakages people will complain :)

And they do! See https://github.com/apertium/lttoolbox/issues/75