Closed AMR-KELEG closed 5 years ago
This looks like a major change to tokenisation. Has this been tested with several language pairs to ensure no regressions?
On Sat, 11 May 2019 at 09:02, Tino Didriksen wrote: … Merged #52 into master.
I haven't tested it on large language-pairs. Do you have recommendations for doing so? What should the input and the expected output be?
I just noted the build checks passed - that's good enough for me to merge it. If this breaks downstream pairs, revert and add a relevant build test.
I haven't seen any regressions in nno-nob at least (240k lines passed without changes to output). I suppose it'll affect pairs with missing <alphabet> members more (in which case it's probably a change we want).
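For anyone wanting to repeat that kind of check on another pair, a rough sketch of the comparison might look like the following. It assumes two lt-proc builds (one from before the change, one from after) that read text on stdin, as in the invocations elsewhere in this thread; the function names, paths and file names are made up for illustration and are not part of lttoolbox.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def run_lt_proc(binary: str, dictionary: str, text: str) -> str:
    """Pipe `text` into `<binary> <dictionary>` and return its stdout."""
    result = subprocess.run(
        [binary, dictionary],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_regressions(
    lines: Iterable[str],
    old: Callable[[str], str],
    new: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (input, old output, new output) for every line whose output changed."""
    changed = []
    for line in lines:
        before, after = old(line), new(line)
        if before != after:
            changed.append((line, before, after))
    return changed


# Hypothetical usage; both paths and the .bin name are placeholders:
# old = lambda s: run_lt_proc("/path/to/old/lt-proc", "pair.automorf.bin", s)
# new = lambda s: run_lt_proc("/path/to/new/lt-proc", "pair.automorf.bin", s)
# with open("corpus.txt", encoding="utf-8") as corpus:
#     for line, before, after in find_regressions(corpus, old, new):
#         print(line.rstrip(), before.rstrip(), after.rstrip(), sep="\n  ")
```

An empty result means no line changed, which is what the nno-nob run above amounts to. Spawning one process per line is slow on a large corpus, so in practice it is probably better to run each binary once over the whole file and diff the two outputs.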
The reason I asked is that it takes away some freedom in defining tokenisation: e.g. with an empty alphabet you could define a very stupid tokeniser for languages without spaces (Thai):
$ echo nullein | ~/src/ap/lttoolbox/lttoolbox/lt-proc /tmp/foo.bin # before this commit
^null/null<det><qnt><un><pl>$^ein/ein<det><qnt><m><sg>$
$ echo nullein|/usr/bin/lt-proc /tmp/foo.bin # after this commit
^nullein/*nullein$
(i.e. it could analyse with no spaces between LUs even though they're in a type="standard" section), but I don't think anyone's seriously doing that, since LRLM (left-to-right longest match) fails on anything non-trivial. I guess if there actually are breakages people will complain :)
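To make the before/after difference concrete, here is a toy model of the behaviour. It is not lttoolbox's actual code, just a simplified left-to-right longest-match loop in which a dictionary match is only accepted if the next character is not a word character. The two lexicon entries are the ones from the example above; the function names and the `alnum_is_alphabetic` switch are invented for the sketch.

```python
# Toy lexicon: surface form -> analysis (entries taken from the example above).
LEXICON = {
    "null": "null<det><qnt><un><pl>",
    "ein": "ein<det><qnt><m><sg>",
}


def is_word_char(ch: str, alphabet: str, alnum_is_alphabetic: bool) -> bool:
    """A character is a word character if it is in the declared alphabet or,
    when the switch is on, if it is alphanumeric."""
    return ch in alphabet or (alnum_is_alphabetic and ch.isalnum())


def analyse(text: str, lexicon: dict, alphabet: str = "",
            alnum_is_alphabetic: bool = False) -> str:
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; accept it only at a word boundary.
        end = None
        for j in range(len(text), i, -1):
            at_boundary = j == len(text) or not is_word_char(
                text[j], alphabet, alnum_is_alphabetic)
            if text[i:j] in lexicon and at_boundary:
                end = j
                break
        if end is not None:
            form = text[i:end]
            out.append(f"^{form}/{lexicon[form]}$")
            i = end
        elif is_word_char(text[i], alphabet, alnum_is_alphabetic):
            # Unknown word: consume the whole run of word characters.
            j = i
            while j < len(text) and is_word_char(
                    text[j], alphabet, alnum_is_alphabetic):
                j += 1
            out.append(f"^{text[i:j]}/*{text[i:j]}$")
            i = j
        else:
            # Anything else (spaces, punctuation) is passed through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(analyse("nullein", LEXICON))                            # empty alphabet, old behaviour
print(analyse("nullein", LEXICON, alnum_is_alphabetic=True))  # alphanumerics count as word characters
```

With an empty alphabet nothing counts as a word character, so "null" is accepted even though "e" follows it. Once alphanumerics count, "null" no longer ends at a boundary and the whole string falls through as one unknown word.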
> I guess if there actually are breakages people will complain :)
And they do! See https://github.com/apertium/lttoolbox/issues/75
Solves #45. Consider alphanumeric characters to be part of the vocabulary.