[X] Ambiguous multiword expressions with ambiguous tokenisation
Seems to work – represented within lexc now; hfst-tokenise also
supports forms on the analyses now
[X] Ambiguous multiword expressions need reorganising after CG
The module cg-mwesplit takes wordforms from readings and turns them into
new cohorts
[X] Unknown words
The set-difference method only works for words without
flag diacritics (even though we should be working only on the form-side?)
and leads to binary blow-up: With only lower unknowns, we get 45M;
lower+upper gives 67M, while no unknowns gives 27M
Fixed instead by treating empty analyses as unknown-tokens in
hfst-tokenise, and outputting unmatched strings with a prefix
[ ] Treat input that's within superblanks as unmatched
probably requires a change in hfst-tokenise itself
[X] Try >1 space for ambiguous MWEs? – represented within lexc now
[ ] Try set-difference-unknowns method with regular hfst commands?
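The fix described under "Unknown words" (treat empty analyses as unknown tokens and output unmatched strings with a prefix) can be illustrated with a toy sketch. This is not hfst-tokenise's actual algorithm or output format: the `tokenise` function, the `?` prefix, the dictionary-based lexicon and the analysis tags are all assumptions made up for illustration.

```python
# Toy illustration only; hfst-tokenise works on transducers, not dicts.
UNKNOWN_PREFIX = "?"  # hypothetical marker for unmatched strings


def tokenise(text, lexicon, max_words=3):
    """Longest-match tokenisation over whitespace-separated words.

    A form whose analysis list is empty (or missing) is treated as
    unknown and emitted with UNKNOWN_PREFIX instead of being dropped.
    """
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so multiword expressions win.
        for n in range(min(max_words, len(words) - i), 0, -1):
            form = " ".join(words[i:i + n])
            analyses = lexicon.get(form, [])
            if analyses:
                out.append((form, analyses))
                i += n
                break
        else:
            # No candidate had a non-empty analysis: unknown token.
            out.append((UNKNOWN_PREFIX + words[i], []))
            i += 1
    return out


# Hypothetical lexicon; "blarg" is present but has an empty analysis.
LEXICON = {
    "in": ["in+Pr"],
    "of": ["of+Pr"],
    "in spite of": ["in spite of+Pr"],
    "blarg": [],
}
```

With this lexicon, `tokenise("in spite of blarg xyzzy", LEXICON)` keeps the multiword expression as one token and marks both `blarg` (empty analysis) and `xyzzy` (no entry) as unknown, without needing a set-difference transducer for unknowns.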
Moved here from top of gramcheck tokeniser header.
Issues:
@unhammer, @lynnda-hill - FYI