Open snomos opened 10 months ago
Yes – this also happens with plain vislcg3
(which typically runs before cg-mwesplit; they use the same underlying CG stream processing code):
$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | grep -c '^$'
0
$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g mwe-dis.cg3 | grep -c '^$'
1
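For a repeatable check, something along these lines might help. It is only a sketch: `count_blank` is a hypothetical helper name, the sample stream is made-up stand-in data rather than real analyser output, and `cat` stands in for whichever pipeline stage is under test (for example the vislcg3 call above).

```shell
# Hypothetical helper: count the empty lines arriving on stdin.
# grep -c exits non-zero when the count is 0, so mask the exit status.
count_blank() {
  grep -c '^$' || true
}

# Made-up stand-in for a CG stream: two cohorts, no empty lines.
sample() {
  printf '"<makkár>"\n\t"makkár" Pron\n"<nu>"\n\t"nu" Adv\n'
}

before=$(sample | count_blank)
# `cat` is an identity filter; replace it with the stage being tested.
after=$(sample | cat | count_blank)

echo "before=$before after=$after"
```

If the two counts differ once the real stage is substituted for `cat`, that stage has added or removed empty lines in the stream.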
But where does it matter? (Don't all the plugins use the JSON output format?)
It just feels "dirty": the stream is changed in unintended ways. There was also a use case I had in mind when I reported this, but that was a long time ago, and I have since forgotten it. I will add it if/when I remember what it was.
Cf. the following (using giellalt/lang-sme as an example):
After cg-mwesplit has been applied, there is an extra newline after the split cohorts that was not there in the input. Do you get the same, @unhammer?