GrammarSoft / cg3

Tools for the 3rd edition of the Constraint Grammar formalism.
https://visl.sdu.dk/cg3.html
GNU General Public License v3.0

cg-mwesplit adds extra newline #134

Open snomos opened 10 months ago

snomos commented 10 months ago

Compare the following two runs (using giellalt/lang-sme as an example):

echo 'Jođiheaddji guovttosges' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst 
"<Jođiheaddji guovttosges>"
    "ges" Pcle Foc/ges <W:0.0> "<ges>"
        "jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Loc <W:0.0> "<Jođiheaddji guovttos>"
    "ges" Pcle Foc/ges <W:0.0> "<ges>"
        "jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Nom <W:0.0> "<Jođiheaddji guovttos>"
:\n
echo 'Jođiheaddji guovttosges' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | cg-mwesplit
"<Jođiheaddji guovttos>"
    "jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Loc <W:0.0>
    "jođiheaddji guovttos" N Coll Sem/Group_Hum Sg Nom <W:0.0>
"<ges>"
    "ges" Pcle Foc/ges <W:0.0>
:\n

After cg-mwesplit has been applied, the output contains an extra newline after the split cohorts that was not in the input. Do you get the same, @unhammer?

unhammer commented 10 months ago

Yes – this also happens with plain vislcg3 (which typically runs before cg-mwesplit; they use the same underlying CG stream processing code):

$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | grep -c '^$'
0
$ echo 'makkár nu' | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g mwe-dis.cg3 | grep -c '^$'
1
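
For anyone who wants to try the same check without the lang-sme tokeniser at hand, here is a minimal sketch of the blank-line comparison. It makes several assumptions: the sample stream is a hand-written stand-in for real hfst-tokenise output (it reuses the "<ges>" cohort from above), `count_blank` is a hypothetical helper name, and `cat` is only a placeholder for the stage under test (vislcg3 or cg-mwesplit), so the two counts are identical here by construction.

```shell
#!/bin/sh
# count_blank: print the number of empty lines read from stdin.
count_blank() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status to keep scripts using "set -e" alive.
    grep -c '^$' || true
}

# Hand-written stand-in for tokeniser output (an assumption, not real output).
# Inside single quotes the \n stays a literal backslash followed by n.
sample='"<ges>"
	"ges" Pcle Foc/ges <W:0.0>
:\n'

# Count empty lines in the raw stream.
before=$(printf '%s\n' "$sample" | count_blank)

# "cat" is a placeholder; substitute the real stage here,
# e.g. vislcg3 -g mwe-dis.cg3 or cg-mwesplit.
after=$(printf '%s\n' "$sample" | cat | count_blank)

echo "blank lines before: $before, after: $after"
```

With the real stage substituted for `cat`, a difference between the two numbers would indicate that the stage added or removed empty lines in the stream.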

But where does it matter? (Don't all the plugins use the json output format?)

snomos commented 8 months ago

It just feels "dirty": the stream is changed in unintended ways. I also had a use case in mind when I reported this, but that was a long time ago and I have since forgotten it. I will add it if/when I remember what it was.