divvun / libdivvun

lib for running gramcheck and other pipelines + cli; modules for CG→spelling, CG→feedback, tagging blanks
https://giellalt.github.io/proof/gramcheck/GrammarCheckerDocumentation.html
GNU General Public License v3.0

divvun-blanktag needs symbols for end-of-stream and start-of-stream #12

Closed snomos closed 5 years ago

snomos commented 6 years ago

The following command:

echo 'Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.    Romsa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat.' | divvun-checker -a tools/grammarcheckers/se.zcheck -n smegram

gives the following JSON output:

{"errs":[["Romsa",98,103,"double-space-before","[SE] Double space",[]],["2012s",169,174,"typo","Čállinmeattáhus",["2012:s"]],[".",197,198,"no-space-after-punct-mark","[SE] Missing space",[]]],"text":"Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.    Romsa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat."}
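
To see which stretch of the input each error points at, the output can be read back with a few lines of Python. This is only a sketch: the field order (form, start, end, error id, message, suggestions) is inferred from the output above rather than from documented API, and the offsets are assumed to be character offsets into `text`. The helper name `error_spans` is made up for this illustration.

```python
import json

# Output of the divvun-checker command above, pasted verbatim.
raw = '{"errs":[["Romsa",98,103,"double-space-before","[SE] Double space",[]],["2012s",169,174,"typo","Čállinmeattáhus",["2012:s"]],[".",197,198,"no-space-after-punct-mark","[SE] Missing space",[]]],"text":"Tabealla 1 čájeha ahte geatkemáddodat lea mealgat badjel máddodatmeari guovlluin 5, 6, 7 ja 8.    Romsa fylkkas ja Finnmárkkus leat measta golmma geardde eanet geatkkit 2012s go máddodatmearis leat."}'


def error_spans(checker_json):
    """Return (error id, flagged substring, form reported by the checker) per error."""
    result = json.loads(checker_json)
    text = result["text"]
    spans = []
    # Assumed field order: form, start, end, error id, message, suggestions.
    for form, start, end, err_id, _msg, _suggestions in result["errs"]:
        spans.append((err_id, text[start:end], form))
    return spans


for err_id, flagged, form in error_spans(raw):
    print(f"{err_id}: {flagged!r} (reported form {form!r})")
```

The last entry covers offsets 197 to 198, which is the final character of the input, so there is no following text that could be missing a space.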

Note the error message for the final full stop. The regex that triggers this error is:

[ ?*        {"<,>"}          ]:[ "<NoSpaceAfterPunctMark>"]

in the file sme/tools/grammarcheckers/analyser-gt-whitespace.regex. The regex works fine in all other cases. How can we keep it from matching end-of-paragraph full stops?

unhammer commented 6 years ago

[ ?*        {"<,>"}          ]:[ "<NoSpaceAfterPunctMark>"]

should only match commas?

But I see the issue. This may require a change to divvun-blanktag itself, perhaps a special symbol for end-of-stream (not just end-of-paragraph – that would give a :\n or :</p> or similar).
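
A rough Python sketch of why an end-of-stream symbol would help. It assumes that a blank-tagging rule is matched against the concatenation of the preceding blank, the word form and the following blank, and that the rule that fired is the full-stop counterpart of the comma rule quoted above. The `<EOS>` symbol, the `lookup_string` helper and the regex translation are all made up for illustration; none of this is existing divvun-blanktag behaviour.

```python
import re

EOS = "<EOS>"  # hypothetical end-of-stream symbol, not an existing feature

# Rough Python analogue of: [ ?* {"<.>"} ]:[ "<NoSpaceAfterPunctMark>" ]
# i.e. any preceding blank, the word form, and an empty following blank.
NO_SPACE_AFTER = re.compile(r'.*"<\.>"', re.DOTALL)


def lookup_string(pre_blank, wordform, post_blank, last_in_stream=False, mark_eos=False):
    """Concatenate blank + word form + blank, optionally appending the sentinel."""
    s = f'{pre_blank}"<{wordform}>"{post_blank}'
    if last_in_stream and mark_eos:
        s += EOS
    return s


def no_space_after(s):
    """True if the whole lookup string matches the no-space-after pattern."""
    return NO_SPACE_AFTER.fullmatch(s) is not None


# Full stop directly followed by the next word: a genuine error.
print(no_space_after(lookup_string("", ".", "")))
# Full stop followed by a space: no error.
print(no_space_after(lookup_string("", ".", " ")))
# Final full stop without a sentinel: indistinguishable from the genuine error.
print(no_space_after(lookup_string("", ".", "", last_in_stream=True)))
# Final full stop with the sentinel appended: the pattern no longer matches.
print(no_space_after(lookup_string("", ".", "", last_in_stream=True, mark_eos=True)))
```

With a sentinel in place, rules that should also fire at the end of the stream would have to mention the symbol explicitly, which is the price of being able to tell the two cases apart.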

snomos commented 5 years ago

This is also true for the beginning of the stream, cf. the following:

echo 'ja (Lauvås/Handal, s. 159)' | tools/grammarcheckers/modes/smegramrelease.mode
"<ja>"
    "ja" CC <W:0.0> @CVP
: 
"<(>"
    "(" PUNCT LEFT <W:0.0>
"<Lauvås>"
    "Lauvås" N <NomGenSg> Prop Sem/Sur Sg Nom <W:0.0> @HNOUN
"</>"
    "/" PUNCT <W:0.0>
"<Handal>"
    "Handal" N <NomGenSg> Prop Sem/Sur Sg Nom <W:0.0> @<SPRED
"<,>"
    "," CLB <W:0.0>
: 
"<s.>"
    "s" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> @HNOUN
: 
"<159>"
    "159" Num Arab Sg Nom <W:0.0> @N<
"<)>"
    ")" PUNCT RIGHT <W:0.0> <LastCohortOfParagraph>
:\n

Expected: the first cohort should have the tag <firstWordOfParagraph>, cf. the following regex in analyser-gt-whitespace.regex:

[ {\n} ?* {"<} ?* {>"}  ?*      ]:[ "<firstWordOfParagraph>"  ]
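
The same reasoning can be sketched in Python for the start of the stream. The rule above requires the lookup string to begin with a newline, but the first cohort of the stream has no preceding blank at all. The sketch assumes that rules are matched against preceding blank + word form + following blank. The `<BOS>` symbol and the second pattern are hypothetical, shown only to illustrate what a start-of-stream symbol would make expressible.

```python
import re

BOS = "<BOS>"  # hypothetical start-of-stream symbol, not an existing feature

# Rough analogue of: [ {\n} ?* {"<} ?* {>"} ?* ]:[ "<firstWordOfParagraph>" ]
FIRST_WORD = re.compile(r'\n.*"<.*>".*', re.DOTALL)

# Hypothetical variant that accepts either a leading newline or the sentinel.
FIRST_WORD_OR_BOS = re.compile(
    r'(?:\n|' + re.escape(BOS) + r').*"<.*>".*', re.DOTALL
)


def lookup_string(pre_blank, wordform, post_blank, first_in_stream=False):
    """Concatenate blank + word form + blank, optionally prefixing the sentinel."""
    prefix = BOS if first_in_stream else ""
    return f'{prefix}{pre_blank}"<{wordform}>"{post_blank}'


# First cohort of the stream: nothing precedes "ja", so the current rule fails.
print(FIRST_WORD.fullmatch(lookup_string("", "ja", " ")) is not None)
# A cohort preceded by a newline is tagged as expected.
print(FIRST_WORD.fullmatch(lookup_string("\n", "ja", " ")) is not None)
# With the sentinel prefixed, the variant rule matches the first cohort too.
print(FIRST_WORD_OR_BOS.fullmatch(lookup_string("", "ja", " ", first_in_stream=True)) is not None)
# A mid-sentence cohort such as "s." stays untagged.
print(FIRST_WORD_OR_BOS.fullmatch(lookup_string(" ", "s.", " ")) is not None)
```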