divvun / libdivvun

lib for running gramcheck and other pipelines + cli; modules for CG→spelling, CG→feedback, tagging blanks
https://giellalt.github.io/proof/gramcheck/GrammarCheckerDocumentation.html
GNU General Public License v3.0

No speller applied to initial upper-case words #33

Closed snomos closed 1 year ago

snomos commented 5 years ago

It looks like the grammar checker skips the speller if the input word starts with a capital:

‘mas Gouvdageainnus eai beasa’ vs ‘mas gouvdageainnus eai beasa’

This seems a bit too restrictive. Could that be changed?

unhammer commented 5 years ago
echo gouvdageainnus | hfst-ospell -l 'acceptor.default.hfst' -m 'errmodel.default.hfst' -S -n 1

"gouvdageainnus" is NOT in the lexicon:
Corrections for "gouvdageainnus":
bruvdageainnus    47.296875

echo Gouvdageainnus | hfst-ospell -l 'acceptor.default.hfst' -m 'errmodel.default.hfst' -S -n 1

"Gouvdageainnus" is NOT in the lexicon:
Corrections for "Gouvdageainnus":
Guovdageainnus    -0.703125

so hfst-ospell gives the expected results, though with differing weights

printf '"<gouvdageainnus>"\n\t"guovdageainnus" ?\n' | divvun-cgspell -n 1 -b 15 -w 5000 -l acceptor.default.hfst -m errmodel.default.hfst

"<gouvdageainnus>"
        "guovdageainnus" ?
        "geaidnu" N Sem/Route Sg Loc <W:47.2969> <WA:27.2969> <spelled> "<bruvdageainnus>"
                "bruvda" N Sem/Dummytag Cmp/SgNom Cmp
        "geainnus" N Sem/Route Sg Nom <W:47.2969> <WA:27.2969> <spelled> "<bruvdageainnus>"
                "bruvda" N Sem/Dummytag Cmp/SgNom Cmp
        "geaidnu" N Sem/Route Sg Gen PxSg3 <W:47.2969> <WA:30.2969> <spelled> "<bruvdageainnus>"
                "bruvda" N Sem/Dummytag Cmp/SgNom Cmp
        "geaidnu" N Sem/Route Sg Acc PxSg3 <W:47.2969> <WA:30.2969> <spelled> "<bruvdageainnus>"
                "bruvda" N Sem/Dummytag Cmp/SgNom Cmp

and for the lowercased form, cgspell gives a suggestion

printf '"<Gouvdageainnus>"\n\t"Guovdageainnus" ?\n' | divvun-cgspell -n 1 -b 15 -w 5000 -l acceptor.default.hfst -m errmodel.default.hfst

"<Gouvdageainnus>"
    "Guovdageainnus" ?

the invocation from smegram.mode gives nothing, but it passes -b 15, and I don't know exactly how that option works; if we change it to -b 20

printf '"<Gouvdageainnus>"\n\t"Guovdageainnus" ?\n' | divvun-cgspell -n 1 -b 20 -w 5000 -l acceptor.default.hfst -m errmodel.default.hfst

"<Gouvdageainnus>"
    "Guovdageainnus" ?
    "Guovdageaidnu" N Prop Sem/Plc Sg Loc <W:-0.703125> <WA:17.2969> <spelled> "<Guovdageainnus>"

it seems to give the expected result – does it run much slower?

snomos commented 5 years ago

The -b option (short for beam) sets a limit on the maximum weight difference between the best and the worst suggestions; the original setting is 15. What I don't get is that the weight -0.703125 from both hfst-ospell and divvun-cgspell is far lower than anything else, and the distance to the next suggestion is much more than 15.

Could it be that there is a bug with the mathematics somewhere, such that negative weights are not properly handled?

Setting -b 20 should be no problem, though.
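
To make the reading of -b described above concrete, here is a minimal sketch of a filter that keeps every suggestion whose weight is at most `beam` above the best one. This is not the hfst-ospell or divvun-cgspell implementation; the function name `within_beam` is hypothetical, and putting the two suggestions into one candidate list is an assumption made only for illustration. The weights are the ones reported by hfst-ospell earlier in this thread.

```python
def within_beam(suggestions, beam):
    """Keep suggestions whose weight is at most `beam` above the best (lowest) weight.

    Hypothetical helper: a sketch of the "max weight difference" reading of -b,
    not the actual library code.
    """
    if not suggestions:
        return []
    best = min(weight for _, weight in suggestions)
    return [(form, weight) for form, weight in suggestions if weight - best <= beam]


# Weights copied from the hfst-ospell runs in this thread; combining them
# into a single candidate list is an assumption for the sake of the example.
candidates = [
    ("Guovdageainnus", -0.703125),
    ("bruvdageainnus", 47.296875),
]

kept = within_beam(candidates, beam=15.0)
print(kept)
```

Under this reading the best suggestion always survives, whether its weight is negative or not, so the missing suggestion with -b 15 would have to come from something else, for instance a different meaning of beam or a problem with negative weights.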

unhammer commented 5 years ago

So not beam as in https://en.wikipedia.org/wiki/Beam_search ? (That would actually explain it)

snomos commented 5 years ago

Not as I have understood it. But the whole beam search option was something added by S Hardwick; you had better ask him for the technical details :)

snomos commented 2 years ago

This is still a problem. Here are some more examples:

echo Servodaas | ./modes/trace-smegramrelease.mode 
"<Servodaas>"
    "Servodaas" N Prop Sem/Plc Sg Loc Guess <LastCohort> <firstCohort> @HNOUN SUBSTITUTE:3417 MAP:23080:hnounAdvl
:\n

echo servodaas | ./modes/trace-smegramrelease.mode 
"<servodaas>"
    "servodaas" ? <LastCohort> <firstCohort> &typo ADD:10126:uncorrected-typos
typo
:\n

Compare with two different spellers, with both initial upper and lower case:

echo Servodaas | divvunspell suggest -a tools/spellcheckers/se-desktop.zhfst 
Reading from stdin...
Input: Servodaas        [INCORRECT]
Servodagas      48.59303
Servvodagas     66.203186
Servotbas       78.3018
Serrodagas      80.3018
Servodatbas     80.3018
Servošabas      80.3018
Servodaga       83.17057
Servodat        84.31137
Servodagat      85.399826
Servobas        92.3018

echo servodaas | divvunspell suggest -a tools/spellcheckers/se-desktop.zhfst 
Reading from stdin...
Input: servodaas        [INCORRECT]
servodagas      33.59303
servvodagas     51.203186
servotbas       63.3018
serrodagas      65.3018
servodatbas     65.3018
servošabas      65.3018
servodaga       68.17057
servodat        69.31137
servodagat      70.399826
servobas        77.3018

echo '5 Servodaas' | hfst-ospell-office tools/spellcheckers/se-desktop.zhfst 
@@ hfst-ospell-office is alive
&   Servvodagas Servodagas  Servodaga   Servodat    Servodagat
echo '5 servodaas' | hfst-ospell-office tools/spellcheckers/se-desktop.zhfst 
@@ hfst-ospell-office is alive
&   servodagas  servvodagas servodaga   servodat    servodagat

That is, the spellers have no problems giving reasonable suggestions, but nothing pops up in the grammar checker.
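
As a side note on the divvunspell listings above, a small sketch can compare the weights of the capitalised and lowercased suggestions. It only parses the first three rows of each listing as pasted in this comment; the helper name `parse_listing` is hypothetical. Whether the offset it reports has anything to do with the -b 15 setting is not established here.

```python
# First three rows of each divvunspell listing, copied from this comment.
upper_listing = """\
Servodagas      48.59303
Servvodagas     66.203186
Servotbas       78.3018
"""
lower_listing = """\
servodagas      33.59303
servvodagas     51.203186
servotbas       63.3018
"""


def parse_listing(listing):
    """Map each lowercased suggestion to its weight (hypothetical helper)."""
    weights = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        form, weight = line.split()
        weights[form.lower()] = float(weight)
    return weights


upper = parse_listing(upper_listing)
lower = parse_listing(lower_listing)

# Weight of the capitalised suggestion minus the lowercased one, per form.
diffs = {form: round(upper[form] - lower[form], 4) for form in lower if form in upper}
for form, diff in diffs.items():
    print(form, diff)
```

For these rows the capitalised suggestions are weighted a constant amount higher than their lowercased counterparts.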

unhammer commented 2 years ago

I can't reproduce – was this fixed in a different issue?

$ echo Servodaas | ./modes/trace-smegramrelease.mode 
"<Servodaas>"
        "servodat" v1 N Sem/Org Sg Loc <W:48.2094> <WA:8.20939> <spelled> "servodagas"S PROTECT:3480 SELECT:3715 &SUGGESTWF &typo ADD:10118:spelled
typo
;       "servodat" v1 N Sem/Org Sg Gen PxSg3 <W:48.2094> <WA:21.2094> <spelled> "servodagas"S PROTECT:3480 SELECT:3715 REMOVE:1296
;       "servodat" v1 N Sem/Org Sg Acc PxSg3 <W:48.2094> <WA:21.2094> <spelled> "servodagas"S PROTECT:3480 SELECT:3715 REMOVE:1296
;       "Servodaas" N Prop Sem/Plc Sg Loc Guess <LastCohort> <firstCohort> SUBSTITUTE:3423 SELECT:3715
:\n
$ echo Servodaas|divvun-checker -l se |jq .
{
  "errs": [
    [
      "Servodaas",
      0,
      9,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Servodagas"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "Servodaas"
}
$ echo mas Gouvdageainnus eai beasa|divvun-checker -l se |jq .
{
  "errs": [
    [
      "Gouvdageainnus",
      4,
      18,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Guovdageainnus",
        "Govdageainnus",
        "Ovdageainnus",
        "Ruvdageainnus",
        "Hoavdageainnus",
        "Bruvdageainnus",
        "Buvdageainnus",
        "Soundageainnus"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "mas Gouvdageainnus eai beasa"
}

snomos commented 1 year ago

This seems to be fixed; I get the same results as you:

echo Servodaas | divvun-checker -a se.zcheck | jq .
{
  "errs": [
    [
      "Servodaas",
      0,
      9,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Servodagas"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "Servodaas"
}

And:

echo mas Gouvdageainnus eai beasa | divvun-checker -a se.zcheck | jq .
{
  "errs": [
    [
      "Gouvdageainnus",
      4,
      18,
      "typo",
      "Ii leat sátnelisttus",
      [
        "Guovdageainnus",
        "Bruvdageainnus",
        "Soundageainnus",
        "Buvdageainnus",
        "Ruvdageainnus",
        "Govdageainnus",
        "Hoavdageainnus",
        "Ovdageainnus"
      ],
      "Čállinmeattáhus"
    ]
  ],
  "text": "mas Gouvdageainnus eai beasa"
}
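
For anyone scripting a regression check on this, here is a minimal sketch that reads a report shaped like the divvun-checker output above. The variable names given to the seven positions in each `errs` entry are my own labels, inferred from the output in this thread rather than taken from documentation, and the suggestion list is shortened to two entries. The offsets are treated as plain string indices, which works for this ASCII-only input but is an assumption in general.

```python
import json

# Shortened copy of the JSON shown above (only two of the suggestions kept).
raw = """
{
  "errs": [
    ["Gouvdageainnus", 4, 18, "typo", "Ii leat sátnelisttus",
     ["Guovdageainnus", "Bruvdageainnus"], "Čállinmeattáhus"]
  ],
  "text": "mas Gouvdageainnus eai beasa"
}
"""

report = json.loads(raw)
text = report["text"]

flagged = []
for form, start, end, err_type, message, suggestions, title in report["errs"]:
    # Check that the offsets really point at the reported word form.
    matches = text[start:end] == form
    flagged.append((form, err_type, suggestions, matches))
    print(form, err_type, suggestions, matches)

# The point of this issue: a capitalised misspelling should get suggestions.
capitalised_with_suggestions = [
    form for form, _, suggestions, _ in flagged if form[:1].isupper() and suggestions
]
print(capitalised_with_suggestions)
```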