divvun / libdivvun

lib for running gramcheck and other pipelines + cli; modules for CG→spelling, CG→feedback, tagging blanks
https://giellalt.github.io/proof/gramcheck/GrammarCheckerDocumentation.html
GNU General Public License v3.0

Weights from the CG speller are integers, should be float #1

Closed by snomos 6 years ago

snomos commented 6 years ago

With the following command:

$ echo "Nu go mii dieehttit de lea guovddáš eiseváldit, SND čađa, vuoruhan doarjaga fatnasiidda mat leat stuorát go 15 mehtar." | hfst-tokenise --giella-cg tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g valency.bin | vislcg3 -g mwe-dis.bin | cg-mwesplit | divvun-cgspell -a se.zhfst

one gets output like this:

"<Nu>"
    "nu" Adv <W:0.0000000000>
    "nu" Pcle <W:0.0000000000>
: 
"<go>"
    "go" CS <W:0.0000000000>
    "go" Pcle Qst <W:0.0000000000>
: 
"<mii>"
    "mii" Pron Indef Sg Nom <W:0.0000000000>
    "mii" Pron Interr Sg Nom <W:0.0000000000>
    "mii" Pron Rel Sg Nom <W:0.0000000000>
    "mun" Pron Pers Pl1 Nom <W:0.0000000000>
: 
"<dieehttit>"
    "dieehttit" ?
    "diehtit" V Ind Prs Pl1 <W:15848> <WA:8848> <spelled> "<diehtit>"
    "diehtit" V Inf <W:15848> <WA:8848> <spelled> "<diehtit>"
    "diehtit" V Imprt Pl2 <W:23203> <WA:13203> <spelled> "<diehttit>"
    "diehtti" N NomAg Pl Nom <W:23203> <WA:13203> <spelled> "<diehttit>"
    "diehtit" V Der/NomAg N Pl Nom <W:23203> <WA:15203> <spelled> "<diehttit>"
    "diehtit" V Imprt Du2 <W:35301> <WA:15301> <spelled> "<diehtti>"
    "diehtti" N NomAg Sg Acc <W:35301> <WA:15301> <spelled> "<diehtti>"
    "diehtti" N NomAg Sg Nom <W:35301> <WA:15301> <spelled> "<diehtti>"
    "diehtti" N NomAg Sg Gen <W:35301> <WA:15301> <spelled> "<diehtti>"

Compare this with the weights from the following command:

$ echo dieehttit | hfst-ospell -S ../spellcheckers/fstbased/desktop/hfst/se.zhfst
"dieehttit" is NOT in the lexicon:
Corrections for "dieehttit":
diehtit    15.848633
diehttit    23.203125
diehtti    35.301758
diehttis    35.301758
diehttut    35.301758
hiehttit    35.301758

That is, it seems that the weights from the CG speller have been multiplied by 1000 and then truncated to make them integers (for example, 15.848633 becomes 15848). This used to be the case for the --giella-cg mode in hfst-tokenise as well, but now that CG supports floats in weights, there is no reason for it. To get consistent processing, it is important to keep the weights from the different processing steps on the same scale.
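To make the comparison concrete, here is a small Python sketch of the suspected conversion. It is not taken from the divvun-cgspell source. The helper names `scale_and_truncate` and `format_cg_weight` are made up for illustration, and the ten-decimal tag format is only an assumption modelled on the `<W:0.0000000000>` tags in the output above.

```python
import math

# Weights reported by hfst-ospell for "dieehttit" (copied from the listing above).
ospell_weights = {
    "diehtit": 15.848633,
    "diehttit": 23.203125,
    "diehtti": 35.301758,
}


def scale_and_truncate(weight: float, factor: int = 1000) -> int:
    """Hypothetical reconstruction: multiply by `factor`, then drop the fraction."""
    return math.trunc(weight * factor)


def format_cg_weight(weight: float) -> str:
    """Assumed float tag format, mirroring the <W:0.0000000000> tags above."""
    return f"<W:{weight:.10f}>"


# Integer weights as they would come out of the suspected conversion.
cgspell_weights = {form: scale_and_truncate(w) for form, w in ospell_weights.items()}

for form, weight in ospell_weights.items():
    print(form, cgspell_weights[form], format_cg_weight(weight))
```

The truncated values match the `<W:...>` tags that divvun-cgspell prints for these three suggestions (15848, 23203, 35301), which supports the multiply-by-1000 reading. Emitting the untruncated float instead would keep the speller weights on the same scale as those from hfst-tokenise.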

unhammer commented 6 years ago

Oh right, forgot that vislcg3 now supports floats :-)