With the following command:
$ echo "Nu go mii dieehttit de lea guovddáš eiseváldit, SND čađa, vuoruhan doarjaga fatnasiidda mat leat stuorát go 15 mehtar." | hfst-tokenise --giella-cg tokeniser-gramcheck-gt-desc.pmhfst | vislcg3 -g valency.bin | vislcg3 -g mwe-dis.bin | cg-mwesplit | divvun-cgspell -a se.zhfst
one gets output like this:
"<Nu>"
"nu" Adv <W:0.0000000000>
"nu" Pcle <W:0.0000000000>
:
"<go>"
"go" CS <W:0.0000000000>
"go" Pcle Qst <W:0.0000000000>
:
"<mii>"
"mii" Pron Indef Sg Nom <W:0.0000000000>
"mii" Pron Interr Sg Nom <W:0.0000000000>
"mii" Pron Rel Sg Nom <W:0.0000000000>
"mun" Pron Pers Pl1 Nom <W:0.0000000000>
:
"<dieehttit>"
"dieehttit" ?
"diehtit" V Ind Prs Pl1 <W:15848> <WA:8848> <spelled> "<diehtit>"
"diehtit" V Inf <W:15848> <WA:8848> <spelled> "<diehtit>"
"diehtit" V Imprt Pl2 <W:23203> <WA:13203> <spelled> "<diehttit>"
"diehtti" N NomAg Pl Nom <W:23203> <WA:13203> <spelled> "<diehttit>"
"diehtit" V Der/NomAg N Pl Nom <W:23203> <WA:15203> <spelled> "<diehttit>"
"diehtit" V Imprt Du2 <W:35301> <WA:15301> <spelled> "<diehtti>"
"diehtti" N NomAg Sg Acc <W:35301> <WA:15301> <spelled> "<diehtti>"
"diehtti" N NomAg Sg Nom <W:35301> <WA:15301> <spelled> "<diehtti>"
"diehtti" N NomAg Sg Gen <W:35301> <WA:15301> <spelled> "<diehtti>"
Compare this with the weights from the following command:
$ echo dieehttit | hfst-ospell -S ../spellcheckers/fstbased/desktop/hfst/se.zhfst
"dieehttit" is NOT in the lexicon:
Corrections for "dieehttit":
diehtit 15.848633
diehttit 23.203125
diehtti 35.301758
diehttis 35.301758
diehttut 35.301758
hiehttit 35.301758
That is, it seems that the weights from the CG speller have been multiplied by 1000 and then truncated to integers (for example, 15.848633 becomes 15848). This used to be the case for the --giella-cg mode in hfst-tokenise as well, but now that CG supports floats in weights, there is no longer any reason for it. To get consistent processing, it is important to keep the weights from the different processing steps on the same scale.
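The suspected relationship can be checked with a small sketch. It assumes that the scaling is a plain multiplication by 1000 followed by truncation, which is a guess based on the numbers above and not confirmed from the divvun-cgspell source. The function name `scale_and_truncate` and the two dictionaries are hypothetical and used only for illustration.

```python
import math

# Weights reported by hfst-ospell for "dieehttit" (copied from the output above).
ospell_weights = {
    "diehtit": 15.848633,
    "diehttit": 23.203125,
    "diehtti": 35.301758,
}

# <W:...> values reported by divvun-cgspell for the same suggestions.
cgspell_weights = {
    "diehtit": 15848,
    "diehttit": 23203,
    "diehtti": 35301,
}


def scale_and_truncate(weight, factor=1000):
    """Assumed transformation: multiply by `factor`, then drop the fraction."""
    return math.trunc(weight * factor)


# Print each suggestion with the reconstructed and the observed weight.
for form, weight in ospell_weights.items():
    print(form, scale_and_truncate(weight), cgspell_weights[form])
```

For all three suggestions the reconstructed value matches the `<W:...>` value in the CG output, which is consistent with the multiply-and-truncate explanation but does not prove it.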