TinoDidriksen / spellers

Front-ends and packaging scripts for spellers. Git read-only mirror.
GNU General Public License v3.0

Misspellings in all uppercase get unexpected suggestions #11

Closed. snomos closed this issue 8 years ago

snomos commented 8 years ago

Example (none of the suggestions are reasonable corrections):

[Screenshot: skjermbilde 2015-11-19 kl 16 53 43]

Here is the same input with initial cap (second and third suggestions are reasonable corrections):

[Screenshot: skjermbilde 2015-11-25 kl 12 54 24]

And the same input with all lowercase (all suggestions are reasonable corrections):

[Screenshot: skjermbilde 2015-11-25 kl 12 53 23]

It might be that this can all be corrected in the FST by giving higher weights to certain types of compounds. For now, one of the suggestions that appears only for the all-uppercase input is analysed as follows:

$ echo Kant-RV-irgi | hfst-lookup -q build/newspellers/tools/spellcheckers/fstbased/analyser-fstspeller-gt-norm.hfst 
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    17,031221
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    10017,031250

Giving higher weight to +ACR tags should help improve the suggestions. I'll try this first.

snomos commented 8 years ago

Adding more weight did not really change anything. Is there anything that can be done to the case handling algorithm to improve suggestions in these cases?

snomos commented 8 years ago

Given suggestion vs expected suggestion:

$ echo Kant-RV-irgi | hfst-lookup -q fstbased/analyser-fstspeller-gt-norm.hfst 
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    517,031250
Kant-RV-irgi    Kant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom    10517,031250

$ echo kánturvirgi | hfst-lookup -q fstbased/analyser-fstspeller-gt-norm.hfst 
kánturvirgi    kantuvra+N+Cmp#virgi+N+Sg+Nom   15,031221

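Note the weight differences: the given suggestion comes out at 517,031250 and 10517,031250, while the expected suggestion comes out at 15,031221.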
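To make the comparison easier to repeat, here is a minimal Python sketch that parses lookup lines like the ones quoted above and lists each input form by its lowest weight. The sample lines are copied from this thread. The helper names `parse_lookup` and `best_weights` are made up for illustration and are not part of HFST. The sketch assumes that a lower weight means a preferred analysis and that the weight column uses a decimal comma, as in the output shown here.

```python
# Sample lines copied from the hfst-lookup output quoted in this thread.
SAMPLE = """\
Kant-RV-irgi\tKant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom\t517,031250
Kant-RV-irgi\tKant+N+Prop+Sem/Sur+Cmp-#RV+N+ACR+Cmp-#irgi+N+Sg+Nom\t10517,031250
kánturvirgi\tkantuvra+N+Cmp#virgi+N+Sg+Nom\t15,031221
"""


def parse_lookup(text):
    """Return (form, analysis, weight) tuples (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and anything that is not form/analysis/weight
        form, analysis, weight = fields
        # The weights in this thread use a decimal comma; normalise before converting.
        rows.append((form, analysis, float(weight.replace(",", "."))))
    return rows


def best_weights(rows):
    """Map each input form to its lowest weight (assumption: lower is better)."""
    best = {}
    for form, _analysis, weight in rows:
        best[form] = min(weight, best.get(form, weight))
    return best


best = best_weights(parse_lookup(SAMPLE))
for form, weight in sorted(best.items(), key=lambda item: item[1]):
    print(f"{form}\t{weight:.6f}")
```

On the sample above, the expected suggestion sorts first, about 500 weight units ahead of the best analysis for the given suggestion.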

TinoDidriksen commented 8 years ago

Fixed in hfst-ospell r4554. It actually worked as expected if you just asked for suggestions right away, but the caching of non-suggestion lookups broke it.