TinoDidriksen / spellers

Front-ends and packaging scripts for spellers. Git read-only mirror.
GNU General Public License v3.0

Regression in derived proper nouns #14

Open snomos opened 8 years ago

snomos commented 8 years ago

The North Sami speller for MS Office has a regression: compared to last week, it no longer accepts derived proper nouns with an initial lower-case letter:

[Screenshot: skjermbilde 2015-12-03 kl 15 15 05]

These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).

Because of this difference, I suspect that something in the nightly build environment is causing the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller FSTs should reveal any issues related to the build system. "make check" succeeds on my system and should also succeed on the build system (there are a couple of known failures, but they are properly marked and should not break the testing).

"make check" is only known to pass for SME; I have not tested the other languages yet.

There are a number of other regressions as well, and they all point towards improper handling of flag diacritics. Changes in hfst might have caused these regressions (my hfst installation is from Nov. 27).

snomos commented 8 years ago

I have now confirmed that something in the build setup for the nightly builds is causing the regression. I fed one of the test words through hfst-ospell using a freshly built se.zhfst (built on my own OS X box today):

$ echo skánitlaš | hfst-ospell -S build/newspellers/tools/spellcheckers/fstbased/hfst/se.zhfst 
"skánitlaš" is in the lexicon...

That is, the speller behaves as expected. I then copied the zhfst file from the installed msi package, and used that with the same input:

$ echo skánitlaš | hfst-ospell -S ~/se.zhfst 
"skánitlaš" is NOT in the lexicon:
Corrections for "skánitlaš":
Skánitlaš    25.436646
Skániklaš    35.436646
skánálaš    35.436646
skážirlaš    35.436646
skibitlaš    35.436646
skánjalaš    35.436646
skánalaš    35.436646
s-Skánitlaš    37.436646
Skánitlaš-    20035.437500

To me this looks like a bug in the handling of flag diacritics: the downcasing of derived proper nouns is handled with flags (Skánit (place name) -> skánitlaš (derived general noun, meaning "someone from Skánit")). There are other regressions as well that point in the same direction.

I was using an hfst version from Dec. 4 to build the zhfst file. The regression is older than that, about 10 days old now.
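
The manual comparison above can be repeated for a whole word list. The following is only a sketch: the function names `ospell_status` and `compare_spellers` are made up for illustration, and the parsing assumes that hfst-ospell prints exactly the two message shapes quoted in this comment ("is in the lexicon..." and "is NOT in the lexicon:"). Any other output lines, such as the suggestion list, are ignored.

```shell
#!/bin/sh
# Hypothetical helpers, not part of hfst-ospell or the spellers repository.

# Reduce hfst-ospell output (read from stdin) to one "status word" line
# per checked word. Only the two messages quoted in this thread are matched.
ospell_status() {
    sed -n \
        -e 's/^"\(.*\)" is in the lexicon.*$/accepted \1/p' \
        -e 's/^"\(.*\)" is NOT in the lexicon.*$/rejected \1/p'
}

# Show the words whose status differs between two zhfst archives.
# $1 = reference archive (e.g. a local build)
# $2 = suspect archive (e.g. the one copied out of the msi package)
# stdin = one word per line
compare_spellers() {
    tmp=$(mktemp -d) || return 1
    cat > "$tmp/words"
    hfst-ospell -S "$1" < "$tmp/words" | ospell_status > "$tmp/reference"
    hfst-ospell -S "$2" < "$tmp/words" | ospell_status > "$tmp/suspect"
    # diff exits with 0 when both archives agree on every word.
    diff "$tmp/reference" "$tmp/suspect"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Assuming the test words are stored one per line in a file (here called `words.txt`, a placeholder name), the comparison would be run as `compare_spellers build/newspellers/tools/spellcheckers/fstbased/hfst/se.zhfst ~/se.zhfst < words.txt`. Each line of diff output then marks a word that the two archives treat differently.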

snomos commented 8 years ago

I scanned some chats from last week, and it seems we first identified the issue(s) on December 1. Given that everything else seems to be identical, could it be related to changes in hfst before that day that only affect builds on Windows? As mentioned above, the only thing all the failures have in common is the use of flag diacritics, which has been a trouble area in past hfst versions.

Below is a screenshot that displays a list of words that should be accepted, together with the lexicon version and its build date.

[Screenshot: skjermbilde 2015-12-10 kl 16 49 15]