Open snomos opened 8 years ago
I have now confirmed that something in the build setup for the nightly builds is causing the regression. I fed one of the test words through hfst-ospell using a freshly built se.zhfst (built on my own OSX box today):
$ echo skánitlaš | hfst-ospell -S build/newspellers/tools/spellcheckers/fstbased/hfst/se.zhfst
"skánitlaš" is in the lexicon...
That is, the speller behaves as expected. I then copied the zhfst file from the installed msi package, and used that with the same input:
$ echo skánitlaš | hfst-ospell -S ~/se.zhfst
"skánitlaš" is NOT in the lexicon:
Corrections for "skánitlaš":
Skánitlaš 25.436646
Skániklaš 35.436646
skánálaš 35.436646
skážirlaš 35.436646
skibitlaš 35.436646
skánjalaš 35.436646
skánalaš 35.436646
s-Skánitlaš 37.436646
Skánitlaš- 20035.437500
To me this looks like a bug in the handling of flag diacritics - the downcasing of derived proper nouns is handled with flags (Skánit (place name) -> skánitlaš (derived general noun, meaning "someone from Skánit")). There are other regressions as well that point in the same direction.
I was using an hfst version from Dec. 4 to build the zhfst file. The regression is older than that, about 10 days old now.
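The manual comparison above (same word, two zhfst files) could be repeated over the whole test list with a small script. This is only a minimal sketch: it assumes hfst-ospell is on the PATH and prints the two status lines exactly as in the runs above. The helper names (`parse_ospell_output`, `check_words`, `regressions`) are made up for illustration.

```python
import re
import subprocess

# Matches the two status lines seen in the hfst-ospell runs above:
#   "word" is in the lexicon...
#   "word" is NOT in the lexicon:
STATUS_RE = re.compile(r'^"(?P<word>.+)" is (?P<neg>NOT )?in the lexicon')


def parse_ospell_output(text):
    """Map each checked word to True (accepted) or False (rejected)."""
    verdicts = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            verdicts[match.group("word")] = match.group("neg") is None
    return verdicts


def check_words(zhfst_path, words):
    """Feed words to `hfst-ospell -S <zhfst>` and parse the verdicts.

    Assumes hfst-ospell is installed and reads one word per line on stdin.
    """
    result = subprocess.run(
        ["hfst-ospell", "-S", zhfst_path],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return parse_ospell_output(result.stdout)


def regressions(local_verdicts, installed_verdicts):
    """Words accepted by the local build but rejected by the installed one."""
    return sorted(
        word
        for word, accepted in local_verdicts.items()
        if accepted and installed_verdicts.get(word) is False
    )
```

Calling `check_words` once with the locally built se.zhfst and once with the copy taken from the msi package, then passing both results to `regressions`, would list every word that only the installed speller rejects.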
I scanned some chats from last week, and it seems we first identified the issue(s) on December 1. Given that everything else seems to be identical, could it be related to changes in hfst before that day that only affect builds on Windows? And as mentioned above, the only thing common to all the failures is the use of flag diacritics, which has been a trouble area in past hfst versions.
Below is a screenshot that displays a list of words that should be accepted, together with the lexicon version and its build date.
The North Sami speller for MS Office has a regression: compared to last week, it no longer accepts derived proper nouns with an initial lower-case letter:
These are accepted by the command line speller (hfst-ospell -S se.zhfst), but not by the MS Office speller (*.msi package).
Because of this difference, I suspect that something in the nightly build environment is causing the issue. I have updated our test files with test cases for these words, and running "make check" on the built speller FSTs should reveal issues related to the build system, if any. "make check" succeeds on my system, and should also succeed on the build system (there are a couple of known fails, but they are properly marked, so they should not break the testing).
"make check" is only known to pass for SME; I have not tested the other languages yet.
There are a number of other regressions as well, and they all point in the direction of (im)proper handling of flag diacritics. It might be changes in hfst that have caused these regressions (my hfst installation is from Nov. 27).
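To illustrate why marked known fails do not break "make check" while a new regression does, here is a minimal sketch of the idea. It is not the actual test harness: the `Case` type, the `known_fail` marker and the placeholder word are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Case(NamedTuple):
    """One word that the speller is expected to accept."""
    word: str
    known_fail: bool = False  # hypothetical marker for an expected failure


def failing_cases(cases: List[Case], accepted: Dict[str, bool]) -> List[str]:
    """Return the rejected words that are NOT marked as known fails."""
    return [
        case.word
        for case in cases
        if not accepted.get(case.word, False) and not case.known_fail
    ]


def suite_passes(cases: List[Case], accepted: Dict[str, bool]) -> bool:
    """The run passes when every rejection is a marked known fail."""
    return not failing_cases(cases, accepted)
```

With this logic, a rejected skánitlaš fails the run unless it is explicitly marked, which is what should make the regression visible on the build system.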