Trondtr closed this issue 2 years ago
It seems like a bug I worked on with sme, but maybe it is not relevant to web? The generator to use is probably ../../src/generator-gramcheck-gt-norm.hfst; I tested the word like so:
$ hfst-invert generator-gramcheck-gt-norm.hfst -o foo
$ hfst-lookup foo
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> muotâsäänih
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+v1+N+Sem/Cat+Pl+Nom 0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+v1+N+Pl+Nom 0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+N+Sem/Cat+Pl+Nom 0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+N+Pl+Nom 0,000000
muotâsäänih muotâ+N+Cmp#sääni+v1+N+Sem/Cat+Pl+Nom 0,000000
muotâsäänih muotâ+N+Cmp#sääni+v1+N+Pl+Nom 0,000000
muotâsäänih muotâ+N+Cmp#sääni+N+Sem/Cat+Pl+Nom 0,000000
muotâsäänih muotâ+N+Cmp#sääni+N+Pl+Nom 0,000000
there's some +Cmp/filtering before the
The analyser I used here was only meant to demonstrate that (some) fst was able to generate it. The grc pipeline itself is the standard one. Your test goes the wrong way, btw. The input I pasted in contained only new-style compounds over two lines, but at some point I got the compounding in one line. Now it only is:
Lexicalised compound, everything works:
"árvusääni" v1 N <smn> <smn> Sem/Sign Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
árvusääni+v1+N+Pl+Nom árvusäänih
(...)
Non-lexicalised compounds are distributed over several lines and do not work:
"<muotâsaanijd>"
"sääni" v1 N <smn> <smn> Sem/Cat Pl Ill <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl
"muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
"sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
"muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom ?
"sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2621:msyn-extsubj-ill-nom
"muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
"sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom COPY:2621:msyn-extsubj-ill-nom
"muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
; "sääni" v1 N Sem/Cat Pl Acc <W:0.0> REMOVE:4913:IllNotAcc
; "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
The error thus seems to be systematic: compounds in the new format, smeared across two lines, cannot be generated by our grc ruleset (?)
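To make the "two lines" point concrete, here is a minimal Python sketch of how a main reading and its sub-reading could be glued into one generator input. The helper names and the list of tokens to skip (`<...>`, `&...`, `Sem/...`, and rule traces such as `SUBSTITUTE:7419`) are assumptions read off the output above, not the actual code in the grc pipeline.

```python
import re

# Tokens assumed to be irrelevant for generation: <smn>, <W:0.0>,
# &error tags, Sem/ tags and CG rule traces like ADD:2572:...
SKIP = re.compile(r"^(<.*>|&.*|Sem/.*|[A-Z]+:\d+.*)$")

def reading_to_tags(reading: str) -> str:
    """Turn one CG reading line into lemma+Tag+Tag form (hypothetical helper)."""
    lemma, rest = re.match(r'\s*"([^"]+)"\s*(.*)', reading).groups()
    tags = [t for t in rest.split() if not SKIP.match(t)]
    return "+".join([lemma] + tags)

def compound_to_generator_input(main: str, sub: str) -> str:
    """Join the sub-reading (first part) and the main reading (last part) with '#'."""
    return reading_to_tags(sub) + "#" + reading_to_tags(main)

main = ('"sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 '
        'SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST '
        'ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom')
sub = '"muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>'

result = compound_to_generator_input(main, sub)
print(result)
```

Under these assumptions the sketch yields the same tag string that the test above reports with a `?`, i.e. the string itself looks well-formed and the problem lies in what the generator accepts.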
So, it seems Flammie's suggestion can be implemented in one of two ways:
../../src/generator-gramcheck-gt-norm.hfst
../../src/generator-gramcheck-gt-norm.hfst
As it is, the grammar checkers suffer from not being able to suggest dynamic compounds. It seems the fix is within reach, one way or another.
generator-gramcheck-gt-norm.hfst is the generator to use, cf. the file name. Is that not the one being used now? It should be built and be available in the tools/grammarchecker/ dir.
Yes, it is the one that is used. It differs from the generator-gt- in that it does not work, hence the bug:
correct fst, wrong result:
uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q ../../src/generator-gramcheck-gt-norm.hfstol
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom+? inf
wrong fst, correct result:
uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q ../../src/generator-gt-norm.hfstol
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâsäänih 0,000000
The same result (= no result) I get when I use the fst in the grammarchecker folder:
uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q generator-gramcheck-gt-norm.hfstol
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom+? inf
afaics everything is working as designed: either the cg rules need to be changed to get rid of +Cmp/SgNom (and possibly other tags) that are not in the gramcheck generator, or the filters of the gramcheck generator need to be made more lenient.
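As a rough illustration of the first option, the Python sketch below drops `+Cmp/SgNom` from the tag string before it would be sent to the generator. The function name and the set of tags to drop are assumptions (only `+Cmp/SgNom` is confirmed in this thread); in practice the change would live in the cg rules or in the generator's filters, not in a script.

```python
# Tags assumed to be unknown to the gramcheck generator. Only Cmp/SgNom
# is confirmed in this thread; any others would have to be added by hand.
UNSUPPORTED_TAGS = {"Cmp/SgNom"}

def strip_unsupported(analysis: str) -> str:
    """Remove unsupported tags from each part of a '#'-joined compound."""
    parts = []
    for part in analysis.split("#"):
        lemma, *tags = part.split("+")
        kept = [t for t in tags if t not in UNSUPPORTED_TAGS]
        parts.append("+".join([lemma] + kept))
    return "#".join(parts)

cleaned = strip_unsupported("muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom")
print(cleaned)
```

The cleaned string is identical to one of the analyses that the inverted gramcheck generator returned for muotâsäänih at the top of the thread, which suggests the generator would accept it.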
The way I usually debug missing generator issues is to use the fst upside down, with the utility hfst-flookup. You then input the word form you expect to be generated, and in return you get the analysis that will generate it. Then compare lemmas and tags, and see if there are differences. Usually there are.
NB! hfst-flookup does not handle the optimised lookup format, so if needed reconvert to the standard Hfst format.
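The comparison step can also be scripted. The following Python sketch (the function name is made up for illustration) takes the analysis the grammar checker asks for and the analysis the upside-down fst returned, and reports lemma and tag differences for each compound part.

```python
from itertools import zip_longest

def compare_analyses(requested: str, accepted: str):
    """Return one row per '#'-separated part:
    (requested lemma, accepted lemma, tags only in requested, tags only in accepted)."""
    rows = []
    pairs = zip_longest(requested.split("#"), accepted.split("#"), fillvalue="")
    for req, acc in pairs:
        req_lemma, *req_tags = req.split("+")
        acc_lemma, *acc_tags = acc.split("+")
        rows.append((req_lemma, acc_lemma,
                     sorted(set(req_tags) - set(acc_tags)),
                     sorted(set(acc_tags) - set(req_tags))))
    return rows

# What the grammar checker asks the generator for:
requested = "muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom"
# One of the analyses the inverted gramcheck generator returned for muotâsäänih:
accepted = "muotâ+N+Cmp#sääni+v1+N+Pl+Nom"

for row in compare_analyses(requested, accepted):
    print(row)
```

For this pair the only difference is `Cmp/SgNom` on the first part, which matches the diagnosis given earlier in the thread.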
This is not a general issue with the grammar checker, it is specific to https://github.com/giellalt/lang-smn. Please continue the discussion in a new issue report over there.
smn grc is able to detect errors with dynamic compounds, but (it seems) not to suggest corrections. The tag string needed for it is present, though.
Here is the analysis:
Replacing the dynamic compound muotâsaanijd with plain saanijd does the trick:
But the analysis string offered for muotâsaanijd is actually capable of generating the wanted form. The test gives:
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom ?
But sending the same string to the generator gives the correct result (muotâsäänih):
So how come the grammarchecker cannot do the same?