divvun / divvun-gramcheck-web

Grammar checker for web word processors, targeted at minority and indigenous languages, but open for everyone.
GNU General Public License v3.0

Grc does not suggest dynamic compounds #68

Closed Trondtr closed 2 years ago

Trondtr commented 2 years ago

The smn grammar checker (grc) is able to detect errors in dynamic compounds, but (it seems) not to suggest corrections for them. The tag string needed for generating the suggestion is present, though.

Here is the analysis:

uit-mac-443:grammarcheckers ttr000$ e Sämikielâst láá eromâš muotâsaanijd.|sh modes/trace-smngram-dev.mode 
"<Sämikielâst>"
    "sämikielâ" N <smn> <smn> Sem/Lang Sg Loc <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419
;   "kielâ" N Sem/Lang_Tool Sg Loc <W:0.0>
;       "säämi" N Sem/Hum_Lang Cmp/SgGen Cmp <W:0.0> REMOVE:2004:longest-match
;   "kielâ" N Sem/Lang_Tool Sg Loc <W:0.0>
;       "säämi" N Sem/Hum_Lang Cmp/SgNom Cmp <W:0.0> REMOVE:2004:longest-match
: 
"<láá>"
    "leđe" V <smn> <smn> IV Ind Prs Pl3 <W:0.0> MAP:4646:+FMAINVCop SUBSTITUTE:7421 @+FMAINV SUBSTITUTE:7421
: 
"<eromâš>"
    "eromâš" A <smn> <smn> Attr <W:0.0> SUBSTITUTE:7418 SUBSTITUTE:7418
    "eromâš" A <smn> <smn> Sg Gen <W:0.0> SUBSTITUTE:7418 SUBSTITUTE:7418
    "eromâš" Adv <smn> <smn> <W:0.0> SUBSTITUTE:7420 SUBSTITUTE:7420
;   "eromâš" A Sg Acc <W:0.0> REMOVE:5456:NotAcc
;   "eromâš" A Sg Nom <W:0.0> REMOVE:3821:Wr1785
: 
"<muotâsaanijd>"
    "sääni" v1 N <smn> <smn> Sem/Cat Pl Ill <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom ?
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2621:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom COPY:2621:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
;   "sääni" v1 N Sem/Cat Pl Acc <W:0.0> REMOVE:4913:IllNotAcc
;       "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
"<.>"
    "." CLB <W:0.0>
:

Replacing dynamic compound muotâsaanijd with plain saanijd does the trick:

"<saanijd>"
    "sääni" v1 N <smn> <smn> Sem/Cat Pl Ill <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl
msyn-extsubj-ill-nom
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
sääni+v1+N+Pl+Nom   säänih
...

But the analysis string offered for muotâsaanijd is actually capable of generating the wanted form. The test gives:

muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom ?

But sending the same string to the generator gives the correct result (muotâsäänih):

e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q ../../src/generator-gt-norm.hfst
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâsäänih 0,000000

So how come the grammarchecker cannot do the same?

flammie commented 2 years ago

It seems like a bug I worked on with sme, but maybe it is not relevant to web? The generator to use is probably ../../src/generator-gramcheck-gt-norm.hfst; I tested the word like so:

$ hfst-invert generator-gramcheck-gt-norm.hfst -o foo
$ hfst-lookup foo
hfst-lookup: warning: It is not possible to perform fast lookups with foma format automata.
Using HFST basic transducer format and performing slow lookups
> muotâsäänih
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+v1+N+Sem/Cat+Pl+Nom  0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+v1+N+Pl+Nom  0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+N+Sem/Cat+Pl+Nom 0,000000
muotâsäänih muotâ+N+Sem/Substnc_Wthr+Cmp#sääni+N+Pl+Nom 0,000000
muotâsäänih muotâ+N+Cmp#sääni+v1+N+Sem/Cat+Pl+Nom   0,000000
muotâsäänih muotâ+N+Cmp#sääni+v1+N+Pl+Nom   0,000000
muotâsäänih muotâ+N+Cmp#sääni+N+Sem/Cat+Pl+Nom  0,000000
muotâsäänih muotâ+N+Cmp#sääni+N+Pl+Nom  0,000000

there's some +Cmp/filtering before the

Trondtr commented 2 years ago

The analyser I used here was only meant to demonstrate that (some) fst is able to generate it. The grc pipeline itself is the standard one. Your test goes the wrong way, btw. The input I pasted in contained only new-style compounds spread over two lines, but at some point I got the compounding on one line. Now it is only:

Lexicalised compound, everything works:

    "árvusääni" v1 N <smn> <smn> Sem/Sign Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
árvusääni+v1+N+Pl+Nom   árvusäänih
(...)

Non-lexicalised compounds are distributed over several lines and do not work:

"<muotâsaanijd>"
    "sääni" v1 N <smn> <smn> Sem/Cat Pl Ill <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom ?
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2621:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 SUBSTITUTE:7419 Nom Nom &msyn-extsubj-ill-nom ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom COPY:2621:msyn-extsubj-ill-nom
        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>
msyn-extsubj-ill-nom
;   "sääni" v1 N Sem/Cat Pl Acc <W:0.0> REMOVE:4913:IllNotAcc
;       "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>

The error thus seems to be systematic: compounds in the new format, smeared across two lines, cannot be generated by our grc ruleset (?)
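As a rough illustration of what "smeared across two lines" means for generation, here is a minimal Python sketch that rebuilds the generation string from a main reading and its sub-reading. It is inferred only from the trace above and is not the checker's actual code. The function names and the list of skipped tag types (`<...>`, `Sem/...`, `&...`, and `RULE:number` trace entries) are assumptions.

```python
import re

# Assumption: judging only from the trace, these tag types are left out of the
# generation string: <...> tags, Sem/... tags, &error tags, RULE:number entries.
SKIP = re.compile(r"<.*>|Sem/.*|&.*|[A-Z]+:\d+(:.*)?")

def reading_to_string(line: str) -> str:
    """Turn one CG reading line into lemma+Tag+Tag..., dropping skipped tags."""
    lemma, rest = re.match(r'\s*"([^"]+)"\s*(.*)', line).groups()
    tags = [t for t in rest.split() if not SKIP.fullmatch(t)]
    return "+".join([lemma] + tags)

def compound_string(main: str, sub: str) -> str:
    """Join the sub-reading (first compound part) and the main reading with '#'."""
    return reading_to_string(sub) + "#" + reading_to_string(main)

main = ('    "sääni" v1 N <smn> <smn> Sem/Cat Pl <W:0.0> SUBSTITUTE:7419 '
        'SUBSTITUTE:7419 Nom &msyn-extsubj-ill-nom &SUGGEST '
        'ADD:2572:msyn-extsubj-ill-nom-pl COPY:2573:msyn-extsubj-ill-nom')
sub = '        "muotâ" N Sem/Substnc_Wthr Cmp/SgNom Cmp <W:0.0>'

print(compound_string(main, sub))
# muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom
```

For the lexicalised árvusääni reading there is no sub-reading, so `reading_to_string` alone gives the string that does generate.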

Trondtr commented 2 years ago

So, it seems Flammie's suggestion can be implemented in one of two ways:

  1. Give the fst we use for generation the same properties as Flammie's ../../src/generator-gramcheck-gt-norm.hfst
  2. Change from what we use to ../../src/generator-gramcheck-gt-norm.hfst

As it is, the grammar checkers suffer from not being able to suggest dynamic compounds. It seems the fix is within reach, one way or another.

snomos commented 2 years ago

generator-gramcheck-gt-norm.hfst is the generator to use, cf the file name. Is that not the one being used now? It should be built and be available in the tools/grammarchecker/ dir.

Trondtr commented 2 years ago

Yes, it is the one that is used. It differs from generator-gt-norm in that it does not work, hence the bug:

correct fst, wrong result:
uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q ../../src/generator-gramcheck-gt-norm.hfstol 
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom+?   inf

wrong fst, correct result:
uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q ../../src/generator-gt-norm.hfstol 
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâsäänih 0,000000

Trondtr commented 2 years ago

I get the same result (= no result) when I use the fst in the grammarchecker folder:

uit-mac-443:grammarcheckers ttr000$ e muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom|hfst-lookup -q generator-gramcheck-gt-norm.hfstol 
muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom+?   inf

flammie commented 2 years ago

afaics everything is working as designed: either the cg rules need to be changed to get rid of +Cmp/SgNom (and possibly other tags) that are not in the gramcheck generator, or the filters of the gramcheck generator need to be made more lenient.
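To make the mismatch concrete, here is a minimal Python sketch of the first option, applied to the analysis string itself rather than to the cg rules. It assumes that every tag of the form `+Cmp/<something>` is filtered out of the gramcheck generator while plain `+Cmp` is kept. Only `+Cmp/SgNom` is confirmed in this thread, and `strip_cmp_subtags` is a hypothetical helper, not part of any existing tool.

```python
import re

# Assumption: +Cmp/<something> tags (e.g. +Cmp/SgNom) are absent from the
# gramcheck generator, while plain +Cmp and the '#' boundary are kept.
CMP_SUBTAG = re.compile(r"\+Cmp/[^+#]+")

def strip_cmp_subtags(analysis: str) -> str:
    """Remove +Cmp/... tags from an analysis string, leaving +Cmp and '#' intact."""
    return CMP_SUBTAG.sub("", analysis)

print(strip_cmp_subtags("muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom"))
# muotâ+N+Cmp#sääni+v1+N+Pl+Nom
```

The resulting string is one of the analyses that the inverted generator-gramcheck-gt-norm.hfst returned for muotâsäänih earlier in this thread.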

snomos commented 2 years ago

The way I usually debug missing generator issues is to use the fst upside down, with the utility hfst-flookup. You then input the word form you expect to be generated, and in return you get the analysis that will generate it. Then compare lemmas and tags, and see if there are differences. Usually there are.

NB! hfst-flookup does not handle the optimised lookup format, so if needed reconvert to the standard Hfst format.
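The "compare lemmas and tags" step can be sketched in Python as below. This is only an illustration under assumptions: the two strings are typed in by hand from the output in this thread, and `split_analysis` and `tag_diff` are hypothetical helper names. Compound parts are compared position by position, so `zip` silently ignores extra parts if the two strings have a different number of them.

```python
def split_analysis(analysis: str) -> list[list[str]]:
    """Split an analysis on '#' into compound parts, each a list of lemma + tags."""
    return [part.split("+") for part in analysis.split("#")]

def tag_diff(rejected: str, accepted: str) -> list[tuple[int, set[str], set[str]]]:
    """Per compound part: items only in the rejected string, items only in the accepted one."""
    diffs = []
    pairs = zip(split_analysis(rejected), split_analysis(accepted))
    for i, (r, a) in enumerate(pairs):
        only_r, only_a = set(r) - set(a), set(a) - set(r)
        if only_r or only_a:
            diffs.append((i, only_r, only_a))
    return diffs

# String the grammar checker tried to generate from (no result):
rejected = "muotâ+N+Cmp/SgNom+Cmp#sääni+v1+N+Pl+Nom"
# One analysis the inverted gramcheck generator gave for muotâsäänih:
accepted = "muotâ+N+Cmp#sääni+v1+N+Pl+Nom"

print(tag_diff(rejected, accepted))
# [(0, {'Cmp/SgNom'}, set())]
```

Here the only difference is `Cmp/SgNom` on the first compound part.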

snomos commented 2 years ago

This is not a general issue with the grammar checker, it is specific to https://github.com/giellalt/lang-smn. Please continue the discussion in a new issue report over there.