Closed Trondtr closed 3 years ago
An additional point: When the command is run in the first setting (the one in git), the full analysis is:
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0> @CNP #3->1
Note the three warnings. They are gone in the second analysis.
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<ja>"
"ja" CC <W:0.0>
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:\n
That is, there is nothing wrong with the morphological analysis output. Thus, it looks like there may be a bug in vislcg3. Perhaps @TinoDidriksen has a comment?
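To see why vislcg3 can stumble here: in a reading line like "\" N Symbol <W:0.0>, the backslash escapes the closing quote, so a parser never finds the end of the baseform and the whole line "looks like a reading but wasn't". A toy illustration in Python (a simplified stand-in, not vislcg3's actual parser):

```python
def find_baseform(line: str):
    """Naive scan for a quoted baseform, honouring \\ as escape.

    Returns the baseform, or None if the closing quote is never
    found, which is what happens with the unescaped line below.
    """
    assert line.startswith('"')
    i = 1
    while i < len(line):
        if line[i] == "\\":
            i += 2          # escaped character: skip over it
        elif line[i] == '"':
            return line[1:i]
        else:
            i += 1
    return None

print(find_baseform('"\\" N Symbol <W:0.0>'))    # None: treated as text
print(find_baseform('"\\\\" N Symbol <W:0.0>'))  # the escaped baseform
```

The first call mirrors the warning in the pipeline output above: the backslash swallows the closing quote, the scan runs off the end of the line, and no baseform is found.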
Tested with lang-sme
:
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g src/cg3/disambiguator.cg3
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
"<ja>"
"ja" CC <W:0.0> <sme> @CVP
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0> <sme> @CNP
:\n
It looks like "\" is treated as escaping the following " inside a string, which causes the rest of the line to be misinterpreted. Whether this is a bug in vislcg3 or in the hfst-tokenise -cg output is up for debate. In addition to @TinoDidriksen, @unhammer might also have viewpoints on this.
\ is an escape character in the stream as well. "\" is not a valid baseform, but "\\" would be. The actual CG-3 bug here is that "<\>" is parsed as a valid wordform, when it shouldn't be.
So in addition to fixing https://github.com/TinoDidriksen/cg3/issues/69, hfst-tokenise -cg and its variants need to escape \ properly for CG processing. @unhammer, could you have a look at that?
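The escaping Tino asks for can be sketched in a few lines (a hypothetical helper for illustration, not the actual hfst-tokenise code): since \ is the escape character in the CG stream, a literal backslash in a wordform or baseform has to be doubled on output.

```python
def cg_escape(s: str) -> str:
    """Escape a surface form or baseform for the CG-3 stream.

    \\ is the stream's escape character, so a literal backslash
    must be doubled; a double quote inside a baseform likewise
    needs escaping so it cannot terminate the string early.
    """
    return s.replace("\\", "\\\\").replace('"', '\\"')

# Emitting the backslash token as a CG cohort:
form = "\\"                        # one literal backslash
print(f'"<{cg_escape(form)}>"')    # "<\\>"
print(f'\t"{cg_escape(form)}" N Symbol <W:0.0>')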
This brings us one step forward: what Sjur tested was "option a" above, with "\" as both lemma and stem in lexc. This gives us the unintended behaviour that Tino explains. He suggests "\\" as a possible baseform. I may (and will) test that, stay tuned (the lexc entry would then probably be \:\ contlex ;). But "option b" above already did something similar, with backslash:\ contlex ;, and the result was that the dep tag worked, but we got the ghost "<>" reading. So my question is how to solve the problem (the missing deptag) without creating a new one (the ghost reading).
Problem a (the missing deptag) stops analysis, and is thus the main problem. Problem b "only" introduces an extra ghost word.
This should be fixed now, and will be available from Tino's nightly tomorrow. See work by @flammie.
I'll keep this open until the fix has been confirmed.
To Tino's comment:
\ is an escape character, also in the stream. "\" is not a valid baseform, but "\\" would be. The actual CG-3 bug here is that "<\>" is parsed as a valid wordform, when it shouldn't be.
Yes, "\\" is indeed valid (and allows the deptag), just as the lemma backslash did. Unfortunately it also gives the ghost tag (here with \ as lemma):
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\>"
"\\" N Symbol <W:0.0> @X #2->0
"<>"
"'" PUNCT <W:0.0> #3->4
:
"<ja>"
"ja" CC <W:0.0> @CVP #4->0
Note that the problem arises before the reading is sent to vislcg3, cf. the blank line after "\" (again with a string different from "\" as lemma for "\"). The blank line is what gives rise to the ghost reading "<>":
echo "ja \ ja"|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
ja
\
ja
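The blank-line effect can be illustrated with a toy conversion (purely illustrative, not the real tool): if the plain tokeniser output contains an empty token, the CG-side wrapping turns it into exactly the kind of ghost cohort seen above.

```python
def to_cohorts(tokens):
    """Wrap surface tokens as CG-style cohort lines.

    An empty token, e.g. from a stray blank line in the
    tokeniser output, yields the ghost cohort "<>".
    """
    return [f'"<{t}>"' for t in tokens]

# The blank line after the backslash shows up as an empty token:
for cohort in to_cohorts(["ja", "\\", "", "ja"]):
    print(cohort)
# the third cohort printed is the unwanted "<>"
```

This is why the fix belongs on the tokeniser side: once the empty token is in the stream, the downstream CG tools have no way to know it was never real input.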
I get the following with current versions in git:
echo "ja \ ja" | ~/github/hfst/hfst/tools/src/hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\\>"
"\\" N Symbol <W:0.0> @X #2->0
:
"<ja>"
"ja" CC <W:0.0> @CNP #3->2
:\n
Looks good. The only caveat is that we need to de-escape on output in cases where that matters.
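De-escaping on output could look like this (a hypothetical helper, assuming only backslash-prefixed escape sequences such as \\ and \" occur in the stream):

```python
def cg_unescape(s: str) -> str:
    """Undo CG-stream escaping: each backslash-escaped
    character is replaced by the character itself."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # keep the escaped character
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(cg_unescape("\\\\"))  # one literal backslash again
```

Applied to the wordform "<\\>" above, this would restore the original surface form \ for display or corpus output.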
Now we are talking :-) I tried to compile hfst from git, but got the same version number as I already had, hfst-tokenize 0.1 (hfst 3.15.2), and the same result, so it seems I will have to wait for the nightly build to confirm and close. But the result Tommi shows is what we want.
Confirmed with nightly:
First case - non-CG-stream variant, no escape:
echo " jo \ ja "|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
jo
\
ja
Second case, CG-stream variant, with escape:
echo " jo \ ja "|hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
:
"<jo>"
"jo" Adv <W:0.0>
"jo" Interj Err/Lex <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
: \n
Fixed and closed.
Closing was a bit too fast. I have:
updated hfst-tokenize from nightly
uit-mac-443:lang-sje ttr000$ which hfst-tokenize
/usr/local/bin/hfst-tokenize
ls -l /usr/local/bin/hfst-tokenize
-rwxr-xr-x 1 jamfmanage wheel 100820 1 des 18:56 /usr/local/bin/hfst-tokenize
updated the whole giellalt catalogue set, and done make clean && make in lang-sje (configuration is ./configure --with-hfst --enable-tokenisers --enable-reversed-intersect). The result is that I am not able to reproduce Tommi's and Sjur's results:
uit-mac-443:lang-sje ttr000$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<>"
"'" PUNCT <W:0.0>
:\n
"<>"
"'" PUNCT <W:0.0>
uit-mac-443:lang-sje ttr000$ echo " ja \ ja "|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
ja
\
ja
Chiara gets the same result as me (i.e. the same as before the update).
Yes, I recompiled and tested in lang-mhr.
The \\ part works now. But not all languages analyse \ the same.
In -kal:
$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
:
"<ja>"
"ja" Interj <W:0.0>
:
"<\\>"
"\\" Symbol N <W:0.0>
:
"<ja>"
"ja" Interj <W:0.0>
: \n
But in -sje:
$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<>"
"'" PUNCT <W:0.0>
:\n
"<>"
"'" PUNCT <W:0.0>
...or actually, it is the space that appears to be analysed here, when it shouldn't be.
This is thus the mystery: I see no explanation for why they should behave differently.
The thing is that both kal and sje get the \ from the same file (generated-files/symbols.lexc), and both point the \ entry to the same file and contlex (in affixes/symbols.lexc), where for both languages the lexicon points to the sole entry
+N+Symbol: # ;
I also see no mention of \ for either language in the tools/tokenisers catalogue.
I now recompile kal to see what happens.
My guess is that the empty
"<>"
"'" PUNCT <W:0.0>
analysis is not related to the \ symbol at all, but is an artifact of bad twolc rules or alphabets or some such. The empty analysis appears in several places, not just after the \. I still maintain that the \ bug is fixed; the other issues are other issues.
It seems you are right after all.
I reopened the bug since I got the same error as before for sje, and Chiara got it for mhr. Whereas sje is irrelevant to Korp, mhr will provide 2/3 of all text in Korp; we thus got the error for 2 of 2 languages, hence the reopening.
With Sjur's comment in mind, I tested all the languages that are in the pipeline for the korp update. What I got surprised me:
Ghost reading for the "ja \ ja" string: sje (even after repeated testing)
No ghost reading for the "ja \ ja" string: smn, mrj, sma, sme, mhr, kpv, fit, fao, vep, smj, mdf, vro, udm, myv, fkv, nob, fin
I now ignore sje, as we do not have it in korp. I do not know what caused Chiara's mhr to misbehave, and we will have to find out, but since I get all fsts to work except sje, any remaining error will be different from the one discussed here and will have to end up in a new bug report (or one for sje and one for mhr).
Chiara and I have had a look at it. mhr is still forthcoming, but she got nob and fkv to work.
We close the bug.
topic: Bug in the analysis of our corpus texts for the forthcoming SIKOR update.
problem: the \ symbol either misses its dependency node or gets a ghost analysis, where the dependency node is the final #n->m tag of each reading, and the ghost analysis is the analysis of the non-existing (= empty) character "<>".
to reproduce: run the following pipeline for either of two fsts, where the analysis pipeline in both cases is the same (standing in lang-xxx):
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
Option (a): run the command above with \ as lemma, where the file src/fst/generated-files/symbols.lexc has the following entry for backslash (this is the case today):
The analysis is:
What is missing is the dependency node (see the rightmost nodes #1 and #3 (sic) on the two other words).
Option (b): In the symbols file, set backslash (or whatever) as lemma and \ as stem; the entry is then:
Now recompile, run the same command, and the analysis is:
Note that the dependency node is now in place. Good. But the downside is the ghost analysis of "<>", which we do not want.
Neither option is optimal, but the former is worse: here, the dep node #3->0 is missing, and the analysis of our corpus stops. With the latter version we get the dep node, as can be seen, but it comes with an empty reading of "<>" (with a depnode) as an unwanted passenger.
Now, (b) does not give exactly what we want, but (a) gives us no analysis at all. The best solution would be to have the dep analysis and no ghost analysis.