Closed Trondtr closed 3 years ago
An additional point: When the command is run in the first setting (the one in git), the full analysis is:
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0> @CNP #3->1
Note the three warnings. They are gone in the second analysis.
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<ja>"
"ja" CC <W:0.0>
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:\n
That is, there is nothing wrong with the morphological analysis output. Thus, it looks like there may be a bug in vislcg3. Perhaps @TinoDidriksen has a comment?
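To see why vislcg3 can stumble here: in a reading line like "\" N Symbol <W:0.0>, the backslash escapes the closing quote, so a parser never finds the end of the baseform and the whole line "looks like a reading but wasn't". A toy illustration in Python (a simplified stand-in, not vislcg3's actual parser):

```python
def find_baseform(line: str):
    """Naive scan for a quoted baseform, honouring \\ as escape.

    Returns the baseform, or None if the closing quote is never
    found, which is what happens with the unescaped line below.
    """
    assert line.startswith('"')
    i = 1
    while i < len(line):
        if line[i] == "\\":
            i += 2          # escaped character: skip over it
        elif line[i] == '"':
            return line[1:i]
        else:
            i += 1
    return None

print(find_baseform('"\\" N Symbol <W:0.0>'))    # None: treated as text
print(find_baseform('"\\\\" N Symbol <W:0.0>'))  # the escaped baseform
```

The first call mirrors the warning in the pipeline output above: the backslash swallows the closing quote, the scan runs off the end of the line, and no baseform is found.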
Tested with lang-sme
:
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst \
| vislcg3 -g src/cg3/disambiguator.cg3
Warning: "\" N Symbol <W:0.0> on line 4 looked like a reading but wasn't - treated as text.
"<ja>"
"ja" CC <W:0.0> <sme> @CVP
:
"<\>"
"\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0> <sme> @CNP
:\n
It looks like "\" is treated as escaping the following " inside a string, which causes the rest of the line to be misinterpreted. Whether this is a bug in vislcg3 or in the hfst-tokenise -cg output is up for debate. In addition to @TinoDidriksen, @unhammer might also have viewpoints on this.
\ is an escape character in the stream as well. "\" is not a valid baseform, but "\\" would be. The actual CG-3 bug here is that "<\>" is parsed as a valid wordform, when it shouldn't be.
So in addition to fixing https://github.com/TinoDidriksen/cg3/issues/69, hfst-tokenise -cg and its variants need to escape \ properly for CG processing. @unhammer, could you have a look at that?
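The escaping Tino asks for can be sketched in a few lines (a hypothetical helper for illustration, not the actual hfst-tokenise code): since \ is the escape character in the CG stream, a literal backslash in a wordform or baseform has to be doubled on output.

```python
def cg_escape(s: str) -> str:
    """Escape a surface form or baseform for the CG-3 stream.

    \\ is the stream's escape character, so a literal backslash
    must be doubled; a double quote inside a baseform likewise
    needs escaping so it cannot terminate the string early.
    """
    return s.replace("\\", "\\\\").replace('"', '\\"')

# Emitting the backslash token as a CG cohort:
form = "\\"                        # one literal backslash
print(f'"<{cg_escape(form)}>"')    # "<\\>"
print(f'\t"{cg_escape(form)}" N Symbol <W:0.0>')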
This brings us one step forward: what Sjur tested was "option a" above, with "\" as both lemma and stem in lexc. This gives us the unintended behaviour that Tino explains. He suggests "\\" as a possible baseform. I may (and will) test that, stay tuned (the lexc entry would then probably be \:\ contlex ;). But "option b" above already did something similar, with backslash:\ contlex ;, and the result was that the dep tag worked, but we got the ghost "<>" reading. So my question is how to solve the problem (the missing deptag) without creating a new one (the ghost reading).
Problem a (the missing deptag) stops analysis, and is thus the main problem. Problem b "only" introduces an extra ghost word.
This should be fixed now, and will be available from Tino's nightly tomorrow. See work by @flammie.
I'll keep this open until the fix has been confirmed.
To Tino's comment:
\ is an escape character, also in the stream. "\" is not a valid baseform, but "\\" would be. The actual CG-3 bug here is that "<\>" is parsed as a valid wordform, when it shouldn't be.
Yes, "\\" is indeed valid (and allows the deptag), just as the lemma backslash did. Unfortunately it also gives the ghost tag (here with \ as lemma):
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\>"
"\\" N Symbol <W:0.0> @X #2->0
"<>"
"'" PUNCT <W:0.0> #3->4
:
"<ja>"
"ja" CC <W:0.0> @CVP #4->0
Note that the problem arises before the reading is sent to vislcg3, cf. the blank line after "\" (again with a string different from "\" as lemma for "\"). The blank line is what gives rise to the ghost reading "<>":
echo "ja \ ja"|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
ja
\
ja
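The blank-line effect can be illustrated with a toy conversion (purely illustrative, not the real tool): if the plain tokeniser output contains an empty token, the CG-side wrapping turns it into exactly the kind of ghost cohort seen above.

```python
def to_cohorts(tokens):
    """Wrap surface tokens as CG-style cohort lines.

    An empty token, e.g. from a stray blank line in the
    tokeniser output, yields the ghost cohort "<>".
    """
    return [f'"<{t}>"' for t in tokens]

# The blank line after the backslash shows up as an empty token:
for cohort in to_cohorts(["ja", "\\", "", "ja"]):
    print(cohort)
# the third cohort printed is the unwanted "<>"
```

This is why the fix belongs on the tokeniser side: once the empty token is in the stream, the downstream CG tools have no way to know it was never real input.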
I get the following with current versions in git:
echo "ja \ ja" | ~/github/hfst/hfst/tools/src/hfst-tokenize -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
"<ja>"
"ja" CC <W:0.0> @CVP #1->0
:
"<\\>"
"\\" N Symbol <W:0.0> @X #2->0
:
"<ja>"
"ja" CC <W:0.0> @CNP #3->2
:\n
Looks good. The only caveat is that we need to de-escape on output in cases where that matters.
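De-escaping on output could look like this (a hypothetical helper, assuming only backslash-prefixed escape sequences such as \\ and \" occur in the stream):

```python
def cg_unescape(s: str) -> str:
    """Undo CG-stream escaping: each backslash-escaped
    character is replaced by the character itself."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(s[i + 1])  # keep the escaped character
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(cg_unescape("\\\\"))  # one literal backslash again
```

Applied to the wordform "<\\>" above, this would restore the original surface form \ for display or corpus output.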
Now we are talking :-) I tried to compile hfst from git, but got the same version number as I already had, hfst-tokenize 0.1 (hfst 3.15.2), and the same result, so it seems I will have to wait for the nightly build to confirm and close. But the result Tommi shows is what we want.
Confirmed with nightly:
First case - non-CG-stream variant, no escape:
echo " jo \ ja "|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
jo
\
ja
Second case, CG-stream variant, with escape:
echo " jo \ ja "|hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
:
"<jo>"
"jo" Adv <W:0.0>
"jo" Interj Err/Lex <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
: \n
Fixed and closed.
Closing was a bit too fast. I have:
updated hfst-tokenize from nightly
uit-mac-443:lang-sje ttr000$ which hfst-tokenize
/usr/local/bin/hfst-tokenize
ls -l /usr/local/bin/hfst-tokenize
-rwxr-xr-x 1 jamfmanage wheel 100820 1 des 18:56 /usr/local/bin/hfst-tokenize
updated the whole giellalt catalogue set, and done make clean && make in lang-sje (configuration is ./configure --with-hfst --enable-tokenisers --enable-reversed-intersect). The result is that I am not able to reproduce Tommi's and Sjur's results:
uit-mac-443:lang-sje ttr000$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<>"
"'" PUNCT <W:0.0>
:\n
"<>"
"'" PUNCT <W:0.0>
uit-mac-443:lang-sje ttr000$ echo " ja \ ja "|hfst-tokenise tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
ja
\
ja
Chiara gets the same result as me (i.e. the same as before the update).
Yes, I recompiled and tested in lang-mhr.
The \\ part works now. But not all languages analyse \ the same.
In -kal:
$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
:
"<ja>"
"ja" Interj <W:0.0>
:
"<\\>"
"\\" Symbol N <W:0.0>
:
"<ja>"
"ja" Interj <W:0.0>
: \n
But in -sje:
$ echo " ja \ ja "|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<\\>"
"\\" N Symbol <W:0.0>
"<>"
"'" PUNCT <W:0.0>
:
"<ja>"
"ja" CC <W:0.0>
:
"<>"
"'" PUNCT <W:0.0>
:\n
"<>"
"'" PUNCT <W:0.0>
...or actually, it is the space that appears to be analysed here, when it shouldn't be.
This is thus the mystery: I see no explanation for why they should behave differently.
The thing is that both kal and sje get the \ from the same file (generated-files/symbols.lexc), and both point the \ entry to the same file and contlex (in affixes/symbols.lexc), where for both languages the lexicon points to the sole entry
+N+Symbol: # ;
I also see no mention of \ for either language in the tools/tokenisers catalogue.
I now recompile kal to see what happens.
My guess is that the empty
"<>"
"'" PUNCT <W:0.0>
analysis is not related to the \ symbol at all, but is an artifact of bad twolc rules or alphabets or some such. The empty analysis appears in several places, not just after the \. I still maintain that the \ bug is fixed; the other issues are other issues.
It seems you are right after all.
I reopened the bug since I got the same error as before for sje, and Chiara got it for mhr. Whereas sje is irrelevant to Korp, mhr will provide 2/3 of all text in Korp; we thus got the error for 2 of 2 languages, hence the reopening.
With Sjur's comment in mind, I tested all the languages that are in the pipeline for the korp update. What I got surprised me:
Ghost reading for the "ja \ ja" string: sje (even after repeated testing)
No ghost reading for the "ja \ ja" string: smn, mrj, sma, sme, mhr, kpv, fit, fao, vep, smj, mdf, vro, udm, myv, fkv, nob, fin
I now ignore sje, as we do not have it in korp. I do not know what caused Chiara's mhr to misbehave, and we will have to find out, but since I get all fsts to work except sje, any remaining error will be different from the one discussed here and will have to end up in a new bug report (or one for sje and one for mhr).
Chiara and I have had a look at it. mhr is still forthcoming, but she got nob and fkv to work.
We close the bug.
topic: Bug in the analysis of our corpus texts for the forthcoming SIKOR update.
problem: the \ symbol either misses its dependency node or gets a ghost analysis, where the dependency node is the final #n->m tag of each reading, and the ghost analysis is the analysis of the non-existing (= empty) character "<>".
to reproduce: run the following pipeline for either of two fsts, where the analysis pipeline in both cases is the same (standing in lang-xxx):
echo "ja \ ja" | hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |vislcg3 -g src/cg3/disambiguator.cg3 |vislcg3 -g src/cg3/functions.cg3 | vislcg3 -g src/cg3/dependency.cg3
Option (a): run the command above with \ as lemma, where the file src/fst/generated-files/symbols.lexc has the following entry for backslash (this is the case today):
The analysis is:
What is missing is the dependency node (see the rightmost nodes #1 and #3 (sic) on the two other words).
Option (b): In the symbols file, set backslash (or whatever) as lemma and \ as stem; the entry is then:
Now recompile, run the same command, and the analysis is:
Note that the dependency node is now in place. Good. But the downside is the ghost analysis of "<>", which we do not want.
Neither option is optimal, but the former is worse: here, the dep node #3->0 is missing, and the analysis of our corpus stops. With the latter version we get the dep node, as can be seen, but it comes with an empty reading of "<>" (with a depnode) as an unwanted passenger.
Now, (b) does not give exactly what we want, but (a) gives us no analysis at all. The best solution would be to have the dep analysis and no ghost analysis.