romaricb closed this issue 2 years ago
The same kind of error occurs with another text: Error in SVMTagger result line: did not get 2 elements in ' '
To reproduce: echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Maybe a problem with the tokenizer?
Solved in commit 876c29367f0dfcf49143ad126e77f957ee75ad0c:
gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 No matching category found for tagger result ".0" "NOUN"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = .0
1 \x0a.0
.0 NUM _ _ _ _ _ NE=I-Numex.NUMBER|Pos=1|Len=3
gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 No matching category found for tagger result ".\u200B." "NOUN"
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = some text.
1 some some DET _ _ 2 det _ Pos=1|Len=4
2 text text NOUN _ NUMBER=SING 3 Dummy _ Pos=6|Len=4|SpaceAfter=No
3 .\x0a. .
. SENT _ _ 0 _ _ Pos=10|Len=3
But this commit only avoids the problem in the PoS tagger, as the warnings say. It does not solve the underlying tokenizer error.
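The second log above prints the tagger result as ".\u200B.", which suggests that the workaround swaps each newline inside a token for a zero-width space (U+200B) before the token reaches the tagger. The first log shows ".0" instead, so the commit may well do something different. A minimal Python sketch of that idea, under that assumption; sanitize_token and the logger name are hypothetical and are not taken from the commit:

```python
import logging

ZERO_WIDTH_SPACE = "\u200b"
log = logging.getLogger("PosTaggerSketch")  # hypothetical logger name


def sanitize_token(form):
    """Return a form that fits on a single line of tagger input.

    Assumption: newlines are replaced by U+200B, as hinted by the log.
    """
    if "\n" not in form:
        return form
    log.warning("Invalid token with newline(s): %r", form)
    return form.replace("\n", ZERO_WIDTH_SPACE)


# Tokens as the tokenizer seems to emit them for "some text.\n.\n"
tokens = ["some", "text", ".\n."]
safe = [sanitize_token(t) for t in tokens]
print(safe)
```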
With a text containing "\n.0", the SVM PoS tagger produces the following error: Error in SVMTagger result alignement with analysis graph: got ' .0 ' from SVMTagger and ' "\n.0" ' from graph
To reproduce: echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Note: this error does not appear with --language=fre
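A possible explanation, sketched in Python: if the tagger is fed one token per line and returns one "form tag" pair per line (an assumption suggested by the error messages, not confirmed in this thread), a token such as "\n.0" is spread over two lines. The tagger then sees an empty line and ".0", neither of which matches the token held in the analysis graph. All function names and the stand-in tagger below are hypothetical and are not LIMA or SVMTool code:

```python
def serialize(tokens):
    """Write one token per line, as a line-oriented tagger would expect."""
    return "".join(tok + "\n" for tok in tokens)


def fake_tagger(payload):
    """Hypothetical stand-in for a tagger: tags every input line as X."""
    lines = payload.split("\n")[:-1]  # drop the piece after the final newline
    return [f"{line} X" for line in lines]


def parse_results(lines):
    """Split each result line into (form, tag); collect malformed lines."""
    pairs, errors = [], []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
        else:
            errors.append(line)
    return pairs, errors


tokens = ["\n.0"]  # a single token that starts with a newline
results = fake_tagger(serialize(tokens))
pairs, errors = parse_results(results)

print("tokens sent:   ", len(tokens))
print("lines received:", len(results))
print("well-formed:   ", pairs)
print("malformed:     ", errors)
```

In this sketch the single token comes back as two result lines: one that cannot be split into two elements, and one whose form is ".0" rather than "\n.0".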