aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

Error on SVM PosTagger when a new line starts with ".0" #95

Closed (romaricb closed this issue 2 years ago)

romaricb commented 4 years ago

With a text containing "\n.0", the SVM pos-tagger produces the following error: Error in SVMTagger result alignement with analysis graph: got ' .0 ' from SVMTagger and ' "\n.0" ' from graph

To reproduce: echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt

Note: this error does not appear with --language=fre

romaricb commented 4 years ago

Same kind of error with another text: Error in SVMTagger result line: did not get 2 elements in ' '

To reproduce: echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt

Maybe a problem with the tokenizer?
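
Both messages are consistent with a token that contains a newline being passed to a tagger that reads one token per line. The sketch below illustrates that failure mode in Python. It is not LIMA code: the one-token-per-line exchange is an assumption inferred from the error messages, and the function names are hypothetical.

```python
def to_tagger_input(tokens):
    """Serialize tokens one per line (assumed exchange format, not confirmed)."""
    return "\n".join(tokens) + "\n"


def read_back(serialized):
    """Split the text into lines, as a line-based reader would see it."""
    return serialized.split("\n")[:-1]


def first_misalignment(tokens):
    """Return (index, token_sent, line_seen) for the first mismatch, else None."""
    lines = read_back(to_tagger_input(tokens))
    for i, token in enumerate(tokens):
        seen = lines[i] if i < len(lines) else None
        if seen != token:
            return i, token, seen
    return None


# The token "\n.0" becomes an empty line followed by ".0".
print(first_misalignment(["\n.0"]))                  # (0, '\n.0', '')
# The token ".\n." becomes two separate "." lines.
print(first_misalignment(["some", "text", ".\n."]))  # (2, '.\n.', '.')
# Tokens without newlines survive the round trip.
print(first_misalignment(["some", "text", "."]))     # None
```

Under that assumption, the empty line in the first case would also explain a "did not get 2 elements" style complaint, since an empty line carries no token to tag.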

kleag commented 2 years ago

Solved in commit 876c29367f0dfcf49143ad126e77f957ee75ad0c:

gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 No matching category found for tagger result  ".0"   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text =  .0 
1       \x0a.0
.0      NUM     _       _       _       _       _       NE=I-Numex.NUMBER|Pos=1|Len=3

gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 No matching category found for tagger result  ".\u200B."   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text = some text.
1       some    some    DET     _       _       2       det     _       Pos=1|Len=4
2       text    text    NOUN    _       NUMBER=SING     3       Dummy   _       Pos=6|Len=4|SpaceAfter=No
3       .\x0a.  .
.       SENT    _       _       0       _       _       Pos=10|Len=3

This avoids the error in the PoS tagger, but it does not solve the underlying tokenizer error.
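
As a rough illustration of what a tokenizer-side fix could aim for, the Python sketch below splits a token at line breaks and keeps track of each piece's offset. It is a hypothetical helper, not LIMA's tokenizer. The name split_on_newlines and the 0-based offsets are assumptions made for the example.

```python
import re


def split_on_newlines(token, start):
    """Split a token at line breaks.

    Returns a list of (text, offset) pairs, where offset is the 0-based
    position of the piece in the original text. Empty pieces are dropped.
    """
    return [(m.group(), start + m.start())
            for m in re.finditer(r"[^\r\n]+", token)]


# "\n.0" at offset 0 yields a single clean token.
print(split_on_newlines("\n.0", 0))   # [('.0', 1)]
# ".\n." at offset 9 yields two separate punctuation tokens.
print(split_on_newlines(".\n.", 9))   # [('.', 9), ('.', 11)]
```

With tokens split this way, no token handed to the tagger would contain a newline, so the workaround in the PoS tagger would no longer be triggered for these inputs.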