aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

SVM PosTagger fails on document without recovery on error #127

Closed benlabbe closed 2 years ago

benlabbe commented 2 years ago

Describe the bug: The SVM PosTagger sometimes fails on some documents. The origin of the errors is not clear (either inside SVMTool or in the way we use it). This causes the process to crash (either analyzeText or analyzeXml).

To Reproduce: This issue is linked to #95, which describes one of the errors occasionally encountered.

More examples are needed. Please, @benlabbe, you are summoned to upload sample XML files!

Expected behavior: Whatever the reason for the errors in the SVM PosTagger, text processing should continue without side effects on the text segments that remain to be analyzed (e.g. the following paragraphs in an XML file).
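To make the expected behavior concrete, here is a minimal sketch of per-segment error isolation. It is not LIMA code: `SegmentResult` and `analyzeAllSegments` are hypothetical names, and the analyzer is passed in as a plain callback. It also assumes the failure surfaces as a C++ exception; a hard crash such as a segmentation fault would not be caught this way.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Outcome of analyzing one text segment (e.g. one paragraph of an XML file).
struct SegmentResult {
    bool ok;            // true if the analyzer returned normally
    std::string error;  // error message when ok is false
};

// Hypothetical helper, not part of LIMA: runs `analyze` on every segment and
// records failures instead of letting one bad segment abort the whole document.
std::vector<SegmentResult> analyzeAllSegments(
    const std::vector<std::string>& segments,
    const std::function<void(const std::string&)>& analyze)
{
    std::vector<SegmentResult> results;
    results.reserve(segments.size());
    for (const auto& segment : segments) {
        try {
            analyze(segment);
            results.push_back({true, ""});
        } catch (const std::exception& e) {
            // Record the failure and move on to the next segment.
            results.push_back({false, e.what()});
        }
    }
    return results;
}
```

With this shape, a failure in one paragraph is reported in its own result entry and the paragraphs after it are still analyzed.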

benlabbe commented 2 years ago

Here are the first results of my investigation.

Here is a sample XML file that causes SVMTagger to crash: 02552_GS_RC_MEC_682_EN_00.xml. Sample error log after my correction in SetCompilerFlags.cmake:

user:home$ analyzeXml -l eng -p TechnipTenderXML 02552_GS_RC_MEC_682_EN_00.xml
 : LP::PosTagger : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.587 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 6149 
 : LP::PosTagger : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 10389 
 : LP::PosTagger : 2021-12-09T15:26:41.809 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 52927 
 : LP::PosTagger : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' .5 : NOUN ' from SVMTagger and ' "\n.5" ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 55901 
 : LP::PosTagger : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.970 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 56480 
Total: 5317 ms

02552_GS_RC_MEC_682_EN_00.xml.zip
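
The first error in the log above says that a tagger result line did not contain 2 elements. Purely as an illustration, a check of that kind could look like the sketch below. It assumes that a result line holds exactly two whitespace-separated fields (token and tag); `parseTaggerLine` is a hypothetical name, and this is not the code used in LIMA's PosTagger.

```cpp
#include <sstream>
#include <string>

// Hypothetical parser, not LIMA's code. Assumes a result line holds exactly
// two whitespace-separated fields: the token and its tag.
// Returns true and fills `token` and `tag` on success; returns false and
// leaves both untouched otherwise.
bool parseTaggerLine(const std::string& line, std::string& token, std::string& tag)
{
    std::istringstream in(line);
    std::string first, second, extra;
    if (!(in >> first >> second)) {
        return false;  // fewer than two fields, e.g. a blank line
    }
    if (in >> extra) {
        return false;  // more than two fields
    }
    token = first;
    tag = second;
    return true;
}
```

Under that assumption, a line made only of spaces, like the one quoted in the log, is rejected instead of being split into a token and a tag.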

benlabbe commented 2 years ago

Recovery on error is handled in Release mode thanks to the fix on WITH_DEBUG_MESSAGES in commit e8e2e1185943215bcb419e8a158d1241988276af. This makes it possible to process large XML files where each page is a node (engText), with minimal impact on the final result.

The SVMTag crash is still not solved.

kleag commented 2 years ago

Solved in commit 876c29367f0dfcf49143ad126e77f957ee75ad0c:

gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 No matching category found for tagger result  ".0"   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text =  .0 
1       \x0a.0
.0      NUM     _       _       _       _       _       NE=I-Numex.NUMBER|Pos=1|Len=3

gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 No matching category found for tagger result  ".\u200B."   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text = some text.
1       some    some    DET     _       _       2       det     _       Pos=1|Len=4
2       text    text    NOUN    _       NUMBER=SING     3       Dummy   _       Pos=6|Len=4|SpaceAfter=No
3       .\x0a.  .
.       SENT    _       _       0       _       _       Pos=10|Len=3

But it does not solve the underlying tokenizer error.
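
For readers who want to see the general idea behind "avoiding the problem", here is a minimal sketch of sanitizing a token before handing it to a line-oriented tagger. This is not the code from the commit above: `hasLineBreak` and `replaceLineBreaks` are hypothetical names. The default placeholder (U+200B, zero-width space) is only a guess based on the `".\u200B."` that appears in the log, so treat it as an assumption.

```cpp
#include <string>

// True if the token contains a line break and therefore needs sanitizing.
bool hasLineBreak(const std::string& token)
{
    return token.find_first_of("\r\n") != std::string::npos;
}

// Hypothetical helper, not LIMA's actual code: returns a copy of `token` in
// which every CR or LF is replaced by `placeholder`, so that a tagger reading
// one token per line never sees a token spanning several lines.
// The default placeholder is U+200B encoded as UTF-8 (an assumed choice).
std::string replaceLineBreaks(const std::string& token,
                              const std::string& placeholder = "\xE2\x80\x8B")
{
    std::string out;
    out.reserve(token.size());
    for (char c : token) {
        if (c == '\n' || c == '\r') {
            out += placeholder;
        } else {
            out += c;
        }
    }
    return out;
}
```

As noted above, a workaround of this kind only hides the symptom: the tokenizer still produces tokens that contain newlines.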

benlabbe commented 2 years ago

Dear @kleag,

I have a new example that crashes the SVMPosTagger. The offending characters are a sequence of three dots: "...". I managed to work around the issue by replacing them in the analyzed text with the Unicode character U+2026 followed by two spaces: "…  ".
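
The workaround can be sketched as a small pre-processing step applied to the text before analysis. This is an illustration, not part of LIMA: `replaceTripleDots` is a hypothetical name and the input is assumed to be UTF-8. Note that the replacement has the same number of characters as the three dots it replaces (three code points), although it is longer in bytes.

```cpp
#include <string>

// Hypothetical pre-processing step, not part of LIMA: replaces every run of
// three ASCII dots with U+2026 (horizontal ellipsis) followed by two spaces.
// `text` is assumed to be UTF-8 encoded.
std::string replaceTripleDots(const std::string& text)
{
    static const std::string dots = "...";
    static const std::string replacement = "\xE2\x80\xA6  ";  // U+2026 + two spaces
    std::string out;
    out.reserve(text.size());
    std::string::size_type pos = 0;
    while (true) {
        const auto found = text.find(dots, pos);
        if (found == std::string::npos) {
            // No more occurrences: copy the remainder and stop.
            out.append(text, pos, std::string::npos);
            break;
        }
        out.append(text, pos, found - pos);  // text before the dots
        out += replacement;
        pos = found + dots.size();
    }
    return out;
}
```

Matching is left to right and non-overlapping, so four dots in a row become one ellipsis, two spaces, and one remaining dot.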