Closed benlabbe closed 2 years ago
Here are the first elements of my investigation.
WITH_DEBUG_MESSAGES does not behave as expected. See SetCompilerFlags.cmake, which defines WITH_DEBUG_MESSAGES as a CMake option.
.mult output file: no content, but some properties are reported by readMultFile for these nodes.
Here is a sample XML file causing SVMTagger to crash: 02552_GS_RC_MEC_682_EN_00.xml
Sample error log after my correction in SetCompilerFlags.cmake:
user:home$ analyzeXml -l eng -p TechnipTenderXML 02552_GS_RC_MEC_682_EN_00.xml
: LP::PosTagger : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in ' '
: LP::CoreClient : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:39.587 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 6149
: LP::PosTagger : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in ' '
: LP::CoreClient : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 10389
: LP::PosTagger : 2021-12-09T15:26:41.809 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 52927
: LP::PosTagger : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' .5 : NOUN ' from SVMTagger and ' "\n.5" ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 55901
: LP::PosTagger : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph
: LP::CoreClient : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit"
: XML::DocumentsReader : 2021-12-09T15:26:41.970 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 56480
Total: 5317 ms
Error recovery now works in Release mode thanks to the fix to WITH_DEBUG_MESSAGES in commit e8e2e1185943215bcb419e8a158d1241988276af. This makes it possible to process large XML files in which each page is a node (engText), with minimal impact on the final result.
The SVMTagger crash is still not solved.
Solved in commit 876c29367f0dfcf49143ad126e77f957ee75ad0c:
gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 No matching category found for tagger result ".0" "NOUN"
: LP::PosTagger : 2021-12-16T00:05:51.228 WARN 0x56012aba6760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = .0
1 \x0a.0
.0 NUM _ _ _ _ _ NE=I-Numex.NUMBER|Pos=1|Len=3
gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n."
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked.
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 No matching category found for tagger result ".\u200B." "NOUN"
: LP::PosTagger : 2021-12-16T00:06:10.982 WARN 0x5614edead760 Taking any one
# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# sent_id = 1
# text = some text.
1 some some DET _ _ 2 det _ Pos=1|Len=4
2 text text NOUN _ NUMBER=SING 3 Dummy _ Pos=6|Len=4|SpaceAfter=No
3 .\x0a. .
. SENT _ _ 0 _ _ Pos=10|Len=3
But it does not solve the underlying tokenizer error.
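For readers who have not opened the commit, the warnings above hint at the kind of guard involved. The earlier errors ("did not get 2 elements") suggest the tagger exchange is line-based, so a token containing a newline breaks it. Below is a minimal Python sketch of such a guard. It is not the code from commit 876c29367f0dfcf49143ad126e77f957ee75ad0c: the name `sanitize_token` is hypothetical, and the use of U+200B as the replacement character is an assumption inferred from the `".\u200B."` shown in the second log.

```python
from typing import Tuple

# Assumption: U+200B (zero-width space) is inferred from the log output above,
# not taken from the actual LIMA commit.
ZERO_WIDTH_SPACE = "\u200b"

# Map each newline character to one replacement character so the token
# keeps the same length in characters.
_NEWLINE_TABLE = str.maketrans({"\n": ZERO_WIDTH_SPACE, "\r": ZERO_WIDTH_SPACE})


def sanitize_token(token: str) -> Tuple[str, bool]:
    """Return (safe_token, was_modified) for a line-oriented tagger.

    Hypothetical helper: a token containing a newline would otherwise be
    split across several tagger input lines.
    """
    safe = token.translate(_NEWLINE_TABLE)
    return safe, safe != token
```

The boolean lets the caller log a warning, as the tagger does above, while still sending a well-formed line. As noted, this only hides the symptom: the tokenizer should not emit such tokens in the first place.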
Dear @kleag,
I have a new example that crashes the SVMPosTagger. The offending characters are a sequence of three dots: "...". I worked around the issue by replacing them in the analyzed text with the Unicode character U+2026 (…) followed by two spaces: "…  ".
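The workaround can be applied as a preprocessing step before calling analyzeText or analyzeXml. Here is a minimal Python sketch; the function name `replace_three_dots` is hypothetical and not part of LIMA. One property worth noting is that the replacement has the same number of characters as the original, so character offsets in the analyzed text are unchanged. Whether that is why two spaces were used is my assumption.

```python
# U+2026 (horizontal ellipsis) followed by two spaces: three characters,
# the same character count as the "..." it replaces.
ELLIPSIS_REPLACEMENT = "\u2026  "


def replace_three_dots(text: str) -> str:
    """Replace each run of three ASCII dots with U+2026 plus two spaces.

    Hypothetical preprocessing helper. The character count is preserved;
    the UTF-8 byte length is not.
    """
    return text.replace("...", ELLIPSIS_REPLACEMENT)
```

For example, `replace_three_dots("a...b")` returns `"a…  b"`, which is still five characters long.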
Describe the bug
The SVM PosTagger sometimes fails on several documents. The origin of the errors is not clear (either inside SVMTool or in the way we use it). This causes the process (either analyzeText or analyzeXml) to crash.
To Reproduce
This issue is linked with #95, which describes one of the errors occasionally encountered.
More examples are needed. @benlabbe, please upload sample XML files!
Expected behavior
Whatever the cause of the errors in the SVM PosTagger, text processing should continue without side effects for the subsequent text segments to analyze (e.g. the following paragraphs in an XML file).
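To make the expected behavior concrete, here is a minimal Python sketch of per-segment error isolation. It does not reflect LIMA's actual C++ pipeline: `analyze_segments` and `fake_analyze` are hypothetical names, and the stand-in analyzer simply fails on any segment containing a newline to imitate the tagger error.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def analyze_segments(
    segments: Iterable[str],
    analyze: Callable[[str], str],
) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Analyze each segment independently.

    A failure in one segment is recorded and does not stop the
    analysis of the following segments.
    """
    results: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for index, segment in enumerate(segments):
        try:
            results.append(analyze(segment))
        except Exception as error:  # isolate the failing segment
            failures.append((index, str(error)))
            results.append(None)  # keep result positions aligned with input
    return results, failures


def fake_analyze(segment: str) -> str:
    """Hypothetical stand-in for the real analysis: fails on newlines."""
    if "\n" in segment:
        raise ValueError("invalid token with newline(s)")
    return segment.upper()
```

Calling `analyze_segments(["page one", "bad\n.5", "page three"], fake_analyze)` gives a result for the first and third segments and one recorded failure at index 1, which is the behavior requested here for successive nodes of an XML file.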