benlabbe closed this issue 4 years ago
I confirm this with pipeline "main" and ConllDumper. morphoData has one element with an empty lemma. The same happens without SpecificEntities.
@kleag is this the intended behavior for OOV?
No, if no lemma is found, the raw text should be used as the lemma. I'll have a look at that.
I was not able to find where the lemma is lost. I tried the change below (you can ignore the new debug messages), but even though the "RecognizerMatch:getNormalizedString result:" value is now non-empty, the resulting lemma is still empty in the final BoWToken. I am handing this over to @romaricb to dig deeper into the problem.
diff --git lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
index 11a957bd..556b06d3 100644
--- lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
+++ lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
@@ -222,6 +222,10 @@ LimaString RecognizerMatch::getString() const {
}
LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const {
+#ifdef DEBUG_LP
+ AULOGINIT;
+ LDEBUG << "RecognizerMatch:getNormalizedString" << empty();
+#endif
LimaString str;
uint64_t currentPosition(0);
if (empty()) {
@@ -239,11 +243,21 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
firstHyphenPassed = true;
}
MorphoSyntacticData* data = get(vertex_data,*(m_graph->getGraph()),v);
+#ifdef DEBUG_LP
+ LDEBUG << "RecognizerMatch:getNormalizedString data" << data;
+#endif
- if (data==0 || data->empty()) {
+ if (data==0 || data->empty()
+ || sp[data->front().normalizedForm].isEmpty()) {
+#ifdef DEBUG_LP
+ LDEBUG << "RecognizerMatch:getNormalizedString stringForm" << t->stringForm();
+#endif
str += t->stringForm();
}
else {
+#ifdef DEBUG_LP
+ LDEBUG << "RecognizerMatch:getNormalizedString first norm" << sp[data->front().normalizedForm];
+#endif
// take first norm
str += sp[data->front().normalizedForm];
}
@@ -279,7 +293,8 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
}
}
- if (data == 0 || data->empty()) {
+ if (data == 0 || data->empty()
+ || sp[data->front().normalizedForm].isEmpty()) {
str += t->stringForm();
}
else {
@@ -292,6 +307,9 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
}
i++;
}
+#ifdef DEBUG_LP
+ LDEBUG << "RecognizerMatch:getNormalizedString result:" << str;
+#endif
return str;
}
I tracked the problem down to the defaultProperties unit: unknown words are associated with a lemma that is an unmarked form of themselves, and the CharChart unmark() function returns "" for the euro symbol. Some tokens are not unmarked (e.g. numbers), depending on their tokenizer status (listed in the conf file, parameter skipUnmarkStatus of defaultProperties). The tokenizer status of € is t_small: adding the t_small status to skipUnmarkStatus leads to the expected behavior.
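To make the mechanism concrete, here is a minimal standalone sketch, not LIMA code. The names `unmark`, `defaultLemma` and `kEuro` are hypothetical stand-ins, and `unmark` only mimics the reported behaviour (an empty result for €). It shows both effects described in this thread: a status listed in skipUnmarkStatus bypasses unmarking, and an empty unmarked form falls back to the raw text, as stated earlier in the thread for the case where no lemma is found.

```cpp
#include <set>
#include <string>

// UTF-8 bytes of the euro sign (U+20AC), spelled out to avoid encoding issues.
const std::string kEuro = "\xE2\x82\xAC";

// Hypothetical stand-in for the CharChart unmark() call. It only reproduces
// the reported symptom: the euro sign is "unmarked" to an empty string.
std::string unmark(const std::string& form) {
  if (form == kEuro) {
    return "";
  }
  return form;
}

// Hypothetical helper: computes the default lemma of an unknown word.
// - If the tokenizer status is listed in skipUnmarkStatus, the form is kept.
// - Otherwise the form is unmarked, and the raw form is used as a safety net
//   when unmarking yields an empty string.
std::string defaultLemma(const std::string& form,
                         const std::string& status,
                         const std::set<std::string>& skipUnmarkStatus) {
  if (skipUnmarkStatus.count(status) > 0) {
    return form;
  }
  const std::string lemma = unmark(form);
  return lemma.empty() ? form : lemma;
}
```

With this sketch, `defaultLemma(kEuro, "t_small", {})` yields the euro sign through the fallback, and `defaultLemma(kEuro, "t_small", {"t_small"})` yields it by skipping unmarking altogether. Without the fallback, the first call would return the empty string, which is the symptom reported here.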
From there, what is the correct solution?
I suggest implementing at least the 3rd one as a safeguard.
@kleag ?
OK with the 3rd proposition @romaricb
Done in commit c21269f2bbfd56efcd6cb1ef2bc15421192d61d4.
Dear fellow developers,
I am facing another issue with the latest rule-based versions of Lima, including 3332df310a (Oct 8 2020), regarding the 00734_sample.xml text in the attached archive, EmptyLemmaSample.zip.
Undesired behaviour: empty lemma in a bowToken
I am using a dedicated pipeline named BriceaAmoseAnalysis in analyzeXml to prepare multimedia documents for indexing in the AMOSE search engine.
You can use readMultFile on the 00734_sample.xml.mult binary output to see that the bowTerm id=33 contains a bowNamedEntity id=34 of type Numex.UNIT with an empty lemma.
€ (and others like $ and £) has no dedicated entry in the French lefff dictionary of lima_linguisticdata/analysisDictionary/.
Expected behaviour
I looked at lima_linguisticdata/scratch/LinguisticProcessings/fre/tokenizerAutomaton-fre.chars.tok and read that the 20AC UTF8 char is correctly defined as the € euro symbol.
@kleag, @victorbocharov, do you have any hints to solve this problem in the proper way?