aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

Empty lemma in bowToken #107

Closed benlabbe closed 4 years ago

benlabbe commented 4 years ago

Dear fellow developers,

I am facing another issue with the latest rule-based versions of Lima including 3332df310a (Oct 8 2020) regarding the 00734_sample.xml text in the attached archive.

<?xml version='1.0' encoding='utf-8'?>
<DOCSET>
 <DOC id="00734_sample">
  <DOCID>00734_sample</DOCID>
  <freTITLE>Circulaire ndeg 130 du 01-3-2018</freTITLE>
  <TEXT>
Jusqu'à 10 k€ Adjoints aux directeurs (2) 
  </TEXT>
 </DOC>
</DOCSET>

EmptyLemmaSample.zip

Undesired behaviour: empty lemma in a bowToken

I am using a dedicated pipeline named BriceaAmoseAnalysis in analyzeXml to prepare multimedia documents for indexing in the AMOSE search engine.

analyzeXml -l fre -p BriceaAmoseAnalysis 00734_sample.xml

You can use readMultFile on the 00734_sample.xml.mult binary output to see that
the bowTerm id=33 contains a bowNamedEntity id=34 of type Numex.UNIT with an empty lemma

readMultFile 00734_sample.xml.mult
...
            <bowTerm id="33" lemma="_directeur" category="NC" position="718" length="25">
              <parts head="0">
                <bowNamedEntity id="34" lemma="" category="NC" position="718" length="1" type="Numex.UNIT">
                  <parts head="0">
                    <bowToken id="35" lemma="" category="NC" position="718" length="1"/>
                  </parts>
                  <feature name="value" value=""/>
                </bowNamedEntity>
                <bowRelation realization="au+les" type="3"/>
                <bowToken id="36" lemma="directeur" category="NC" position="733" length="10"/>
              </parts>
            </bowTerm>
...
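
On a larger document the affected elements can be hard to spot by eye. The following is a minimal sketch, assuming the readMultFile output has been captured as text (for instance by redirecting it to a file and reading it into a string). The function name emptyLemmaLines is hypothetical and is not part of LIMA; it only looks for the literal attribute lemma="" shown above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of LIMA): returns the 1-based numbers of the
// lines in `dump` that contain an empty lemma attribute, i.e. lemma="".
std::vector<std::size_t> emptyLemmaLines(const std::string& dump) {
  std::vector<std::size_t> hits;
  std::istringstream in(dump);
  std::string line;
  std::size_t lineNumber = 0;
  while (std::getline(in, line)) {
    ++lineNumber;
    // Plain substring search: enough for the attribute layout printed above,
    // but not a real XML parser.
    if (line.find("lemma=\"\"") != std::string::npos) {
      hits.push_back(lineNumber);
    }
  }
  return hits;
}
```

Applied to the excerpt above, such a scan would flag the bowNamedEntity id="34" and bowToken id="35" lines and skip the two lines whose lemma is set.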

Expected behaviour

@kleag, @victorbocharov, do you have any hints on how to solve this problem properly?

victorbocharov commented 4 years ago

I confirm this with pipeline "main" and ConllDumper. morphoData has one element with an empty lemma. The same happens without SpecificEntities.

@kleag is this the intended behavior for OOV?

kleag commented 4 years ago

"@kleag is this the intended behavior for OOV?"

No, if no lemma is found, the raw text should be used as the lemma. I'll have a look at that.

kleag commented 4 years ago

I was not able to find where the lemma is lost. I tried the change below (you can ignore the new debug messages), but even though the "RecognizerMatch:getNormalizedString result:" value is no longer empty, the resulting lemma is still empty in the final BoWToken. I am handing this over to @romaricb to dig deeper into the problem.

diff --git lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
index 11a957bd..556b06d3 100644
--- lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
+++ lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
@@ -222,6 +222,10 @@ LimaString RecognizerMatch::getString() const {
 }

 LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const {
+#ifdef DEBUG_LP
+  AULOGINIT;
+  LDEBUG << "RecognizerMatch:getNormalizedString" << empty();
+#endif
   LimaString str;
   uint64_t currentPosition(0);
   if (empty()) {
@@ -239,11 +243,21 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
         firstHyphenPassed = true;
       }
       MorphoSyntacticData* data = get(vertex_data,*(m_graph->getGraph()),v);
+#ifdef DEBUG_LP
+      LDEBUG << "RecognizerMatch:getNormalizedString data" << data;
+#endif

-      if (data==0 || data->empty()) {
+      if (data==0 || data->empty()
+        || sp[data->front().normalizedForm].isEmpty()) {
+#ifdef DEBUG_LP
+        LDEBUG << "RecognizerMatch:getNormalizedString stringForm" << t->stringForm();
+#endif
         str += t->stringForm();
       }
       else {
+#ifdef DEBUG_LP
+        LDEBUG << "RecognizerMatch:getNormalizedString first norm" << sp[data->front().normalizedForm];
+#endif
         // take first norm
         str += sp[data->front().normalizedForm];
       }
@@ -279,7 +293,8 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
           }
         }

-        if (data == 0 || data->empty()) {
+        if (data == 0 || data->empty()
+          || sp[data->front().normalizedForm].isEmpty()) {
           str += t->stringForm();
         }
         else {
@@ -292,6 +307,9 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
     }
     i++;
   }
+#ifdef DEBUG_LP
+  LDEBUG << "RecognizerMatch:getNormalizedString result:" << str;
+#endif
   return str;
 }

romaricb commented 4 years ago

I tracked the problem down to the defaultProperties unit: unknown words are assigned a lemma that is an unmarked form of themselves, and the CharChart unmark() function returns "" for the euro symbol. Some tokens are not unmarked (e.g. numbers), depending on their tokenizer status (listed in the configuration file, in the skipUnmarkStatus parameter of defaultProperties). The tokenizer status of € is t_small: adding t_small to skipUnmarkStatus produces the expected behavior.
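
To make the mechanism concrete, here is a minimal, self-contained sketch of the behaviour described above, combined with the rule stated earlier in the thread that the raw text should be used when no lemma is found. It is an illustration under assumptions, not the LIMA code and not necessarily what was eventually committed: toyUnmark and defaultLemma are hypothetical names, std::string stands in for LimaString, and the toy unmarking only lowercases ASCII letters.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for the CharChart unmark() function (hypothetical: the real one
// operates on LimaString). It lowercases ASCII letters and, to reproduce the
// reported symptom, maps the euro sign to an empty string.
std::string toyUnmark(const std::string& form) {
  if (form == "€") {
    return "";
  }
  std::string out;
  for (char c : form) {
    out += (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
  }
  return out;
}

// Hypothetical default-lemma rule for an unknown word.
// - Tokens whose tokenizer status is listed in skipUnmarkStatus keep their
//   surface form as lemma.
// - Other tokens get their unmarked form, unless unmarking yields an empty
//   string, in which case the surface form is kept as a safety net.
std::string defaultLemma(const std::string& form,
                         const std::string& status,
                         const std::set<std::string>& skipUnmarkStatus) {
  std::string lemma = form;
  if (skipUnmarkStatus.count(status) == 0) {
    lemma = toyUnmark(form);
    if (lemma.empty()) {
      lemma = form;  // fall back to the raw text instead of an empty lemma
    }
  }
  // Invariant of this sketch: a non-empty form never gets an empty lemma.
  assert(form.empty() || !lemma.empty());
  return lemma;
}
```

With an empty skip list, defaultLemma("€", "t_small", {}) yields "€" through the fallback. Listing t_small in the skip set gives the same result without calling the unmarking step at all, which corresponds to the configuration workaround mentioned above.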

From there, what is the correct solution?

I suggest implementing at least the 3rd one as a safety net.

@kleag ?

kleag commented 4 years ago

OK with the 3rd proposition @romaricb

romaricb commented 4 years ago

Done in commit c21269f2bbfd56efcd6cb1ef2bc15421192d61d4.