Empty lemma in bowToken

benlabbe commented 4 years ago

Dear fellow developers,

I am facing another issue with the latest rule-based versions of Lima, including 3332df310a (Oct 8 2020), when analysing the 00734_sample.xml text in the attached archive.

<?xml version='1.0' encoding='utf-8'?>
<DOCSET>
 <DOC id="00734_sample">
  <DOCID>00734_sample</DOCID>
  <freTITLE>Circulaire ndeg 130 du 01-3-2018</freTITLE>
  <TEXT>
Jusqu'à 10 k€ Adjoints aux directeurs (2) 
  </TEXT>
 </DOC>
</DOCSET>

EmptyLemmaSample.zip

Undesired behaviour : empty lemma in a bowToken

I am using a dedicated pipeline named BriceaAmoseAnalysis in analyzeXml to prepare multimedia documents for indexation in the AMOSE search engine.

analyzeXml -l fre -p BriceaAmoseAnalysis 00734_sample.xml

You can use readMultFile on the 00734_sample.xml.mult binary output to see that the bowTerm id=33 contains a bowNamedEntity id=34 of type Numex.UNIT with an empty lemma:

readMultFile 00734_sample.xml.mult

...
            <bowTerm id="33" lemma="_directeur" category="NC" position="718" length="25">
              <parts head="0">
                <bowNamedEntity id="34" lemma="" category="NC" position="718" length="1" type="Numex.UNIT">
                  <parts head="0">
                    <bowToken id="35" lemma="" category="NC" position="718" length="1"/>
                  </parts>
                  <feature name="value" value=""/>
                </bowNamedEntity>
                <bowRelation realization="au+les" type="3"/>
                <bowToken id="36" lemma="directeur" category="NC" position="733" length="10"/>
              </parts>
            </bowTerm>
...

I found that the symbol € (like others such as $ and £) has no dedicated entry in the French lefff dictionary in lima_linguisticdata/analysisDictionary/.

When I add them to the dictionary, the problem seems to be solved, but I expected a different behaviour without these entries.

@@ -97,6 +97,9 @@ $     mâle            ADJ
?!?!   !?              PONCTU_FORTE
???    ?               PONCTU_FORTE
@      à               P
+€      euro            NC:ms-
+\$     dollar          NC:ms-
+£      livre           NC:fs-
ACoruña        LaCorogne               NPP:fs-
ADSL   ADSL            NC:m--
AFP    AFP             NPP:fs-
@@ -126982,6 +126985,7 @@ canasson      canasson                NC:ms-

Expected behaviour

I expected unknown words to be left intact when no normalizing lemma is found in the dictionary.
I checked the tokenizer rules as best I could in lima_linguisticdata/scratch/LinguisticProcessings/fre/tokenizerAutomaton-fre.chars.tok and saw that the Unicode character 20AC is correctly defined as the € euro symbol.

@kleag, @victorbocharov, do you have any hints on how to solve this problem properly?

victorbocharov commented 4 years ago

I confirm this with pipeline "main" and ConllDumper. morphoData has one element with an empty lemma. The same happens without SpecificEntities.

@kleag is this the intended behavior for OOV?

kleag commented 4 years ago

> @kleag is this the intended behavior for OOV?

No. If no lemma is found, the raw text should be used as the lemma. I'll have a look at that.

kleag commented 4 years ago

I was not able to find where the lemma is lost. I tried the change below (you can ignore the new debug messages), but even though the "RecognizerMatch:getNormalizedString result:" value is no longer empty, the resulting lemma is still empty in the final BoWToken. I am handing this over to @romaricb to dig deeper into the problem.

diff --git lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
index 11a957bd..556b06d3 100644
--- lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
+++ lima_linguisticprocessing/src/linguisticProcessing/core/Automaton/recognizerMatch.cpp
@@ -222,6 +222,10 @@ LimaString RecognizerMatch::getString() const {
 }

 LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const {
+#ifdef DEBUG_LP
+  AULOGINIT;
+  LDEBUG << "RecognizerMatch:getNormalizedString" << empty();
+#endif
   LimaString str;
   uint64_t currentPosition(0);
   if (empty()) {
@@ -239,11 +243,21 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
         firstHyphenPassed = true;
       }
       MorphoSyntacticData* data = get(vertex_data,*(m_graph->getGraph()),v);
+#ifdef DEBUG_LP
+      LDEBUG << "RecognizerMatch:getNormalizedString data" << data;
+#endif

-      if (data==0 || data->empty()) {
+      if (data==0 || data->empty()
+        || sp[data->front().normalizedForm].isEmpty()) {
+#ifdef DEBUG_LP
+        LDEBUG << "RecognizerMatch:getNormalizedString stringForm" << t->stringForm();
+#endif
         str += t->stringForm();
       }
       else {
+#ifdef DEBUG_LP
+        LDEBUG << "RecognizerMatch:getNormalizedString first norm" << sp[data->front().normalizedForm];
+#endif
         // take first norm
         str += sp[data->front().normalizedForm];
       }
@@ -279,7 +293,8 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
           }
         }

-        if (data == 0 || data->empty()) {
+        if (data == 0 || data->empty()
+          || sp[data->front().normalizedForm].isEmpty()) {
           str += t->stringForm();
         }
         else {
@@ -292,6 +307,9 @@ LimaString RecognizerMatch::getNormalizedString(const FsaStringsPool& sp) const
     }
     i++;
   }
+#ifdef DEBUG_LP
+  LDEBUG << "RecognizerMatch:getNormalizedString result:" << str;
+#endif
   return str;
 }

romaricb commented 4 years ago

I tracked the problem down to the defaultProperties unit: unknown words are assigned a lemma that is the unmarked form of themselves, and the CharChart unmark() function returns "" for the euro symbol. Some tokens (e.g. numbers) are not unmarked, depending on their tokenizer status; the statuses concerned are listed in the conf file, in the skipUnmarkStatus parameter of defaultProperties. The tokenizer status of € is t_small: adding t_small to skipUnmarkStatus leads to the expected behavior.
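
To make the mechanism above concrete, here is a minimal, self-contained sketch. It is not the LIMA code: toyUnmark is a hypothetical stand-in for CharChart's unmark() that only reproduces the "unmark of € gives an empty string" behaviour, defaultLemma is an invented helper name, and the status names other than t_small are illustrative.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// UTF-8 bytes of the euro sign, spelled out to avoid source-encoding issues.
const std::string kEuro = "\xE2\x82\xAC";

// Hypothetical stand-in for CharChart's unmark(): keeps ASCII letters and
// digits (lowercased) and drops every other byte. This is NOT the LIMA
// implementation; it only mimics "unmark of the euro sign is empty".
std::string toyUnmark(const std::string& form) {
  std::string out;
  for (unsigned char c : form) {
    if (std::isalnum(c)) {
      out += static_cast<char>(std::tolower(c));
    }
  }
  return out;
}

// Default lemma of an unknown word: the raw form if its tokenizer status is
// listed in skipUnmarkStatus, the unmarked form otherwise.
std::string defaultLemma(const std::string& form,
                         const std::string& tokenizerStatus,
                         const std::set<std::string>& skipUnmarkStatus) {
  if (skipUnmarkStatus.count(tokenizerStatus) > 0) {
    return form;
  }
  return toyUnmark(form);
}

// Usage: "t_small" comes from the thread; "t_integer" and "t_capital_1st"
// are illustrative status names.
void demoSkipUnmarkStatus() {
  const std::set<std::string> before{"t_integer"};
  const std::set<std::string> after{"t_integer", "t_small"};
  assert(defaultLemma(kEuro, "t_small", before).empty());  // reported bug
  assert(defaultLemma(kEuro, "t_small", after) == kEuro);  // status skipped
  assert(defaultLemma("Unknown", "t_capital_1st", before) == "unknown");
}
```

With t_small absent from the set, the euro token gets an empty lemma; once t_small is listed, the raw form is kept, at the cost of no longer unmarking any other t_small token.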

From there, what is the correct solution?

1. Modify the character handling/tokenizer to handle symbols/delimiters better: € is a c_del1, and I'm not sure why the token is t_small. Should all delimiters be unmarked as ""?
2. Never unmark unknown words (this would lose the matching between Unknown/unknown/UNKNOWN).
3. Try to unmark and, if that leads to an empty string, go back to the original string.

I suggest implementing at least the 3rd one as a safety net.
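
For illustration, the 3rd option could look roughly like the sketch below. It is only a sketch under assumptions: unmarkOrKeep and toyUnmark are hypothetical names, toyUnmark merely imitates an unmark function that returns "" for the euro symbol, and none of this is meant to describe the actual LIMA code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// UTF-8 bytes of the euro sign, spelled out to avoid source-encoding issues.
const std::string kEuro = "\xE2\x82\xAC";

// Hypothetical unmark function: keeps ASCII letters and digits (lowercased)
// and drops every other byte, so a lone symbol becomes an empty string.
std::string toyUnmark(const std::string& form) {
  std::string out;
  for (unsigned char c : form) {
    if (std::isalnum(c)) {
      out += static_cast<char>(std::tolower(c));
    }
  }
  return out;
}

// Third option: try to unmark, and go back to the original string when the
// result is empty. `unmark` is any callable taking and returning std::string.
template <typename Unmark>
std::string unmarkOrKeep(const std::string& form, Unmark unmark) {
  const std::string unmarked = unmark(form);
  return unmarked.empty() ? form : unmarked;
}

void demoFallback() {
  // The symbol survives instead of becoming an empty lemma.
  assert(unmarkOrKeep(kEuro, toyUnmark) == kEuro);
  // Ordinary unknown words are still unmarked, so case variants still match.
  assert(unmarkOrKeep("UNKNOWN", toyUnmark) == "unknown");
  assert(unmarkOrKeep("Unknown", toyUnmark) ==
         unmarkOrKeep("unknown", toyUnmark));
}
```

The fallback only changes the result for tokens whose unmarked form is empty, which is why it works as a safety net independently of options 1 and 2.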

@kleag ?

kleag commented 4 years ago

OK with the 3rd proposition @romaricb

romaricb commented 4 years ago

Done in commit c21269f2bbfd56efcd6cb1ef2bc15421192d61d4.

aymara / lima

Empty lemma in bowToken #107
