aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

Error in the analysis of consecutive numeric entities #50

Closed kleag closed 7 years ago

kleag commented 8 years ago

After named entity recognition, the input "1234 3.2 4,5" yields:

<specific_entities>
<specific_entity>
  <string>1234 3.2</string>
  <position>1</position>
  <length>8</length>
  <type>Numex.NUMBER</type>
</specific_entity>
<specific_entity>
  <string>1234 3.2 4,5</string>
  <position>1</position>
  <length>12</length>
  <type>Numex.NUMBER</type>
</specific_entity>
</specific_entities>

while we should get three different entities.
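For reference, the expected result can be sketched outside LIMA. This is only an illustration: the function name `expected_number_spans` and the regular expression are assumptions, not LIMA code. Positions are 1-based, like the `<position>` values above.

```python
import re

def expected_number_spans(text):
    """Return (string, position, length) for each numeric token in text.

    Positions are 1-based to match the <position> values in the XML above.
    The pattern is an assumption: a digit run, optionally followed by
    '.' or ',' and another digit run.
    """
    pattern = re.compile(r"\d+(?:[.,]\d+)?")
    return [(m.group(), m.start() + 1, len(m.group()))
            for m in pattern.finditer(text)]

spans = expected_number_spans("1234 3.2 4,5")
for string, position, length in spans:
    print(string, position, length)
```

Each tuple corresponds to one `<specific_entity>` element that the analysis should produce.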

The Modex rules can be improved, but not completely, because numeric transitions are only possible on integers, not on real numbers.

I tried to change the code to allow transitions on real numbers, but it does not work. My attempt is on the branch AutomatonTransitionOnDouble. I probably forgot to change something somewhere, but I cannot figure out what.

kleag commented 8 years ago

My correction improves the detection of some entities, but a generic fix requires continuing the work started in the branch.

kleag commented 7 years ago

@romaricb could you have a look at that, please?

romaricb commented 7 years ago

The problem appears only in English. The rules for the recognition of numbers contain

@Number=($NOMBRE) @Number::(@Number|million|billion){0-n}:NUMBER:=>NormalizeNumber()

Since all numeric forms are associated with the POS $NOMBRE, this rule merges all consecutive numeric forms into one.

Changing this rule to @Number::(million|billion){0-n}:NUMBER:=>NormalizeNumber() could correct this issue, but the original rule exists to handle text forms of numbers (e.g. three hundred thousand). We may need a way to differentiate text forms from numeric forms of numbers.
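To make the trade-off concrete, here is a rough Python sketch of the two behaviours. It is not LIMA's matching engine and does not reproduce the overlapping entities from the original report. The word list `TEXT_NUMBER_FORMS` and the function names are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical, deliberately small list of text forms of numbers.
TEXT_NUMBER_FORMS = {"one", "two", "three", "ten", "hundred", "thousand"}
MULTIPLIERS = {"million", "billion"}
DIGIT_FORM = re.compile(r"\d+(?:[.,]\d+)?")

def is_digit_form(token):
    return DIGIT_FORM.fullmatch(token) is not None

def is_text_form(token):
    return token.lower() in TEXT_NUMBER_FORMS

def is_multiplier(token):
    return token.lower() in MULTIPLIERS

def merge_any_number(tokens):
    """Mimic the original rule: any numeric form absorbs every
    following numeric form or multiplier."""
    entities, i = [], 0
    while i < len(tokens):
        if is_digit_form(tokens[i]) or is_text_form(tokens[i]):
            j = i + 1
            while j < len(tokens) and (is_digit_form(tokens[j])
                                       or is_text_form(tokens[j])
                                       or is_multiplier(tokens[j])):
                j += 1
            entities.append(" ".join(tokens[i:j]))
            i = j
        else:
            i += 1
    return entities

def merge_by_form(tokens):
    """Differentiate forms: digit forms only absorb multipliers,
    text forms absorb further text forms and multipliers."""
    entities, i = [], 0
    while i < len(tokens):
        if is_digit_form(tokens[i]):
            j = i + 1
            while j < len(tokens) and is_multiplier(tokens[j]):
                j += 1
        elif is_text_form(tokens[i]):
            j = i + 1
            while j < len(tokens) and (is_text_form(tokens[j])
                                       or is_multiplier(tokens[j])):
                j += 1
        else:
            i += 1
            continue
        entities.append(" ".join(tokens[i:j]))
        i = j
    return entities

print(merge_any_number("1234 3.2 4,5".split()))
print(merge_by_form("1234 3.2 4,5".split()))
print(merge_by_form("three hundred thousand".split()))
```

With this split, consecutive digit forms stay separate while a text form such as "three hundred thousand" is still grouped into a single entity.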

romaricb commented 7 years ago

Corrected the rules to take this problem into account. I used a list of text forms of numbers in the rule files to make the rules more explicit.

kleag commented 7 years ago

Note that the correction was in commit de70ea86736c94f9ff9a6d2b9f5035f87862d769