aymara / lima

The Libre Multilingual Analyzer, a Natural Language Processing (NLP) C++ toolkit.
http://aymara.github.io/lima/

When using the deeplima backend, PoS tags from the first text are reused for the following ones #174

Closed kleag closed 4 months ago

kleag commented 4 months ago

Describe the bug: When using LIMA with the deeplima backend, PoS tagging is correct when a single file is analyzed, but when several files are analyzed in one run, the tags from the first file are reused for the files that follow.

To Reproduce (steps to reproduce the behavior):

  1. Analyse a text file with the deeplima pipeline.
  2. Check that the PoS tags are correct overall.
  3. Rerun the analysis, this time with two files.
  4. Observe that the tags for the second file are wrong: they are the tags from the first file.
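
Step 4 can be checked mechanically rather than by eye. The following is a minimal sketch, not part of LIMA: the function names `upos_tags` and `tag_mismatches` are hypothetical, and it assumes the CoNLL-U output of each run has been captured as text, that `batch_output` contains only the part of the multi-file run that belongs to the same file as `solo_output`, and that token forms contain no whitespace. Interleaved log lines and `#` comment lines are skipped.

```python
import re

# A token line starts with an integer ID followed by whitespace.
# Timestamps such as "2024-05-15 ..." do not match because "2024" is followed by "-".
TOKEN_LINE = re.compile(r"^\d+\s")


def upos_tags(conllu_text):
    """Return (form, upos) pairs from token lines, skipping comments and log lines."""
    pairs = []
    for line in conllu_text.splitlines():
        if not TOKEN_LINE.match(line):
            continue
        cols = line.split()
        if len(cols) >= 4:
            # CoNLL-U columns: ID, FORM, LEMMA, UPOS, ...
            pairs.append((cols[1], cols[3]))
    return pairs


def tag_mismatches(solo_output, batch_output):
    """List (token index, form, solo tag, batch tag) wherever the two runs disagree."""
    solo = upos_tags(solo_output)
    batch = upos_tags(batch_output)
    return [
        (i, form, solo_tag, batch_tag)
        for i, ((form, solo_tag), (_, batch_tag)) in enumerate(zip(solo, batch), start=1)
        if solo_tag != batch_tag
    ]
```

An empty result means the tags of the file are identical in both runs; any entry points at a token whose tag changed when the file was analysed after another one.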

Expected behavior: PoS tags should be correct for every file, regardless of how many files are analysed in one run.

Screenshots

❯ analyzeText -l ud --meta udlang:eng-UD_English-EWT -p deeplima test-eng12.txt
2024-05-15 16:05:24.148156: I /build/tensorflow-for-lima-JkNXYb/tensorflow-for-lima-1.9/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA

Analyzing 1/1 (100.00%) 'test-eng12.txt'# sent_id = 1
# text = The Airbus A380 is the largest airplane in the world
1       The     _       DET     _       Definite=Def|PronType=Art       4       det     _       Pos=1|Len=3|SpaceAfter=No
2       Airbus  _       PROPN   _       Number=Sing     4       compound        _       Pos=4|Len=6
3       A       _       NOUN    _       Number=Sing     4       compound        _       Pos=11|Len=1
4       380     _       PROPN   _       Number=Sing     8       nsubj   _       Pos=13|Len=3|SpaceAfter=No
5       is      _       AUX     _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   8       cop     _       Pos=16|Len=2
6       the     _       DET     _       Definite=Def|PronType=Art       8       det     _       Pos=19|Len=3
7       largest _       ADJ     _       Degree=Sup      8       amod    _       Pos=23|Len=7
8       airplane        _       NOUN    _       Number=Sing     0       root    _       Pos=31|Len=8
9       in      _       ADP     _       _       11      case    _       Pos=40|Len=2
10      the     _       DET     _       Definite=Def|PronType=Art       11      det     _       Pos=43|Len=3
11      world   _       NOUN    _       Number=Sing     8       nmod    _       Pos=47|Len=5
12      .       _       PUNCT   _       _       8       punct   _       Pos=53|Len=1|SpaceAfter=No

# sent_id = 2
# text =  It is used by Air France and Japan Airlines
1       It      _       PRON    _       Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  3       nsubj   _       Pos=54|Len=2
2       is      _       AUX     _       Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   3       aux     _       Pos=57|Len=2
3       used    _       VERB    _       Tense=Past|VerbForm=Part|Voice=Pass     0       root    _       Pos=60|Len=4
4       by      _       ADP     _       _       6       case    _       Pos=65|Len=2
5       Air     _       PROPN   _       Number=Sing     6       compound        _       Pos=68|Len=3
6       France  _       PROPN   _       Number=Sing     3       obl     _       Pos=72|Len=6
7       and     _       CCONJ   _       _       9       cc      _       Pos=79|Len=3
8       Japan   _       PROPN   _       Number=Sing     9       compound        _       Pos=83|Len=5
9       Airlines        _       PROPN   _       Number=Plur     6       conj    _       Pos=89|Len=8
10      .       _       PUNCT   _       _       3       punct   _       Pos=98|Len=1

❯ analyzeText -l ud --meta udlang:eng-UD_English-EWT -p deeplima test-eng11.txt test-eng12.txt
2024-05-15 16:09:00.904687: I /build/tensorflow-for-lima-JkNXYb/tensorflow-for-lima-1.9/tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
Analyzing 1/2 (50.00%) 'test-eng11.txt'# sent_id = 1
# text = * Sylva and Shining Cliff Woods are at Inverleith House, Edinburgh, open from tomorrow daily 11am-3.30pm until January 29
1       *       _       PUNCT   _       _       14      punct   _       Pos=1|Len=1|SpaceAfter=No
2       Sylva   _       PROPN   _       Number=Sing     14      nsubj   _       Pos=2|Len=5
3       and     _       CCONJ   _       _       6       cc      _       Pos=8|Len=3
4       Shining _       PROPN   _       VerbForm=Ger    5       amod    _       Pos=12|Len=7
5       Cliff   _       PROPN   _       Number=Sing     6       compound        _       Pos=20|Len=5
6       Woods   _       PROPN   _       Number=Plur     10      nsubj   _       Pos=26|Len=5
7       are     _       AUX     _       Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   10      cop     _       Pos=32|Len=3
8       at      _       ADP     _       _       10      case    _       Pos=36|Len=2
9       Inverleith      _       PROPN   _       Number=Sing     10      compound        _       Pos=39|Len=10
10      House   _       PROPN   _       Number=Sing     0       root    _       Pos=50|Len=5
11      ,       _       PUNCT   _       _       10      punct   _       Pos=56|Len=1|SpaceAfter=No
12      Edinburgh       _       PROPN   _       Number=Sing     10      appos   _       Pos=57|Len=9
13      ,       _       PUNCT   _       _       10      punct   _       Pos=67|Len=1|SpaceAfter=No
14      open    _       ADJ     _       Degree=Pos      10      amod    _       Pos=68|Len=4
15      from    _       ADP     _       _       16      case    _       Pos=73|Len=4
16      tomorrow        _       NOUN    _       Number=Sing     14      obl     _       Pos=78|Len=8
17      daily   _       ADJ     _       Degree=Pos      16      amod    _       Pos=87|Len=5
18      11      _       NUM     _       NumType=Card    19      nummod  _       Pos=93|Len=2
19      am      _       NOUN    _       Number=Sing     14      obl     _       Pos=96|Len=2|SpaceAfter=No
20      -       _       SYM     _       _       22      case    _       Pos=98|Len=1|SpaceAfter=No
21      3.30    _       NUM     _       NumType=Card    22      nummod  _       Pos=99|Len=4|SpaceAfter=No
22      pm      _       NOUN    _       Number=Sing     17      nmod    _       Pos=103|Len=2|SpaceAfter=No
23      until   _       ADP     _       _       24      case    _       Pos=105|Len=5
24      January _       PROPN   _       Number=Sing     14      obl     _       Pos=111|Len=7
25      29      _       NUM     _       NumType=Card    24      nummod  _       Pos=119|Len=2
26      .       _       PUNCT   _       _       10      punct   _       Pos=122|Len=1
Analyzing 2/2 (100.00%) 'test-eng12.txt'# sent_id = 1
# text = The Airbus A380 is the largest airplane in the world
 : LP::Dumper : 2024-05-15T16:10:41.509 ERROR 0x60e595ca8080 ConllDumper::process target 15 not found in segmentation mapping 
1       The     _       PUNCT   _       _       0       punct   _       Pos=1|Len=3|SpaceAfter=No
 : LP::Dumper : 2024-05-15T16:10:41.509 ERROR 0x60e595ca8080 ConllDumper::process target 15 not found in segmentation mapping 
2       Airbus  _       PROPN   _       Number=Sing     0       nsubj   _       Pos=4|Len=6
3       A       _       CCONJ   _       _       6       cc      _       Pos=11|Len=1
4       380     _       PROPN   _       VerbForm=Ger    5       amod    _       Pos=13|Len=3|SpaceAfter=No
5       is      _       PROPN   _       Number=Sing     6       compound        _       Pos=16|Len=2
6       the     _       PROPN   _       Number=Plur     10      nsubj   _       Pos=19|Len=3
7       largest _       AUX     _       Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   10      cop     _       Pos=23|Len=7
8       airplane        _       ADP     _       _       10      case    _       Pos=31|Len=8
9       in      _       PROPN   _       Number=Sing     10      compound        _       Pos=40|Len=2
10      the     _       PROPN   _       Number=Sing     0       root    _       Pos=43|Len=3
11      world   _       PUNCT   _       _       10      punct   _       Pos=47|Len=5
12      .       _       PROPN   _       Number=Sing     10      appos   _       Pos=53|Len=1|SpaceAfter=No

# sent_id = 2
# text =  It is used by Air France and Japan Airlines
 : LP::Dumper : 2024-05-15T16:10:41.511 ERROR 0x60e595ca8080 ConllDumper::process target 11 not found in segmentation mapping 
1       It      _       PUNCT   _       _       0       punct   _       Pos=54|Len=2
 : LP::Dumper : 2024-05-15T16:10:41.511 ERROR 0x60e595ca8080 ConllDumper::process target 11 not found in segmentation mapping 
2       is      _       ADJ     _       Degree=Pos      0       amod    _       Pos=57|Len=2
3       used    _       ADP     _       _       4       case    _       Pos=60|Len=4
4       by      _       NOUN    _       Number=Sing     2       obl     _       Pos=65|Len=2
5       Air     _       ADJ     _       Degree=Pos      4       amod    _       Pos=68|Len=3
6       France  _       NUM     _       NumType=Card    7       nummod  _       Pos=72|Len=6
7       and     _       NOUN    _       Number=Sing     2       obl     _       Pos=79|Len=3
8       Japan   _       SYM     _       _       10      case    _       Pos=83|Len=5
9       Airlines        _       NUM     _       NumType=Card    10      nummod  _       Pos=89|Len=8
10      .       _       NOUN    _       Number=Sing     5       nmod    _       Pos=98|Len=1
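
The symptom in the second run can be stated precisely: token by token, the tags printed for test-eng12.txt are the tags of test-eng11.txt in the same order. The sketch below only checks that symptom and says nothing about its cause inside LIMA. The helper name `reuses_earlier_tags` is hypothetical; the two tag lists are copied from the UPOS column of the listing above.

```python
def reuses_earlier_tags(first_tags, second_tags):
    """True if second_tags is a non-empty prefix of first_tags."""
    n = len(second_tags)
    return n > 0 and first_tags[:n] == second_tags


# UPOS column of test-eng11.txt in the two-file run (26 tokens).
first = [
    "PUNCT", "PROPN", "CCONJ", "PROPN", "PROPN", "PROPN", "AUX", "ADP",
    "PROPN", "PROPN", "PUNCT", "PROPN", "PUNCT", "ADJ", "ADP", "NOUN",
    "ADJ", "NUM", "NOUN", "SYM", "NUM", "NOUN", "ADP", "PROPN", "NUM", "PUNCT",
]

# UPOS column of test-eng12.txt in the same run (12 + 10 tokens).
second = [
    "PUNCT", "PROPN", "CCONJ", "PROPN", "PROPN", "PROPN", "AUX", "ADP",
    "PROPN", "PROPN", "PUNCT", "PROPN",
    "PUNCT", "ADJ", "ADP", "NOUN", "ADJ", "NUM", "NOUN", "SYM", "NUM", "NOUN",
]

print(reuses_earlier_tags(first, second))
```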

kleag commented 4 months ago

Corrected in b7512fad