endu50 / mate-tools

Automatically exported from code.google.com/p/mate-tools

Parser kills first token of each sentence #8

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
We built an implementation around mate tools version 3.5 that injects input in 
the CollReader format (one token per line, with a line break between 
sentences). 
The Parser, however, shows some strange behaviour: it drops the first token of 
each sentence and starts with the second. This is a relatively new issue and 
might have something to do with the input format or encoding. The Lemmatizer 
and POS-tagger work fine. All the data is encoded in UTF-8.
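
For reference, the input layout described above can be sketched as follows. This is only an illustration in Python, not mate tools code: the function name `split_sentences` is hypothetical, and it assumes that an empty line marks the sentence boundary.

```python
def split_sentences(text):
    """Group one-token-per-line input into sentences (lists of tokens).

    Assumption: an empty line marks the end of a sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            # An empty line closes the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # The last sentence may not be followed by an empty line.
        sentences.append(current)
    return sentences


sample = "Der\nBuchstabe\nA\n\nHallo\nWelt\n"
sentences = split_sentences(sample)
```

With this layout, every sentence handed to the parser should start with its first real token, which is the token that goes missing in the output below.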

Example output (Der Buchstabe A hat eine durchschnittliche Häufigkeit von 
6.51%.):
 -------- TOKEN FORMS @AFTER PARSE
2       Buchstabe       _       buchstabe       _       NN      _       case=nom|number=sg|gender=masc  -1      3       _       SB      _       _
3       A       _       --      _       NE      _       case=nom|number=sg|gender=*     -1      1       _       NK      _       _
4       hat     _       haben   _       VAFIN   _       number=sg|person=3|tense=pres|mood=ind  -1      0       _       --      _       _
5       in      _       in      _       APPR    _       _       -1      3       _       MO      _       _
6       deutschen       _       deutsch _       ADJA    _       case=dat|number=pl|gender=fem|degree=pos        -1      6       _       NK      _       _
7       Texten  _       text    _       NN      _       case=dat|number=pl|gender=fem   -1      4       _       NK      _       _
8       eine    _       ein     _       ART     _       case=acc|number=sg|gender=fem   -1      9       _       NK      _       _
9       durchschnittliche       _       durchschnittlich        _       ADJA    _       case=acc|number=sg|gender=fem|degree=pos        -1      9       _       NK      _       _
10      Häufigkeit      _       häufigkeit      _       NN      _       case=acc|number=sg|gender=fem   -1      3       _       OA      _       _
11      von     _       von     _       APPR    _       _       -1      9       _       MNR     _       _
12      6,51    _       6,51    _       CARD    _       _       -1      12      _       NK      _       _
13      %       _       %       _       NN      _       case=*|number=*|gender=neut     -1      10      _       NK      _       _
14      .       _       --      _       $.      _       _       -1      12      _       --      _       _

Thanks

Original issue reported on code.google.com by micha.h...@gmail.com on 31 Oct 2013 at 11:15

GoogleCodeExporter commented 8 years ago
Just for comparison I ran the previous version, 3.3. It seems to work with that 
version, so I am limiting this issue to 3.5 only!

Original comment by micha.h...@gmail.com on 31 Oct 2013 at 11:43

GoogleCodeExporter commented 8 years ago
I fixed that problem and found an accuracy issue related to it. I changed the 
internal interfaces to check whether a root token was included. If the input 
does not include a root token, one is now added in a consistent way. Root 
tokens should *not* be included in the input anymore. 
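
The check described above can be pictured roughly like this. It is a minimal sketch in Python rather than the actual mate tools code: the `<root>` symbol and the helper name `ensure_root` are assumptions made for illustration only.

```python
ROOT = "<root>"  # hypothetical placeholder for the artificial root token


def ensure_root(forms, root=ROOT):
    """Return the token list with exactly one leading root token."""
    if forms and forms[0] == root:
        # The caller already supplied a root token: keep the list as it is.
        return list(forms)
    # No root token in the input: prepend one so index 0 is always the root.
    return [root] + list(forms)


forms = ["Der", "Buchstabe", "A"]

# A step that blindly skips index 0, assuming it is the root, loses "Der"
# when the input has no root token. That matches the symptom reported above.
dropped = forms[1:]

# After normalising, skipping index 0 removes only the artificial root.
kept = ensure_root(forms)[1:]
```

Because the root is added in one place only, callers can pass plain token lists and no longer need to know whether a root token is expected.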

I uploaded a new version 3.6. Do not use version 3.5 with the internal 
interfaces. 

Original comment by boh...@informatik.uni-stuttgart.de on 13 Nov 2013 at 5:22

GoogleCodeExporter commented 8 years ago

Original comment by boh...@informatik.uni-stuttgart.de on 13 Nov 2013 at 5:25