chewxy / lingo

package lingo provides the data structures and algorithms required for natural language processing
MIT License

Issue training on ud-treebanks #16

Open neetle opened 5 years ago

neetle commented 5 years ago

I'm trying to train on ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu, and I've noticed that every row whose head field has the value _ causes a panic.

Any ideas on how to deal with this data within the library? Happy to submit a PR on any advice given.


```
# sent_id = answers-20111108072305AAPJTjj_ans-0005
# text = It's more compact, ISO 6400 capability (SX40 only 3200), faster lens at f/2 and the SX40 only f/2.7.
1   It  it  PRON    PRP Case=Nom|Gender=Neut|Number=Sing|Person=3|PronType=Prs  4   nsubj   4:nsubj SpaceAfter=No
2   's  be  AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4   cop 4:cop   _
3   more    more    ADV RBR _   4   advmod  4:advmod    _
4   compact compact ADJ JJ  Degree=Pos  0   root    0:root  SpaceAfter=No
5   ,   ,   PUNCT   ,   _   8   punct   8:punct _
6   ISO iso NOUN    NN  Number=Sing 8   compound    8:compound  _
7   6400    6400    NUM CD  NumType=Card    6   nummod  6:nummod    _
8   capability  capability  NOUN    NN  Number=Sing 4   list    4:list  _
9   (   (   PUNCT   -LRB-   _   10  punct   10:punct|10.1:punct SpaceAfter=No
10  SX40    SX40    PROPN   NNP Number=Sing 8   parataxis   8:parataxis|10.1:nsubj  _
#dies on the following line
10.1    has have    VERB    VBZ _   _   _   8:parataxis CopyOf=-1
11  only    only    ADV RB  _   12  advmod  12:advmod   _
12  3200    3200    NUM CD  NumType=Card    10  orphan  10.1:obj    SpaceAfter=No
13  )   )   PUNCT   -RRB-   _   10  punct   10:punct|10.1:punct SpaceAfter=No
14  ,   ,   PUNCT   ,   _   8   punct   8:punct _
15  faster  faster  ADJ JJR Degree=Cmp  16  amod    16:amod _
16  lens    lens    NOUN    NN  Number=Sing 4   list    4:list  _
17  at  at  ADP IN  _   18  case    18:case _
18  f/2 f/2 NOUN    NN  Number=Sing 16  nmod    16:nmod:at  _
19  and and CCONJ   CC  _   21  cc  21:cc|21.1:cc   _
20  the the DET DT  Definite=Def|PronType=Art   21  det 21:det  _
21  SX40    SX40    PROPN   NNP Number=Sing 16  conj    16:conj:and|21.1:nsubj  _
21.1    has have    VERB    VBZ _   _   _   16:conj:and CopyOf=-1
22  only    only    ADJ JJ  Degree=Pos  23  amod    23:amod _
23  f   f   NOUN    NN  Number=Sing 21  orphan  21.1:obj    SpaceAfter=No
24  /   /   PUNCT   ,   _   23  punct   23:punct    SpaceAfter=No
25  2.7 2.7 NUM CD  NumType=Card    23  nummod  23:nummod   SpaceAfter=No
26  .   .   PUNCT   .   _   4   punct   4:punct _
```

neetle commented 5 years ago

I think this is relevant, but I'm at work so I can't verify that it's the case. If it is, I'll see if I can extend lingo to deal with gapping entries.

durp commented 5 years ago

Since these are "empty nodes", isn't the expedient thing to simply skip word lines with an unspecified (_) head?

neetle commented 5 years ago

Possibly. From what I can remember, they're actually implied in the sentence, so I'm not sure giving up that fidelity is worth it.

chewxy commented 4 years ago

I am so sorry for not responding. Apparently this library escaped my notifications, so I wasn't notified of incoming PRs and issues.