LR-POR / cl-conllu

tool for working with conllu files in CL
Apache License 2.0

(parse) handle case where HEAD is empty #69

Closed odanoburu closed 5 years ago

odanoburu commented 5 years ago

UD treebanks must have non-empty HEADs, but that's not true for all CoNLL-U files.

vcvpaiva commented 5 years ago

Why would it make sense to accept an invalid file?

Is it because we also want to parse nominal phrases that are not full sentences, or fragments of sentences?

arademaker commented 5 years ago

Not really. Even an NP or a sentence fragment needs to be a rooted tree. Moreover, the CoNLL-U spec doesn’t allow an empty HEAD.

odanoburu commented 5 years ago

the CoNLL-U spec does allow empty HEADs:

Underscore (_) is used to denote unspecified values in all fields except ID. Note that no format-level distinction is made for the rare cases where the FORM or LEMMA is the literal underscore – processing in such cases is application-dependent. Further, in UD treebanks the UPOS, HEAD, and DEPREL columns are not allowed to be left unspecified.

for instance, I just tokenized and tagged a corpus, without parsing it syntactically: all of its HEADs are empty, and I can't read it. it is a valid CoNLL-U file, however.
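To make the distinction concrete, here is a minimal Python sketch of reading one token line while treating `_` in HEAD as "unspecified" rather than as an error. It is only an illustration of the rule quoted above, not cl-conllu's actual code (cl-conllu is written in Common Lisp); the function name `parse_token_line` and the dictionary layout are hypothetical.

```python
# Column names as defined by the CoNLL-U format (10 tab-separated fields).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

def parse_token_line(line: str) -> dict:
    """Parse one CoNLL-U token line into a dict (hypothetical helper).

    An underscore in HEAD is read as None instead of raising an error.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    token = dict(zip(FIELDS, columns))
    # FORM and LEMMA may be a literal underscore, so they are left untouched.
    for name in ("upos", "xpos", "feats", "head", "deprel", "deps", "misc"):
        if token[name] == "_":
            token[name] = None
    # HEAD is an integer index only when it is actually specified.
    if token["head"] is not None:
        token["head"] = int(token["head"])
    return token

tagged_only = parse_token_line("1\tThe\tthe\tDET\tDT\t_\t_\t_\t_\t_")
parsed = parse_token_line("2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_")
```

With this approach a tokenized and tagged file still loads, and whether the HEADs are required (as in UD treebanks) becomes a separate validation step rather than a parsing failure.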

arademaker commented 5 years ago

Well, if you run Stanford CoreNLP, asking only for tokenization and sentence splitting with CoNLL-U output, you will get:

1   The _   _   _   _   _   _   _   _
2   experiments _   _   _   _   _   _   _   _
3   presented   _   _   _   _   _   _   _   _
4   here    _   _   _   _   _   _   _   _
5   demonstrate _   _   _   _   _   _   _   _
6   other   _   _   _   _   _   _   _   _
7   possible    _   _   _   _   _   _   _   _
8   scenarios   _   _   _   _   _   _   _   _
9   combining   _   _   _   _   _   _   _   _
10  folding _   _   _   _   _   _   _   _
11  and _   _   _   _   _   _   _   _
12  thrusting   _   _   _   _   _   _   _   _
13  to  _   _   _   _   _   _   _   _
14  obtain  _   _   _   _   _   _   _   _
15  simulta _   _   _   _   _   _   _   _
16  -   _   _   _   _   _   _   _   _
17  neous   _   _   _   _   _   _   _   _
18  salt    _   _   _   _   _   _   _   _
19  extrusion   _   _   _   _   _   _   _   _
20  and _   _   _   _   _   _   _   _
21  sediment    _   _   _   _   _   _   _   _
22  incorporation   _   _   _   _   _   _   _   _
23  within  _   _   _   _   _   _   _   _
24  salt    _   _   _   _   _   _   _   _
25  .   _   _   _   _   _   _   _   _

1   On  _   _   _   _   _   _   _   _
2   such    _   _   _   _   _   _   _   _
3   a   _   _   _   _   _   _   _   _
4   basis   _   _   _   _   _   _   _   _
5   ,   _   _   _   _   _   _   _   _
6   we  _   _   _   _   _   _   _   _
7   propose _   _   _   _   _   _   _   _
8   an  _   _   _   _   _   _   _   _
9   alternative _   _   _   _   _   _   _   _
10  interpretation  _   _   _   _   _   _   _   _
11  -LRB-   _   _   _   _   _   _   _   _
12  Fig.    _   _   _   _   _   _   _   _
13  20c _   _   _   _   _   _   _   _
14  -RRB-   _   _   _   _   _   _   _   _
15  .   _   _   _   _   _   _   _   _

So I am now convinced that we need to relax the read-conllu function further. But I would prefer a more robust change that also handles the output above.
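As a rough illustration of what "more robust" could mean for output like the one above, here is a small Python sketch that groups token lines into sentences and reports whether a sentence carries a dependency tree at all. It is not the change made in cl-conllu; the names `read_sentences` and `has_dependency_tree` are hypothetical, and the sketch assumes tab-separated columns as the CoNLL-U format specifies.

```python
from typing import Iterator, List

HEAD = 6  # zero-based index of the HEAD column in a 10-column CoNLL-U line

def read_sentences(text: str) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of column lists; blank lines separate them."""
    sentence: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # comment and metadata lines are skipped in this sketch
        else:
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence

def has_dependency_tree(sentence: List[List[str]]) -> bool:
    """True only if every plain word row has a specified HEAD."""
    # Rows with range or decimal IDs (multiword tokens, empty nodes) are ignored.
    words = [cols for cols in sentence if cols[0].isdigit()]
    return bool(words) and all(
        len(cols) == 10 and cols[HEAD] != "_" for cols in words
    )

text = (
    "1\tOn\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tsuch\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_\n"
)
sentences = list(read_sentences(text))
flags = [has_dependency_tree(s) for s in sentences]
```

A reader built this way can accept tokenization-only files and leave it to the caller to decide whether a missing tree is acceptable for the task at hand.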

arademaker commented 5 years ago

Commit a7d497a was inspired by this PR. Thank you.