odanoburu closed 5 years ago
Why would it make sense to accept an invalid file?
Because we also want to parse nominal phrases that are not full sentences, or sentence fragments?
Not really. Even an NP or a sentence fragment needs to be a rooted tree. Moreover, the CoNLL-U spec doesn't allow an empty HEAD.
The CoNLL-U spec does allow empty HEADs:
Underscore (_) is used to denote unspecified values in all fields except ID. Note that no format-level distinction is made for the rare cases where the FORM or LEMMA is the literal underscore – processing in such cases is application-dependent. Further, in UD treebanks the UPOS, HEAD, and DEPREL columns are not allowed to be left unspecified.
For instance, I just tokenized and tagged a corpus without parsing it syntactically: all of its HEADs are empty, and I can't read it. It is a valid CoNLL-U file, however.
Well, if you run Stanford CoreNLP with tokenization and sentence splitting only and ask for CoNLL-U output, you will get:
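To make the point concrete, here is a minimal Python sketch of a lenient token-line reader. It is not the library's read-conllu function; the name parse_token_line and the dict layout are hypothetical. It assumes tab-separated columns, and it keeps FORM and LEMMA verbatim because the spec leaves the literal-underscore case to the application.

```python
# Hypothetical helper for illustration; not part of any existing library.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line):
    """Parse one CoNLL-U token line, mapping unspecified values to None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(FIELDS):
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    token = dict(zip(FIELDS, cols))
    # "_" means "unspecified". ID is never unspecified; FORM and LEMMA are
    # left as-is here (an assumption), since "_" may be a literal there.
    for name in ("upos", "xpos", "feats", "head", "deprel", "deps", "misc"):
        if token[name] == "_":
            token[name] = None
    # Only convert HEAD when it is actually present.
    if token["head"] is not None:
        token["head"] = int(token["head"])
    return token


if __name__ == "__main__":
    print(parse_token_line("1\tThe\t_\t_\t_\t_\t_\t_\t_\t_"))
```

With this approach a tokenized-only file is read without error, and the caller can see which tokens have no HEAD.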
1 The _ _ _ _ _ _ _ _
2 experiments _ _ _ _ _ _ _ _
3 presented _ _ _ _ _ _ _ _
4 here _ _ _ _ _ _ _ _
5 demonstrate _ _ _ _ _ _ _ _
6 other _ _ _ _ _ _ _ _
7 possible _ _ _ _ _ _ _ _
8 scenarios _ _ _ _ _ _ _ _
9 combining _ _ _ _ _ _ _ _
10 folding _ _ _ _ _ _ _ _
11 and _ _ _ _ _ _ _ _
12 thrusting _ _ _ _ _ _ _ _
13 to _ _ _ _ _ _ _ _
14 obtain _ _ _ _ _ _ _ _
15 simulta _ _ _ _ _ _ _ _
16 - _ _ _ _ _ _ _ _
17 neous _ _ _ _ _ _ _ _
18 salt _ _ _ _ _ _ _ _
19 extrusion _ _ _ _ _ _ _ _
20 and _ _ _ _ _ _ _ _
21 sediment _ _ _ _ _ _ _ _
22 incorporation _ _ _ _ _ _ _ _
23 within _ _ _ _ _ _ _ _
24 salt _ _ _ _ _ _ _ _
25 . _ _ _ _ _ _ _ _

1 On _ _ _ _ _ _ _ _
2 such _ _ _ _ _ _ _ _
3 a _ _ _ _ _ _ _ _
4 basis _ _ _ _ _ _ _ _
5 , _ _ _ _ _ _ _ _
6 we _ _ _ _ _ _ _ _
7 propose _ _ _ _ _ _ _ _
8 an _ _ _ _ _ _ _ _
9 alternative _ _ _ _ _ _ _ _
10 interpretation _ _ _ _ _ _ _ _
11 -LRB- _ _ _ _ _ _ _ _
12 Fig. _ _ _ _ _ _ _ _
13 20c _ _ _ _ _ _ _ _
14 -RRB- _ _ _ _ _ _ _ _
15 . _ _ _ _ _ _ _ _
So I am now convinced that we need to relax the read-conllu function further. But I would prefer a more robust change that deals with the output above.
Commit a7d497a was inspired by this PR. Thank you.
UD treebanks must have non-empty HEADs, but that's not true for all CoNLL-U files.
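One way to express that distinction is an opt-in strictness flag, sketched below in Python. The function check_tokens, the ud_strict parameter, and the token dicts are all hypothetical and are not taken from commit a7d497a. The sketch assumes unspecified values are stored as None or "_", and it skips multiword-token ranges and empty nodes, whose HEAD is legitimately "_".

```python
# Hypothetical validation sketch; names and token layout are assumptions.
UD_REQUIRED = ("upos", "head", "deprel")


def check_tokens(tokens, ud_strict=False):
    """Return a list of error messages; empty when the tokens are acceptable."""
    if not ud_strict:
        # Plain CoNLL-U: unspecified UPOS, HEAD, and DEPREL are allowed.
        return []
    errors = []
    for tok in tokens:
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if "-" in tok["id"] or "." in tok["id"]:
            continue
        for name in UD_REQUIRED:
            if tok.get(name) in (None, "_"):
                errors.append(f"token {tok['id']}: {name.upper()} is unspecified")
    return errors


if __name__ == "__main__":
    tokens = [{"id": "1", "form": "The", "upos": None, "head": None, "deprel": None}]
    print(check_tokens(tokens))                  # lenient: no errors
    print(check_tokens(tokens, ud_strict=True))  # strict: three errors
```

Keeping the lenient mode as the default lets tokenized-only files load, while treebank maintainers can still ask for the UD-level checks.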