UniversalDependencies / tools

Various utilities for processing the data.

Validation rule for Foreign feature #87

Closed by bguil 2 years ago

bguil commented 2 years ago

Running validate.py on some French data, I got the following error:

[L4 Morpho feature-upos-not-permitted] Feature Foreign is not permitted with UPOS X in language [br]

for the CoNLL line:

11  maen    maen    X   _   Foreign=Yes 10  appos   _   Lang=br|SpaceAfter=No

I think it would be sensible to allow the feature Foreign=Yes on the X tag, whatever the language.
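
For readers less familiar with the format, here is a minimal sketch of how the token line above splits into the ten CoNLL-U columns. It is not code from validate.py; the helper names `parse_attributes` and `parse_token_line` are made up for illustration, and the line is assumed to be tab-separated as in a real CoNLL-U file.

```python
# The ten CoNLL-U columns, in order (tab-separated on each token line).
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_attributes(field):
    """Turn 'A=1|B=2' into {'A': '1', 'B': '2'}; '_' means the field is empty."""
    if field == "_":
        return {}
    attributes = {}
    for item in field.split("|"):
        # partition() tolerates items without '=' (the value is then '').
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes


def parse_token_line(line):
    """Hypothetical helper: split one token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    token = dict(zip(COLUMNS, values))
    token["FEATS"] = parse_attributes(token["FEATS"])
    token["MISC"] = parse_attributes(token["MISC"])
    return token


# The token from the report, written with explicit tabs.
line = "11\tmaen\tmaen\tX\t_\tForeign=Yes\t10\tappos\t_\tLang=br|SpaceAfter=No"
token = parse_token_line(line)
print(token["UPOS"], token["FEATS"], token["MISC"])
```

Seen this way, the error message combines three pieces of the line: the UPOS column (X), the FEATS column (Foreign=Yes), and the Lang=br attribute in MISC, which is where the language [br] in the message comes from.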

dan-zeman commented 2 years ago

In general I agree (not only for the X tag but perhaps for any tag). But I am hesitant to hard-code it in the validator: checking the boxes in the form is not much work, and the feature is then neatly visible alongside all other features.

There is one issue though that I have not solved yet and that makes the feature Foreign special anyway. The attribute Lang=br in MISC indicates that morphological features in FEATS, if present, are Breton rather than French. However, the feature Foreign should probably be a (hard-coded) exception because:

tlynn747 commented 2 years ago

I might also suggest taking a look at the guideline suggestions for UGC (Section 4.7) in our recent journal article:

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations https://link.springer.com/content/pdf/10.1007/s10579-022-09581-9.pdf

dan-zeman commented 2 years ago

The validator should now judge the Foreign feature according to the main language of the corpus, regardless of Lang=xx in MISC.
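
The behavior described here can be pictured roughly as follows. This is only a sketch of the rule as stated in this thread, not the actual validate.py implementation: the function names and the `permitted` table (language, then feature name, then the set of UPOS tags it may occur with) are assumptions for illustration, and the real validator also checks feature values, which is omitted here.

```python
def language_for_feature(feature, main_lang, misc):
    """Pick the language whose feature rules apply to one feature.

    Foreign is always judged by the main language of the corpus;
    other features follow Lang=xx in MISC when it is present.
    """
    if feature == "Foreign":
        return main_lang
    return misc.get("Lang", main_lang)


def check_features(upos, feats, misc, main_lang, permitted):
    """Return error messages for features not permitted with this UPOS.

    `permitted` is a hypothetical table: language -> feature -> set of UPOS tags.
    """
    errors = []
    for name in feats:
        lang = language_for_feature(name, main_lang, misc)
        allowed_upos = permitted.get(lang, {}).get(name, set())
        if upos not in allowed_upos:
            errors.append(
                f"Feature {name} is not permitted with UPOS {upos} "
                f"in language [{lang}]"
            )
    return errors


# Illustrative table only: French allows Foreign on X, Breton lists nothing.
permitted = {"fr": {"Foreign": {"X"}}, "br": {}}

# The token from the original report, in a corpus whose main language is French.
errors = check_features(
    upos="X",
    feats={"Foreign": "Yes"},
    misc={"Lang": "br", "SpaceAfter": "No"},
    main_lang="fr",
    permitted=permitted,
)
print(errors)
```

Under these assumptions the reported token no longer produces an error, because Foreign is looked up under the main language rather than under Lang=br, while any other feature on the same token would still be checked against the Breton entries.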