clarinsi / classla

CLASSLA Fork of the Official Stanford NLP Python Library for Many Human Languages
https://www.clarin.si/info/k-centre/

UD validate errors #10

Closed TomazErjavec closed 2 years ago

TomazErjavec commented 3 years ago

The UD project has a validation script, which I ran on the output of the sl CLASSLA pipeline for the sentence "Iz teh dveh parametrov izračunamo še indeks telesne mase po enačbi: ITM = TT/TV2 [kg/m2]. (Glej preglednico 1.)", and it found some errors. Level 1 passes OK, but Level 2 complains:

python3 tools/validate.py --lang=sl --level 2 test2.conll
[Line 6 Sent 1.1]: [L2 Morpho unsorted-features] Morphological features must be sorted: 'Case=Gen|Gender=Masc|NumForm=Word|NumType=Card|Number=Dual'.
[Line 18 Sent 1.1]: [L2 Syntax invalid-deprel] Invalid DEPREL value 'flat_foreign'.
[Line 18 Sent 1.1]: [L2 Syntax unknown-deprel] Unknown DEPREL label: 'flat_foreign'
Morpho errors: 1
Syntax errors: 2
*** FAILED *** with 3 errors

The first one is pretty self-explanatory: CoNLL-U requires that features be sorted, so Number=Dual should come before NumForm=Word.

The second one complains because extended DepRels should have their parts separated by a colon, not an underscore, so flat:foreign and not flat_foreign. This is, I guess, my fault: in the TEI-annotated corpora the DepRel cannot contain a colon (it is a reference to an ID, and IDs cannot contain colons), so in my TEI corpora I unfortunately have to use an underscore. This principle got carried over into the CoNLL-U output, where it is wrong; there it should be a colon. Sorry for introducing this confusion. In case you are interested, we had a debate on how to still use colons in TEI, but the move wasn't successful...
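A minimal sketch of how the underscore could be turned back into a colon as a post-processing step on CoNLL-U output. The function name `fix_deprel` is hypothetical and not part of CLASSLA; it assumes tab-separated, 10-column token lines without a trailing newline and only touches the DEPREL column (the DEPS column is left alone). The example row is the `=(` token listed further down in this issue.

```python
def fix_deprel(line: str) -> str:
    """Replace the first underscore in the DEPREL column with a colon.

    Hypothetical helper: expects one CoNLL-U line without a trailing newline.
    """
    if not line or line.startswith("#"):
        return line  # blank lines and comment lines are passed through
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # not a regular token line, leave untouched
    deprel = cols[7]
    # A bare "_" means "no value" in CoNLL-U and must not be changed.
    if deprel != "_" and "_" in deprel:
        cols[7] = deprel.replace("_", ":", 1)
    return "\t".join(cols)


row = "31\t=(\t=(\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t32\tflat_foreign\t_\tNER=O|SpaceAfter=No"
print(fix_deprel(row).split("\t")[7])  # flat:foreign
```

Replacing only the first underscore is a deliberate assumption: it maps `flat_foreign` to `flat:foreign` without guessing about labels that might contain further underscores.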

I corrected the above 2 errors in my test file, and then ran Level 3 validations:

[Line 17 Sent 1.1 Node 14]: [L3 Syntax rel-upos-punct] 'punct' must be 'PUNCT' but it is 'SYM'
Syntax errors: 1
*** FAILED *** with 1 errors

Here "=" has UPoS SYM, but has the SynRel punct, which is forbidden in UD. No idea how this can be solved or how it came about. I doubt it is the training data, as that passed UD validation.
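For anyone who wants to locate such tokens before running the full validator, here is a rough sketch that mirrors only the one rule quoted above (a token with relation punct must have UPOS PUNCT). The function name `punct_mismatches` is hypothetical, and the sample row is an illustrative reconstruction rather than the actual parser output: the head and the remaining columns are made up.

```python
def punct_mismatches(conllu: str):
    """Yield (id, form, upos, deprel) for tokens attached as 'punct' whose UPOS is not PUNCT."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip anything that is not a token line
        tok_id, form, upos, deprel = cols[0], cols[1], cols[3], cols[7]
        # Compare only the universal part of the relation (before any ':').
        if deprel.split(":")[0] == "punct" and upos != "PUNCT":
            yield tok_id, form, upos, deprel


# Illustrative row only: head and other columns are assumptions.
sample = "14\t=\t=\tSYM\t_\t_\t13\tpunct\t_\t_"
print(list(punct_mismatches(sample)))  # [('14', '=', 'SYM', 'punct')]
```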

Not directly related, but still: when running my conllu2tei script over the complete corpus, I also check that /^[[:punct:]]+$/ tokens are not tagged with a weird UPOS (they should be either PUNCT or SYM), and it does find such cases (104 cases in a 250k corpus). Here are the first few:

WARN: changing UPOS to PUNCT for
24      ...     ...     X       Y       Abbr=Yes        2       punct   _       NER=O
WARN: changing UPOS to PUNCT for
12      (%)     (%)     NUM     Mdc     NumForm=Digit|NumType=Card      11      nummod  _       NER=O|SpaceAfter=No
WARN: changing UPOS to PUNCT for
16      ...     ...     X       Y       Abbr=Yes        15      nmod    _       NER=O|SpaceAfter=No
WARN: changing UPOS to PUNCT for
18      ]       ]       NOUN    Ncmsn   Case=Nom|Gender=Masc|Number=Sing        13      conj    _       NER=O|SpaceAfter=No
WARN: changing UPOS to SYM for
31      =(      =(      NOUN    Ncmsn   Case=Nom|Gender=Masc|Number=Sing        32      flat_foreign    _       NER=O|SpaceAfter=No
WARN: changing UPOS to PUNCT for
45      ))      ))      NOUN    Ncmsn   Case=Nom|Gender=Masc|Number=Sing        43      appos   _       NER=O|SpaceAfter=No
WARN: changing UPOS to PUNCT for
25      ]       ]       NOUN    Ncmsn   Case=Nom|Gender=Masc|Number=Sing        20      nmod    _       NER=O|SpaceAfter=No
WARN: changing UPOS to SYM for
18      =(      =(      NOUN    Ncmsn   Case=Nom|Gender=Masc|Number=Sing        19      nmod    _       NER=O|SpaceAfter=No

If possible, it might be a good idea (as with closed-class words) to force the UPOS of punctuation tokens to PUNCT/SYM, and similarly to NUM for digits, as they are also sometimes tagged with other UPOS.
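The check described above could look roughly like the following. This is not the conllu2tei script (which is not shown here) but a sketch under stated assumptions: punctuation is approximated by ASCII `string.punctuation`, digits by ASCII-only digit strings, and the function only reports suspicious tokens instead of deciding between PUNCT and SYM. The name `suspicious_upos` is hypothetical.

```python
import re
import string

# Tokens made up entirely of ASCII punctuation characters (an approximation
# of the /^[[:punct:]]+$/ test mentioned above).
PUNCT_ONLY = re.compile(r"^[" + re.escape(string.punctuation) + r"]+$")


def suspicious_upos(conllu: str):
    """Yield (id, form, upos, expected) for tokens whose UPOS looks wrong."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, form, upos = cols[0], cols[1], cols[3]
        if PUNCT_ONLY.match(form) and upos not in {"PUNCT", "SYM"}:
            yield tok_id, form, upos, "PUNCT or SYM"
        elif form.isascii() and form.isdigit() and upos != "NUM":
            yield tok_id, form, upos, "NUM"


# One of the rows listed above.
row = "18\t]\t]\tNOUN\tNcmsn\tCase=Nom|Gender=Masc|Number=Sing\t13\tconj\t_\tNER=O|SpaceAfter=No"
print(list(suspicious_upos(row)))  # [('18', ']', 'NOUN', 'PUNCT or SYM')]
```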

nljubesi commented 3 years ago

TL;DR Luka, you should sort the morphological features case-insensitively by attribute name, so sorted() should be called with the argument key=lambda x: x.split('=')[0].casefold(). Currently, I guess, they are just sorted with the default (case-sensitive) ordering.
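A small sketch of that suggestion applied to the FEATS string from the validator message above. The wrapper name `sort_feats` is hypothetical; only the `key` function comes from this comment.

```python
def sort_feats(feats: str) -> str:
    """Sort a CoNLL-U FEATS string case-insensitively by attribute name."""
    if feats == "_":
        return feats  # empty FEATS column stays as-is
    pairs = feats.split("|")
    return "|".join(sorted(pairs, key=lambda x: x.split("=")[0].casefold()))


feats = "Case=Gen|Gender=Masc|NumForm=Word|NumType=Card|Number=Dual"
print(sort_feats(feats))
# Case=Gen|Gender=Masc|Number=Dual|NumForm=Word|NumType=Card
```

With a plain `sorted()` the uppercase `F` and `T` in `NumForm` and `NumType` sort before the lowercase `b` in `Number`, which gives the order the validator rejected.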

Thank you, Tomaž, for these points. I will summarize:

  1. morph features are sometimes not sorted (at least not in a way the validate.py script expects)
  2. underscore is used in extended DepRels, not colons
  3. we are breaking some UD:UPOS rules
  4. we lack rules for punctuation

Most of these were recently discussed on Redmine, most recently #1495. We are currently on the path of doing the following:

  1. Luka, we might improve this right away. This is pending for some time and should be simple enough. I am pretty sure the features are sorted already, but not case insensitive, as it seems that the script expects (the linked rule says "and be sorted alphabetically by attribute names", underdefined)
  2. While preparing new models for parsing we should transform all UD tags so that underscore becomes colon. To be done in the following months, whenever the need for new models arises.
  3. I do not think we will do anything about this soon. It is niche and could also be corrected in post-processing by those who require it.
  4. This is actually already implemented in the recent commits in the main (not master) branch. Should be available via pip in the following days.

lkrsnik commented 3 years ago

> 1. Luka, we might improve this right away. This is pending for some time and should be simple enough. I am pretty sure the features are sorted already, but not case insensitive, as it seems that the script expects (the linked rule says "and be sorted alphabetically by attribute names", underdefined)

This attribute sorting is performed when we train models. I modified the code so that the next time we train POS models, features will be sorted case-insensitively.

lkrsnik commented 2 years ago

With classla v1.1.0 the test passed.