Closed TomazErjavec closed 2 years ago
TL;DR
Luka, you should sort morphological features with an attribute-based, case-insensitive sort, so `sorted()` should take the argument `key=lambda x: x.split('=')[0].casefold()`. Currently, I guess, they are just sorted.
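A minimal sketch of the suggested sort, assuming the features arrive as a CoNLL-U FEATS string with `|` as the separator. The function name `sort_feats` is hypothetical and is not taken from the classla code base.

```python
def sort_feats(feats: str) -> str:
    """Sort a CoNLL-U FEATS string by attribute name, ignoring case.

    Hypothetical helper for illustration only; not the classla implementation.
    """
    # "_" marks an empty FEATS column and must be left untouched.
    if not feats or feats == "_":
        return feats
    pairs = feats.split("|")
    # Sort on the attribute name only (the part before "="), case-insensitively.
    pairs.sort(key=lambda x: x.split("=")[0].casefold())
    return "|".join(pairs)


# A plain sort compares "F" with "b" and puts NumForm first;
# the case-insensitive sort puts Number first.
print("|".join(sorted("NumForm=Word|Number=Dual".split("|"))))
print(sort_feats("NumForm=Word|Number=Dual"))
```

The first call prints `NumForm=Word|Number=Dual` and the second prints `Number=Dual|NumForm=Word`, which is the order the validator asks for.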
Thank you, Tomaž, for these points. I will summarize:
Most of these were recently discussed on Redmine, most recently #1495. We are currently on the path of doing the following:
- Luka, we might improve this right away. This has been pending for some time and should be simple enough. I am pretty sure the features are already sorted, but not case-insensitively, as the script seems to expect (the linked rule says "and be sorted alphabetically by attribute names", which is underdefined).
This attribute sorting is performed when we train models. I modified the code so that the next time we train POS models, features will be sorted case-insensitively.
With classla v1.1.0 the test passed.
The UD project has a validation script, which I ran on the output of sl CLASSLA for the sentence "Iz teh dveh parametrov izračunamo še indeks telesne mase po enačbi: ITM = TT/TV2 [kg/m2]. (Glej preglednico 1.)", and it found some errors. Level 1 passes OK, but Level 2 complains:
The first one is pretty self-explanatory: CoNLL-U requires that features are sorted, so `Number=Dual` should come before `NumForm=Word`.
The second one complains because extended DepRels should have their parts separated by a colon, not an underscore, so `flat:foreign` and not `flat_foreign`. This is, I guess, my fault: in the TEI-annotated corpora the DepRel cannot contain a colon (it is a reference to an ID, and IDs cannot contain colons), so in my TEI corpora I unfortunately have to use an underscore. This convention was carried over into the CoNLL-U output, where it is wrong; there it should be a colon. Sorry for introducing this confusion. In case you are interested, we had a debate on how to still use colons in TEI, but the move wasn't successful...
I corrected the above 2 errors in my test file, and then ran Level 3 validations:
Here "=" has UPoS SYM, but has the SynRel punct, which is forbidden in UD. No idea how this can be solved or how it came about. I doubt it is the training data, as that passed UD validation.
Not directly related, but still: when running my conllu2tei script over the complete corpus, I also check that `/^[[:punct:]]+$/` tokens are not tagged with a weird UPOS (they should be either PUNCT or SYM), and it does find such cases (104 cases in a 250k corpus). Here are the first few:
If possible, it might be a good idea (as with closed-class words) to force the UPOS of punctuation tokens to PUNCT/SYM, and similarly to NUM for digits, as they are also sometimes tagged with other UPOS.
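The check described above could be sketched roughly as follows. This is not the conllu2tei script. The names `allowed_upos` and `find_suspicious` are hypothetical, and the POSIX class `[[:punct:]]` is approximated by the Unicode categories P (punctuation) and S (symbol), which is an assumption, since Python's `re` module does not support POSIX classes.

```python
import unicodedata


def allowed_upos(form):
    """Return the acceptable UPOS tags for a form, or None if unconstrained.

    Assumption: "punctuation" means every character is in a Unicode
    P* or S* category, a rough stand-in for POSIX [[:punct:]].
    """
    if form and all(unicodedata.category(ch)[0] in "PS" for ch in form):
        return {"PUNCT", "SYM"}
    if form.isdigit():
        return {"NUM"}
    return None


def find_suspicious(conllu_text):
    """List (id, form, upos) for tokens whose UPOS contradicts their form."""
    hits = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        allowed = allowed_upos(cols[1])
        if allowed is not None and cols[3] not in allowed:
            hits.append((cols[0], cols[1], cols[3]))
    return hits


# Tiny made-up sample: "=" tagged X and "1" tagged ADJ should be reported.
rows = [("1", "ITM", "PROPN"), ("2", "=", "X"), ("3", "1", "ADJ"), ("4", ".", "PUNCT")]
sample = "\n".join(
    "\t".join([i, form, form, upos, "_", "_", "0", "dep", "_", "_"])
    for i, form, upos in rows
)
print(find_suspicious(sample))
```

The same `allowed_upos` lookup could in principle be used to overwrite the tag after prediction instead of only reporting it, though choosing between PUNCT and SYM for a given token would still need a rule or a lexicon.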