Closed martinpopel closed 7 years ago
Thanks, Martin. Hopefully, this can be fixed for the next release.
Thank you Martin for your message and sorry for the late reply!
The reason is that I missed the deadline for the previous release. My email address was either inactive or disabled on the stp-email-list, so I was not informed about UD's early spring release until the afternoon of the last day.
Thanks to Dan, Filip, and yourself, we managed to keep Persian alive for this release, but only by converting version 1.4 to 2.0. In other words, none of my updates, including the completed lemmas and some fixed errors, are present in this version. Due to time constraints, I couldn't fix the errors in time after validation. I am sure you were not aware of this, but Dan and Filip are more closely involved in the process!
However, the good thing is that these will be taken care of in the next release! Thanks again!
That's great, thanks. (Dan informed me in the meantime.) So I am closing this issue. (I had suggestions for improvements in several UD treebanks based on reviewing CoNLL2017 papers, so I decided to write them up as GitHub issues so that I don't forget them.)
Was this error fixed? I downloaded the latest version and I found this:
sent_id = train-s1739
text = اگر بر تمام موانع از ابتدا غلبه میشد، هیچ کوششی انجام نمیگرفت.
translit = āgr br tmām mūānʿ az abtdā ġlbh mīšd, hīč kūššī anjām nmīgrft.
Word 13 is نمیگرفت but with lemma _
The lemmatizer also gives word 8, میشد, the lemma کرد
In UD_Persian 2.0, about 6% of words have empty lemma (underscore in CoNLL-U). What is the reason?
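A figure like the 6% above can be checked with a short script. The following is a minimal sketch, not the method actually used for the number in this issue: the function name `empty_lemma_ratio` and the sample sentence are made up for illustration, and it assumes standard 10-column, tab-separated CoNLL-U input.

```python
def empty_lemma_ratio(conllu_text):
    """Return (empty, total): word lines with LEMMA '_' and all word lines."""
    empty = total = 0
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        total += 1
        # A literal underscore token (FORM '_') is not a missing lemma.
        if cols[2] == "_" and cols[1] != "_":
            empty += 1
    return empty, total


# Hypothetical three-word sample; the first word has no lemma.
sample = "\n".join([
    "# sent_id = demo-1",
    "1\tcats\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

empty, total = empty_lemma_ratio(sample)
print(f"{empty} of {total} words have an empty lemma")
```

To run it on a real treebank, read the `.conllu` file as UTF-8 text and pass its contents to the function.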
If these are punctuation symbols or foreign-origin/proper/indeclinable/whatever words where lemmatization is not considered linguistically interesting, the lemma should be equal to the form. If these are missing annotations planned for the future, I would still prefer to approximate the lemma automatically (at least with the word form) and mark the nodes as ToDo=fix-lemma in MISC.
Motivation: This causes problems for parsers, which expect either no lemmas at all or a reasonable number of word forms per lemma. Ideally, the word form should be uniquely determined by the lemma and FEATS (except for capitalization and other orthographic variants).
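The suggested fallback (copy the word form into an empty LEMMA and flag the node with ToDo=fix-lemma in MISC) could look roughly like this. It is only a sketch under assumptions: the function name `fill_missing_lemmas` is hypothetical, the input is assumed to be 10-column tab-separated CoNLL-U, and the form is copied unchanged with no normalization.

```python
def fill_missing_lemmas(conllu_text, marker="ToDo=fix-lemma"):
    """Copy FORM into an empty LEMMA and add `marker` to the MISC column."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Only plain word lines: 10 columns, integer ID, not a comment.
        is_word = (
            len(cols) == 10
            and not line.startswith("#")
            and cols[0].isdigit()
        )
        # A literal underscore token (FORM '_') already has its lemma.
        if is_word and cols[2] == "_" and cols[1] != "_":
            cols[2] = cols[1]  # approximate the lemma with the word form
            misc = [] if cols[9] == "_" else cols[9].split("|")
            if marker not in misc:
                misc.append(marker)
            cols[9] = "|".join(misc)
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")


# Hypothetical two-word example; only the first word lacks a lemma.
src = (
    "1\tcats\t_\tNOUN\t_\t_\t2\tnsubj\t_\tSpaceAfter=No\n"
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
print(fill_missing_lemmas(src), end="")
```

Because the marker stays in MISC, the approximated lemmas remain easy to find and replace once the manual annotation is ready.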