UniversalDependencies / UD_Persian-Seraji

Empty lemmas of some words #3

Closed martinpopel closed 7 years ago

martinpopel commented 7 years ago

In UD_Persian 2.0, about 6% of words have an empty lemma (an underscore in CoNLL-U). What is the reason?

If these are punctuation symbols or foreign-origin/proper/indeclinable/whatever words where lemmatization is not considered interesting from the linguistic point of view, the lemma should be equal to the form. If these are missing annotations that are planned for the future, I would still prefer to approximate the lemma automatically (at least with the word form) and mark the nodes as ToDo=fix-lemma in MISC.
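The fallback described above could be sketched roughly as follows. This is only an illustration, not a script used for the treebank: the function name `fill_empty_lemmas` and the toy rows are made up, and it assumes the standard 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, ..., MISC).

```python
def fill_empty_lemmas(conllu_text):
    """Copy FORM into empty LEMMA fields and flag them in MISC.

    Returns (patched_text, n_words, n_patched).
    """
    out_lines = []
    n_words = 0
    n_patched = 0
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Comments, blank lines and anything that is not a 10-column
        # token line are passed through unchanged.
        if line.startswith("#") or len(cols) != 10:
            out_lines.append(line)
            continue
        # Multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        # are not counted as syntactic words and are left alone.
        if "-" in cols[0] or "." in cols[0]:
            out_lines.append(line)
            continue
        n_words += 1
        form, lemma, misc = cols[1], cols[2], cols[9]
        # A literal underscore token (FORM == "_") keeps its lemma.
        if lemma == "_" and form != "_":
            cols[2] = form
            flag = "ToDo=fix-lemma"
            cols[9] = flag if misc == "_" else misc + "|" + flag
            n_patched += 1
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines), n_words, n_patched


# Toy input; the UPOS/HEAD/DEPREL values are invented for the example.
toy = "\n".join([
    "# sent_id = toy-1",
    "\t".join(["1", "نمی‌گرفت", "_", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
])
patched, n_words, n_patched = fill_empty_lemmas(toy)
print(f"{n_patched} of {n_words} words had an empty lemma")
```

The returned counts give the share of empty lemmas directly, and the ToDo flag keeps the approximated lemmas easy to find later.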

Motivation: This causes problems for parsers, which expect either no lemmas at all, or a reasonable number of word forms per lemma. Ideally, the word form should be uniquely determined by the lemma and FEATS (except for capitalization and other orthographic variants).
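One rough way to see how far a treebank is from that ideal is to group word forms by their (LEMMA, FEATS) pair and list the pairs that map to more than one form. The sketch below is an assumption-laden illustration (the helper name `forms_per_lemma_feats` is made up, and lowercasing is only a crude stand-in for ignoring capitalization); with empty lemmas, all words sharing the same FEATS collapse under the lemma `_`.

```python
from collections import defaultdict


def forms_per_lemma_feats(conllu_text):
    """Map each (LEMMA, FEATS) pair with more than one distinct form
    to the sorted list of those forms."""
    groups = defaultdict(set)
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Skip comments, blank lines and malformed rows.
        if line.startswith("#") or len(cols) != 10:
            continue
        # Skip multiword-token ranges and empty nodes.
        if "-" in cols[0] or "." in cols[0]:
            continue
        form, lemma, feats = cols[1], cols[2], cols[5]
        # Lowercase so that capitalization alone does not count
        # as a different form.
        groups[(lemma, feats)].add(form.lower())
    return {key: sorted(forms) for key, forms in groups.items()
            if len(forms) > 1}
```

A large entry under the key `("_", ...)` would be the symptom described in this issue.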

jnivre commented 7 years ago

Thanks, Martin. Hopefully, this can be fixed for the next release.

mojgan-seraji commented 7 years ago

Thank you Martin for your message and sorry for the late reply!

The reason is that I missed the deadline for the previous release. My email address was apparently either dormant or disabled in the stp-email-list, so I was not informed about UD's early release in spring until the afternoon of the last day.

Thanks to Dan, Filip, and yourself, we managed to keep Persian alive for this release, but only by converting version 1.4 to 2.0. In other words, all my updates, including the completed lemmas and some fixed errors, are not present in this version. Due to time constraints, I couldn't fix the errors in time after validation. I am sure you were not aware of this, but Dan and Filip are more involved in the process!

However, the good thing is that these will be taken care of in the next release! Thanks again!

martinpopel commented 7 years ago

That's great, thanks. (Dan informed me in the meantime.) So I am closing this issue. (I had suggestions for improvements in several UD treebanks based on reviewing CoNLL2017 papers, so I decided to write them up as GitHub issues so that I don't forget them.)

jhdeov commented 1 month ago

Was this error fixed? I downloaded the latest version and I found this:

sent_id = train-s1739
text = اگر بر تمام موانع از ابتدا غلبه می‌شد، هیچ کوششی انجام نمی‌گرفت.
translit = āgr br tmām mūānʿ az abtdā ġlbh mīšd, hīč kūššī anjām nmīgrft.

Word 13 is نمی‌گرفت, but its lemma is _.

The lemmatizer also assigns word 8, می‌شد, the lemma کرد.