Open kanayamah opened 1 year ago
@dan-zeman I understand that this is due to this commit, which was made so that the corpus passes validation against the restricted list of auxiliary lemmas. But it reduces the usability of the corpus: users now need to restore the lemmas from the MISC field.
Yes, I did it to let the Korean treebanks escape the NEGLECTED status and survive the next release. The `OrigLemma` attribute in MISC is intended only as a temporary measure, as I did not want to lose information that may be vital.
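For users who need the previous lemmas back in the meantime, the conversion from MISC could be sketched roughly as follows. This is only a minimal sketch, not part of the treebank tooling: the function name `restore_orig_lemma` is hypothetical, the sample token line is made up for illustration, and it assumes the attribute is stored as `OrigLemma=...` in a standard 10-column CoNLL-U line.

```python
def restore_orig_lemma(conllu_line: str, key: str = "OrigLemma") -> str:
    """Return the line with LEMMA replaced by the MISC attribute `key`, if present.

    Comment lines, blank lines and lines without the attribute are returned
    unchanged (minus the trailing newline).
    """
    line = conllu_line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.split("\t")
    if len(cols) != 10 or cols[9] == "_":
        return line

    orig = None
    kept = []
    for attr in cols[9].split("|"):
        name, sep, value = attr.partition("=")
        if sep and name == key:
            orig = value          # remember the original lemma
        else:
            kept.append(attr)     # keep every other MISC attribute

    if orig is None:
        return line
    cols[2] = orig                           # column 3 is LEMMA
    cols[9] = "|".join(kept) if kept else "_"  # column 10 is MISC
    return "\t".join(cols)


# Made-up token line, for illustration only.
sample = "3\t이라며\t이\tAUX\tVCP+EC\t_\t2\tcop\t_\tOrigLemma=이+라며|SpaceAfter=No"
print(restore_orig_lemma(sample))
```

Note that the restored lemmas will contain `+` again, so a file converted this way would presumably fail the auxiliary-lemma validation that motivated the change.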
A plus sign in LEMMA is problematic because it (probably) violates the principle that the lemma is an existing surface form of the lexeme, which is declared as the canonical/citation form in the paradigm. It rather suggests that there are two lemmas on one line, which may signal incomplete word segmentation. I cannot judge the consequences for XPOS tags because XPOS falls outside the UD guidelines, hence it is not documented in UD (at least I don't see it documented). Ideally there should be morphological features in FEATS that would explain the function of the suffix.
In fact, the lemma of the copula should probably be 이다; that seems to be the citation form that all sources use.
@dan-zeman first of all, thanks for your efforts to maintain the corpus :-)
I agree with you that the current Korean corpus (particularly the LEMMA and XPOS fields) is not compliant with the UD principles. A better representation without the `+` mark is desirable, but I think modifying the corpus only partially is really confusing.
BTW, the new proposal in this paper (Coling2022) and this repository looks reasonable.
The ending `라며` (어미, "verbal ending") is missing from the LEMMA of an AUX word. This causes a mismatch in the number of morphemes between the LEMMA and XPOS fields. Another case:
There are 125 cases in the whole corpus. It happens in AUX words including `이`, `있`, `없`, `싶`, `않`.
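A count like this can be reproduced with a short script. The following is a rough sketch rather than the script actually used for the number above: the function name `morpheme_mismatches` is hypothetical, the sample data is made up, and it assumes that `+` separates morphemes in both LEMMA and XPOS.

```python
def morpheme_mismatches(conllu_text: str, upos: str = "AUX"):
    """Return (id, form, lemma, xpos) for tokens with the given UPOS whose
    LEMMA and XPOS contain different numbers of '+'-separated parts."""
    found = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != upos:
            continue
        lemma, xpos = cols[2], cols[4]
        # Assumption: '+' is only ever a morpheme separator in these fields.
        if len(lemma.split("+")) != len(xpos.split("+")):
            found.append((cols[0], cols[1], lemma, xpos))
    return found


# Made-up two-token sample, for illustration only.
sample = (
    "# sent_id = demo\n"
    "1\t이라며\t이\tAUX\tVCP+EC\t_\t0\troot\t_\t_\n"
    "2\t있다\t있+다\tAUX\tVX+EF\t_\t1\taux\t_\t_\n"
)
for hit in morpheme_mismatches(sample):
    print(hit)
```

In this sample only the first token is reported, since its LEMMA has one part while its XPOS has two.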