Missing morphemes in lemma field of AUX

UniversalDependencies / UD_Korean-GSD

Korean UD Treebank.

Issue #4

Open kanayamah opened 1 year ago

kanayamah commented 1 year ago

The ending (어미) 라며 is missing from the lemma of an AUX word. This causes a mismatch in the number of morphemes between the LEMMA and XPOS fields.

# sent_id = train-s14
20  이라며 이   AUX VCP+EC  _   18  cop

Another case:

# sent_id = train-s107
인   이   AUX VCP+ETM _   18  cop

There are 125 such cases in the whole corpus. It happens with AUX words whose lemmas include 이, 있, 없, 싶, and 않.
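The mismatch can be checked mechanically. The following is only a minimal sketch, assuming standard tab-separated 10-column CoNLL-U and assuming that morphemes in both LEMMA and XPOS are joined with `+`. The function names and the sample text are hypothetical; the DEPS and MISC columns in the sample are filled with `_` as placeholders, and the second token line (lemma `이+라며`) is an invented well-formed counterpart, not a line from the treebank.

```python
def morpheme_mismatch(lemma: str, xpos: str) -> bool:
    """Return True when LEMMA and XPOS have a different number of '+'-separated parts."""
    return len(lemma.split("+")) != len(xpos.split("+"))


def find_aux_mismatches(conllu_text: str):
    """Collect (sent_id, id, form, lemma, xpos) for AUX tokens with a mismatch."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token line
        tok_id, form, lemma, upos, xpos = cols[:5]
        if upos == "AUX" and morpheme_mismatch(lemma, xpos):
            hits.append((sent_id, tok_id, form, lemma, xpos))
    return hits


# Hypothetical sample: the first token mirrors the train-s14 example,
# the second is an invented line whose lemma keeps both morphemes.
SAMPLE = (
    "# sent_id = train-s14\n"
    "20\t이라며\t이\tAUX\tVCP+EC\t_\t18\tcop\t_\t_\n"
    "21\t이라며\t이+라며\tAUX\tVCP+EC\t_\t18\tcop\t_\t_\n"
)

for hit in find_aux_mismatches(SAMPLE):
    print(hit)
```

On the sample, only token 20 is reported, because its lemma has one part while its XPOS has two.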

kanayamah commented 1 year ago

@dan-zeman I understand that this is due to this commit, which was made so that validation against the limited list of auxiliary lemmas passes. But it reduces the usability of the corpus: users now need to restore the lemma from the MISC field.

dan-zeman commented 1 year ago

Yes, I did it to let the Korean treebanks escape the NEGLECTED status and survive the next release. The OrigLemma attribute in MISC is intended only as a temporary measure, as I did not want to lose information that may be vital.
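As a rough illustration of the conversion users currently have to do, here is a minimal sketch that copies `OrigLemma` from MISC back into LEMMA. It assumes tab-separated 10-column CoNLL-U and assumes the attribute is stored in the usual `Key=Value` form with `|` between attributes. The function name `restore_lemma` and the sample value `OrigLemma=이+라며` are hypothetical and not taken from the treebank.

```python
def restore_lemma(line: str) -> str:
    """Move OrigLemma from MISC back into LEMMA; return other lines unchanged."""
    cols = line.split("\t")
    if len(cols) != 10 or cols[9] == "_":
        return line  # comment, blank line, or token without MISC attributes
    kept = []
    orig = None
    for item in cols[9].split("|"):
        key, sep, value = item.partition("=")
        if key == "OrigLemma" and sep:
            orig = value
        else:
            kept.append(item)
    if orig is None:
        return line
    cols[2] = orig                             # LEMMA column
    cols[9] = "|".join(kept) if kept else "_"  # MISC column without OrigLemma
    return "\t".join(cols)


# Hypothetical token line; the OrigLemma value is an assumption for illustration.
token = "20\t이라며\t이\tAUX\tVCP+EC\t_\t18\tcop\t_\tOrigLemma=이+라며"
print(restore_lemma(token))
```

Other MISC attributes are preserved, and lines without `OrigLemma` pass through untouched, so the function can be mapped over a whole file.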

A plus sign in LEMMA is problematic because it (probably) violates the principle that the lemma is an existing surface form of the lexeme, which is declared as the canonical/citation form in the paradigm. It rather suggests that there are two lemmas on one line, which may signal incomplete word segmentation. I cannot judge the consequences for XPOS tags because XPOS falls outside the UD guidelines, hence it is not documented in UD (at least I don't see it documented). Ideally there should be morphological features in FEATS that would explain the function of the suffix.

dan-zeman commented 1 year ago

In fact, the lemma of the copula should probably be 이다; that seems to be the citation form that all sources use.

kanayamah commented 1 year ago

@dan-zeman first of all, thanks for your effort to maintain the corpus :-)
I agree with you that the current Korean corpus (particularly the LEMMA and XPOS fields) is not compliant with the UD principles. A better representation without the + mark is desirable, but I think modifying the corpus only partially is really confusing.

BTW, the new proposal in this paper (COLING 2022) and this repository looks reasonable.