Closed by kscanne 3 years ago
I'd recommend keeping prefixes in the lemmas, as in this patch. Advantages: (1) it is by far the easiest thing for annotators; (2) it is most consistent with the existing treebank (e.g. for 64 of 74 words starting with neamh-, the prefix is currently preserved in the lemma); (3) it is most consistent with how Irish dictionaries handle prefixes, for the most part. FGB has "fo-alt" as a headword, for example. It doesn't have "príomhfhoirgneamh" but has "príomhaidhm", "príomh-aire", "príomhaisteoir", ... about 50 in all.
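For anyone who wants to re-check a count like the neamh- one, here is a minimal sketch. It assumes the treebank is in standard 10-column CoNLL-U (FORM in column 2, LEMMA in column 3); the function name `prefix_retention` is made up for illustration and is not part of any existing tool.

```python
def prefix_retention(conllu_text, prefix="neamh"):
    """Return (kept, total): tokens whose FORM starts with `prefix`,
    and how many of those also keep the prefix in the LEMMA column."""
    kept = total = 0
    for line in conllu_text.splitlines():
        # Skip blank lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        form, lemma = cols[1].lower(), cols[2].lower()
        if form.startswith(prefix):
            total += 1
            if lemma.startswith(prefix):
                kept += 1
    return kept, total
```

This only matches on the surface string, so it works for a prefix like neamh- whose initial consonant does not show lenition or eclipsis; prefixes that can be mutated (e.g. príomh- surfacing as phríomh-) would need the mutation undone first.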
I agree that the prefix should be kept in words like "droch-chlú" and "mí-ádh". I'm just wondering about tokens such as "an-gheit", "ana-dheas", "ró-fhuar", "sár-iarracht", "frith-Éireannach", "nua-Naitsithe", "lár-téarma", because they don't tend to appear in Irish dictionaries in that form, though the base words might appear without the prefix. I am also curious whether suffixes, e.g. "ghrá-sa", "ghrúpa-san", need a different approach.
I do prefer the simpler approach of keeping the prefixes for the reasons above, but there is certainly an argument in a case like "an-" since, as you note, those forms are rarely if ever listed in dictionaries, and because the hyphen makes it unlikely that a machine-learned tokenizer would strip the prefix when it shouldn't. "ró-" would be in second place for me, maybe "dea-" also since the hyphen is there, but I would only do this reluctantly!
I'd definitely be against stripping "sár", for example, since those words often do appear in dictionaries (sárdhuine, sáreolas, sármhaith, etc. in FGB), plus I think you'd need to preserve the prefix in some cases and not others... e.g. I wouldn't want "sárocsaíd" (peroxide) to have "ocsaíd" as a lemma. There's then the danger that a learned tokenizer will strip the prefix from OOV words that happen to have the same starting string (sáraitheach, etc.). A similar argument applies to nua-, lár-, etc.
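To make the over-stripping risk concrete, here is a rough sketch of what a purely string-based stripper would do. It is not how any existing lemmatiser works; `naive_strip`, `delenite` and the one-item prefix list are made up for illustration, and the delenition step is a crude heuristic.

```python
import re

# Hypothetical prefix list for illustration only; not an agreed inventory.
PREFIXES = ("sár",)

def delenite(word):
    """Undo initial lenition (e.g. 'dh' -> 'd'); a rough heuristic."""
    return re.sub(r"^([bcdfgmpst])h", r"\1", word)

def naive_strip(word):
    """Strip a known prefix wherever the string matches, with no lexicon check."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = word[len(prefix):].lstrip("-")
            return delenite(rest)
    return word

for w in ["sárdhuine", "sár-iarracht", "sárocsaíd", "sáraitheach"]:
    print(w, "->", naive_strip(w))
```

With these inputs the function gives "duine" and "iarracht", which is what a stripping policy would want, but also "ocsaíd" for "sárocsaíd" and "aitheach" for "sáraitheach", which are exactly the cases described above. Avoiding them needs a lexicon or per-word exceptions, which is part of why keeping the prefix is simpler.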
The emphatic suffixes are a different beast altogether (to me) since they don't change the underlying sense other than the emphasis. I'd be happier stripping those for the lemma, but I'll let Teresa weigh in.
Is the motivation for regarding a prefix as a dictionary entry perhaps linked to how productive it is? For example an- and ró- could potentially be attached to any adjective, while other prefixes are limited in their use.
Regarding emphatic suffixes, I wouldn't be inclined to split them into two tokens because I'd see them as a form variation, in the same way as we treat sibhse, eisean, chuidse. But I think we should remove the suffix in the lemma form: ghrá-sa -> grá.
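A minimal sketch of that lemma rule for the hyphenated case, keeping the token whole and only changing the lemma. The suffix inventory and the function name `lemma_without_emphatic` are assumptions for illustration; fused forms such as sibhse or chuidse are not handled, and the delenition step is a heuristic that ignores eclipsis and will over-apply to words whose citation form begins with a consonant + h.

```python
import re

# Assumed inventory of hyphenated emphatic suffixes; illustration only.
EMPHATIC = re.compile(r"-(san|sean|sa|se|na|ne)$")

def lemma_without_emphatic(token):
    """Drop a trailing hyphenated emphatic suffix, then undo initial lenition."""
    stem = EMPHATIC.sub("", token)
    return re.sub(r"^([bcdfgmpst])h", r"\1", stem)

print(lemma_without_emphatic("ghrá-sa"))     # grá
print(lemma_without_emphatic("ghrúpa-san"))  # grúpa
```

Hyphenated prefixes such as "mí-ádh" are left alone, since only a suffix at the end of the token is matched.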
A lot of these originated in Elaine's POS-tagged corpus, i.e. the output of the rule-based morphological analyser and lemmatiser. Do we want to maintain alignment with this, or just ensure consistency throughout, regardless of prefix type?
e.g. fho-alt: lemma = alt; phríomhfhoirgneamh: lemma = foirgneamh