IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

Morphological features of PROPNs #9

Open Hilla-Merhav opened 3 years ago

Hilla-Merhav commented 3 years ago

@amir-zeldes

We have recently discussed the morphological features of the PROPN ארצות הברית and concluded – no Gender and no Number for PROPNs in general, but we can keep Definite=Cons for cases like ארצות. The wiki corpus brings up some complicated PROPNs. For example, I had:

  1. Names of basic laws (חוק יסוד: משאל עם) – I tagged משאל עם as a PROPN. (same for יסוד: כבוד האדם וחירותו, etc).
  2. Names of organizations (מגן דוד אדום) or even buildings (בהמשך, עבר הבניין ליפו ונקרא : "הבניין למתן שירותי הדם ע"ש הקולונל מיקי סטון")
  3. Names of military operations (מבצע חומת מגן)

Should we remove all the Gender and Number features carried by these PROPNs (which for now are analyzed automatically as NOUNs)? In my batch ('Basic Law') this work might be time-consuming – would it be better to leave this removal to the QA?

amir-zeldes commented 3 years ago

I feel your pain in deleting the gender on these, but we can't have our cake and eat it... Either we agree to add Gender and Number to all PROPNs in HTB, so our training data is consistent, or we never add it. Having a mix will lead to chaos. An intermediate option is to keep them from now on and add them 100% automatically to HTB, but of course that will lead to some errors.

Adding @yifatbm for discussion, and if you can add Noam please do; I don't think I have his GH handle.

yifatbm commented 3 years ago

Thanks - adding @ivrit (Noam) to this thread :)

Hilla and I discussed this further earlier today, and it seems the main issue is whether there is more than one type of PROPN (namely, one that should receive Gender/Number feats, and one that should not). But it looks like @amir-zeldes's reply already addresses that, saying that we should treat all PROPNs the same. So in the auto-QA we will remove any feat other than Abbr=Yes and Definite=Cons for every token whose POS=PROPN.
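For reference, the auto-QA rule described above could look roughly like the sketch below. It assumes the data is in standard 10-column CoNLL-U and uses the UD spelling `Definite=Cons`; the function name `strip_propn_feats` is made up for illustration and is not part of any existing IAHLT script.

```python
# Hypothetical helper: keep only Abbr=Yes and Definite=Cons on PROPN tokens.
KEEP = {"Abbr=Yes", "Definite=Cons"}


def strip_propn_feats(conllu_line: str) -> str:
    """Return the line with all other FEATS removed if the token is a PROPN."""
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line  # sentence boundary or comment line
    newline = "\n" if conllu_line.endswith("\n") else ""
    cols = conllu_line.rstrip("\n").split("\t")
    # Only touch well-formed token lines whose UPOS (column 4) is PROPN.
    if len(cols) != 10 or cols[3] != "PROPN" or cols[5] == "_":
        return conllu_line
    kept = [feat for feat in cols[5].split("|") if feat in KEEP]
    cols[5] = "|".join(kept) if kept else "_"
    return "\t".join(cols) + newline


if __name__ == "__main__":
    line = "1\tארצות\tארץ\tPROPN\t_\tDefinite=Cons|Gender=Fem|Number=Plur\t0\troot\t_\t_"
    print(strip_propn_feats(line))
```

Tokens with any other POS, comment lines and blank lines pass through unchanged, so this can be run over a whole file line by line.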

BTW - I see that Definite=Cons is missing as a PROPN feat in the Guidelines' table for morph feats (https://github.com/ivrit/IAHLT-HTB-GUIDELINES/blob/main/Morphological%20features.md).

amir-zeldes commented 3 years ago

Yes - to be clear, I don't think it's a good and UD-conformant thing not to have gender for PROPNs, I just think it's worse to have multiple corpora with different practices, so if we're currently unable to add them to HTB, we should not be annotating them in new IAHLT materials (though as I've said, we can also auto-add them to everything and live with the occasional errors, or do a type-based sweep).

yifatbm commented 3 years ago

Yeah, I think it should be a good-enough solution to add the GENDER automatically. We can do it simultaneously for HTB and 'our' annotated sentences, at a later stage. @ivrit - I know you are very much in favor of providing GENDER to PROPNs, so I guess you agree :)

amir-zeldes commented 3 years ago

Mm.. OK, but if we do this then that means gender and number should be added to ALL of the PROPNs in the new data. If I had some time I could do this automatically in HTB, but I really really don't right now. In that case I think I'll just leave this issue open to represent this TODO on HTB.

yifatbm commented 3 years ago

Yes, of course. I mean adding those feats for all PROPNs found in 'our' corpus as well as in HTB.

@ivrit - do you think someone at our end would be able to handle this? Probably it's not that trivial, since we have to take into account the colligation of each PROPN (what is the gender of the verb it modifies, or the adj that modifies it, etc.).
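To make the agreement idea concrete, here is a rough sketch of how Gender could be read off the syntactic context. It is only an illustration under several assumptions: tokens are plain dicts with `id`, `head`, `deprel` and a parsed `feats` dict, only `nsubj*` heads and `amod` dependents are consulted, and the name `infer_gender` is invented here.

```python
from typing import Dict, List, Optional


def infer_gender(tokens: List[Dict], propn_id: int) -> Optional[str]:
    """Guess a PROPN's Gender from agreeing words; None if unclear or conflicting."""
    by_id = {tok["id"]: tok for tok in tokens}
    propn = by_id[propn_id]
    votes = set()

    def vote(tok: Dict) -> None:
        gender = tok["feats"].get("Gender")
        # Skip missing values and ambiguous ones such as "Fem,Masc".
        if gender and "," not in gender:
            votes.add(gender)

    # The predicate agrees with the PROPN when the PROPN is its subject.
    head = by_id.get(propn["head"])
    if head is not None and propn["deprel"].startswith("nsubj"):
        vote(head)
    # Adjectives modifying the PROPN agree with it as well.
    for tok in tokens:
        if tok["head"] == propn_id and tok["deprel"] == "amod":
            vote(tok)

    return votes.pop() if len(votes) == 1 else None
```

Returning `None` on conflict or missing evidence leaves those tokens for a fallback (for example the tagger-based values) or for manual review, rather than guessing.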

amir-zeldes commented 3 years ago

Just before I forget, what I said about this in the meeting: I think we should just force the morphological tagger to consider all PROPNs in the data as NOUNs, then extract the most probable Gender/Number classes and use those automatic predictions as the initial value, assuming we're auto-restoring them in all of the data.
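A minimal sketch of the last step, assuming the tagger (run with PROPNs treated as NOUNs) can expose per-feature probabilities as a dict like `{"Gender": {"Fem": 0.9, "Masc": 0.1}}`. That output format and the name `add_predicted_feats` are assumptions for illustration, not the interface of any particular tagger.

```python
from typing import Dict, Sequence


def add_predicted_feats(
    feats: str,
    probs: Dict[str, Dict[str, float]],
    names: Sequence[str] = ("Gender", "Number"),
) -> str:
    """Merge the most probable predicted values into a CoNLL-U FEATS string."""
    current = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    for name in names:
        dist = probs.get(name)
        # Existing manual values win; predictions only fill in gaps.
        if dist and name not in current:
            current[name] = max(dist, key=dist.get)
    if not current:
        return "_"
    # UD expects features sorted alphabetically by name (case-insensitive).
    ordered = sorted(current.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)
```

Leaving already-annotated values untouched is a design choice in this sketch; overwriting them with the prediction would be a one-line change if full consistency with the automatic output is preferred.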