Hilla-Merhav opened this issue 3 years ago (status: Open)
I feel your pain in deleting the gender on these, but we can't have our cake and eat it... Either we agree to add Gender and Number to all PROPNs in HTB, so our training data is consistent, or we never add it. Having a mix will lead to chaos. An intermediate option is to keep them from now on and add them 100% automatically to HTB, but of course that will lead to some errors.
Adding @yifatbm for discussion and if you can add Noam please do, I don't think I have his GH handle
Thanks - adding @ivrit (Noam) to this thread :)
Hilla and I discussed this further earlier today, and it seems the main issue is whether there is more than one type of PROPN (namely, one that should receive GENDER/NUMBER feats, and one that should not). But it looks like @amir-zeldes's reply already addresses that, saying that we should treat all PROPNs the same. So in the auto-QA we will remove any feat other than Abbr=Yes and Def=Cons, for every token whose POS=PROPN.
BTW - I see that Def=Cons is lacking as a PROPN feat in the Guidelines' table for morph feats (https://github.com/ivrit/IAHLT-HTB-GUIDELINES/blob/main/Morphological%20features.md).
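For reference, the auto-QA rule described above could be sketched roughly as follows. This is only an illustration, not the actual QA script: the function name `strip_propn_feats` is hypothetical, the input is assumed to be plain 10-column CoNLL-U, and it assumes that "Def=Cons" is spelled `Definite=Cons` in the FEATS column, as in UD.

```python
# Assumption: the feats the thread wants to keep on PROPN tokens,
# with "Def=Cons" written the UD way, as Definite=Cons.
ALLOWED_PROPN_FEATS = {("Abbr", "Yes"), ("Definite", "Cons")}


def strip_propn_feats(conllu_line, allowed=ALLOWED_PROPN_FEATS):
    """Return the line with every non-allowed feat removed if it is a PROPN token."""
    line = conllu_line.rstrip("\n")
    cols = line.split("\t")
    # Comments, blank lines and anything that is not a 10-column token row stay as-is.
    if line.startswith("#") or len(cols) != 10:
        return line
    # Only PROPN rows that actually carry feats need work.
    if cols[3] != "PROPN" or cols[5] == "_":
        return line
    kept = [f for f in cols[5].split("|") if tuple(f.split("=", 1)) in allowed]
    cols[5] = "|".join(kept) if kept else "_"
    return "\t".join(cols)
```

Rows with any other POS are returned unchanged, so the same pass can be run over HTB and the new IAHLT files alike.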
Yes - to be clear, I don't think it's a good or UD-conformant thing not to have gender for PROPNs, I just think it's worse to have multiple corpora with different practices, so if we're currently unable to add them to HTB, we should not be annotating them in new IAHLT materials (though as I've said, we can also auto-add them to everything and live with the occasional errors, or do a type-based sweep).
Yeah, I think it should be a good-enough solution to add the GENDER automatically. We can do it simultaneously for HTB and 'our' annotated sentences, at a later stage. @ivrit - I know you are very much in favor of providing GENDER to PROPNs, so I guess you agree :)
Mm.. OK, but if we do this then that means gender and number should be added to ALL of the PROPNs in the new data. If I had some time I could do this automatically in HTB, but I really really don't right now. In that case I think I'll just leave this issue open to represent this TODO on HTB.
Yes, of course. I mean adding those feats for all PROPNs found in 'our' corpus as well as in HTB.
@ivrit - do you think someone at our end would be able to handle this? It's probably not that trivial, since we have to take into account the colligation of each PROPN (what is the gender of the verb whose subject it is, or of the adj that modifies it, etc.).
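To make the colligation idea a bit more concrete, here is a minimal sketch of reading a PROPN's gender off the words that agree with it. Everything in it is an assumption for illustration: tokens are plain dicts with `id`, `head`, `upos`, `deprel` and a `feats` dict, the helper name `infer_gender` is made up, and only two agreement contexts are checked (a VERB/ADJ head when the PROPN is its `nsubj`, and ADJ dependents attached as `amod`). Real data would need more contexts and some handling of conflicts.

```python
def infer_gender(tokens, propn_id):
    """Guess the Gender of a PROPN from agreeing words; None if unclear."""
    by_id = {t["id"]: t for t in tokens}
    propn = by_id[propn_id]
    votes = set()

    # Context 1: the PROPN is the subject of a verb or predicative adjective.
    head = by_id.get(propn["head"])
    if head is not None and propn["deprel"] == "nsubj" and head["upos"] in {"VERB", "ADJ"}:
        gender = head["feats"].get("Gender")
        if gender:
            votes.add(gender)

    # Context 2: adjectives that modify the PROPN directly.
    for tok in tokens:
        if tok["head"] == propn_id and tok["deprel"] == "amod" and tok["upos"] == "ADJ":
            gender = tok["feats"].get("Gender")
            if gender:
                votes.add(gender)

    # Only trust the result when all agreeing words point the same way.
    return votes.pop() if len(votes) == 1 else None
```

When the agreeing words disagree, or there are none, the function returns `None`, which would leave the token for manual review or for a fallback such as a type-based sweep.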
Just before I forget, what I said about this in the meeting: I think we should just force the morphological tagger to consider all PROPNs in the data as NOUNs, then extract the most probable Gender/Number classes and use those automatic predictions as the initial value, assuming we're auto-restoring them in all of the data.
@amir-zeldes
We have recently discussed the morphological features of the PROPN ארצות הברית and concluded – no Gender and no Number for PROPNs in general, but we can keep Definite=Cons for cases like ארצות. The wiki corpus brings up some complicated PROPNs. For example, I had: