IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

Distinction between compound:affix and prefix=Yes? #39

Closed IsraelLand closed 2 years ago

IsraelLand commented 2 years ago

Hi @amir-zeldes

It's not clear to me whether the constrained prefix list (which we "loosened" a bit a short while ago) correlates directly, 100%, with the compound:affix deprel. If we could adopt a distinction between the two (or again, maybe one already exists), it might be beneficial.

A few weeks ago, I encountered חיידקים גראם חיוביים, which we decided is still a prefix, albeit kept in a separate, more obscure list. Fair enough. I've now encountered חומצה 6-אמינופניצילנית. If the compound:affix route is indeed the right one here, it seems quite absurd to move away from a set list to a list that contains stuff like digits.

I'm open to other tagging ideas, but the general question remains: should the prefix=Yes feature be a constrained, fixed one (covering only the original prefixes), with compound:affix being the more general one? Then we'd do away with the new list and apply prefix=Yes only to a fixed set of affix cases.

As @NathanD38 pointed out, compound:affix is also used for the ה etc. "thousands-quantifiers" in Hebrew years, but I don't suppose we apply the prefix=Yes feature to them.

amir-zeldes commented 2 years ago

TBH the whole Prefix feature is really optional. I have no issue with using it 1:1, in exactly the same cases as the compound:affix deprel. I think 6- makes sense, since it is a natural extension of the type "du-" and "tlat-" (and we can end the list by saying "and any prefix-type number of this sort").

IsraelLand commented 2 years ago

Right. So to sum up, what would you prefer (well, considering the feature is kinda optional, but we still strive for uniformity) -

  1. Prefix and compound:affix are unrelated. Only a fixed set of affixes gets the prefix=Yes feature.
  2. They're 1:1 correlated: if it's a compound:affix case, then prefix=Yes applies, even for Hebrew date years, and we don't need the list anymore.
  3. They're 1:1, but we keep the list (and document any new ones?), which I don't see as very useful, since they all get the same treatment based on environment alone.

Thanks

amir-zeldes commented 2 years ago

I'm basically for 2, although I recognize for the Hebrew date case it's a bit weird to call it a prefix. But excluding just that one seems a bit arbitrary, so maybe 1:1 is the most sensible.

IsraelLand commented 2 years ago

Yeah, treating them all in the same 1:1 way seems reasonable, all things considered. Thank you
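If it helps, the 1:1 rule could be checked mechanically. Below is a minimal sketch, not an existing script from this repo: it assumes standard 10-column CoNLL-U input and that the feature is spelled `Prefix=Yes` in the FEATS column (the feature name is a parameter in case the spelling differs). The helper names and the sample rows are invented for illustration only.

```python
def parse_conllu_tokens(text):
    """Yield (id, form, upos, feats, deprel) for each 10-column token line."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        feats = {}
        if cols[5] != "_":
            for pair in cols[5].split("|"):
                key, _, value = pair.partition("=")
                feats[key] = value
        yield cols[0], cols[1], cols[3], feats, cols[7]


def find_prefix_mismatches(text, feature="Prefix", deprel="compound:affix"):
    """Return (id, form) for tokens where the feature and the deprel disagree."""
    mismatches = []
    for tok_id, form, _upos, feats, rel in parse_conllu_tokens(text):
        has_feature = feats.get(feature) == "Yes"
        has_deprel = rel == deprel
        if has_feature != has_deprel:
            mismatches.append((tok_id, form))
    return mismatches


# Invented rows (hypothetical annotation, transliterated forms) for the demo.
ROWS = [
    ["1", "du", "_", "ADV", "_", "Prefix=Yes", "2", "compound:affix", "_", "_"],
    ["2", "word", "_", "ADJ", "_", "_", "0", "root", "_", "_"],
    ["3", "6-", "_", "NUM", "_", "_", "4", "compound:affix", "_", "_"],
    ["4", "acid", "_", "ADJ", "_", "_", "2", "conj", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

print(find_prefix_mismatches(SAMPLE))  # [('3', '6-')]
```

Token 3 is reported because it has the compound:affix deprel but no Prefix=Yes, which is exactly the kind of case option 2 would rule out.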

IsraelLand commented 2 years ago

Sorry for resurrecting this, but just for this very specific case: would this prefix be tagged ADV like all prefixes, or NUM like all numbers? I assume it's still a number, even though it's placed in a prefix position. Just making sure. Thank you

amir-zeldes commented 2 years ago

Ooh, good point. Yeah, if it's spelled like an Arabic numeral, it would be very weird to call it ADV. I suppose we should document this exception and indeed tag it NUM.
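For the documentation, the exception could be stated roughly as in the sketch below. This is only an illustration under assumptions: the function name is hypothetical, "spelled like an Arabic numeral" is read as ASCII digits with an optional trailing hyphen, and everything else defaults to ADV "like all prefixes". The thread does not settle the tag for the Hebrew-letter year quantifiers, so treat the default for those as unverified.

```python
import re

# Assumption: a numeral prefix is one or more ASCII digits, optionally
# followed by the hyphen that attaches it to the next word (e.g. "6-").
ARABIC_NUMERAL_PREFIX = re.compile(r"^[0-9]+-?$")


def expected_prefix_upos(form):
    """Return the UPOS this sketch expects for a compound:affix token."""
    if ARABIC_NUMERAL_PREFIX.match(form):
        return "NUM"  # the documented exception for digit prefixes
    return "ADV"  # default for all other prefixes


for form in ["6-", "6", "du", "tlat"]:
    print(form, expected_prefix_upos(form))
```

A check like this would only be applied to tokens that already carry the compound:affix deprel, not to numbers in general.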

IsraelLand commented 2 years ago

Great, I'll mention it. Thanks!