IAHLT / UD_Hebrew

Hebrew Universal Dependencies Treebank

tagging of מישהו - what should we do away with? #55

Open IsraelLand opened 2 years ago

IsraelLand commented 2 years ago

Hi @amir-zeldes

What POS should we apply to מישהו מטעמו? Obviously it's not the first time we're dealing with this, but these cases do come up:

  1. We need to tag for PronType=Ind
  2. Nouns cannot take this feature, but PRONs (and DETs) can

So if we assume it to be noun, we cannot tag Ind, but we're supposed to according to the guidelines.

We can do either -

  1. Go for noun and give up on the Ind "aspect" of things (currently against the guidelines)
  2. Prioritize PronType=Ind and tag PRON, which might not be the right POS. It also does not seem to be the popular choice across TBs.

(A third, side question: why can DETs get PronType=Ind and not nouns? I assume this is a more universal UD decision, so we cannot expect it to change, although the data seems to point otherwise.)

The same goes for איפשהו and איכשהו: assuming they are ADV, can they not get their specific features either?

Thank you

amir-zeldes commented 2 years ago

Your analysis of the validator's behavior for מישהו is accurate: if it's a NOUN it can't have those features. I should say that the indefinite substitutives not being pronouns is probably borrowed from English, which ultimately goes back to the PTB tagset's decision to tag them NN. In many other languages (e.g. Slavic), these are all tagged PRON, and if you wanted to do that (and if we update it in HTB), I wouldn't necessarily be against it; on the other hand, it's not a big deal since this is a closed class of items.

As for איפשהו איכשהו, that is not correct: the validator does allow ADV to carry PronType, since in many languages pronominal adverbs are still tagged ADV. I believe the top of the PronType page confirms this:

https://universaldependencies.org/u/feat/PronType.html
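
The constraint discussed above can be sketched as a small check over CoNLL-U lines. This is only an illustration, not the official UD validator: the set of UPOS tags allowed to carry PronType (`PRONTYPE_UPOS`) is an assumption taken from this thread (PRON, DET, ADV), and the function names and sample tokens are hypothetical.

```python
# Assumption based on this thread: UPOS tags that may carry PronType.
# The real validator reads its own per-language configuration.
PRONTYPE_UPOS = {"PRON", "DET", "ADV"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Gender=Masc|PronType=Ind' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def prontype_violations(conllu_text):
    """Return (id, form, upos) for tokens that carry PronType on a disallowed UPOS."""
    problems = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, upos, feats = cols[0], cols[1], cols[3], cols[5]
        if "PronType" in parse_feats(feats) and upos not in PRONTYPE_UPOS:
            problems.append((tok_id, form, upos))
    return problems


# Hypothetical token lines, for illustration only.
SAMPLE = "\n".join([
    "# text = hypothetical example",
    "1\tמישהו\tמישהו\tNOUN\t_\tPronType=Ind\t_\t_\t_\t_",
    "2\tאיפשהו\tאיפשהו\tADV\t_\tPronType=Ind\t_\t_\t_\t_",
])

print(prontype_violations(SAMPLE))
```

Under these assumptions, only the NOUN token is reported; the ADV token passes, which matches the point about pronominal adverbs.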

IsraelLand commented 2 years ago

As for the ADVs, my bad, I misread. I see you mentioned that they can carry these features as well.

As for the original question: what would you prefer? I think PRON captures מישהו well, with the added benefit of letting us add the feature. Otherwise, should we aim for uniformity across this "subset" of words?
For example, משהו is mostly tagged NOUN (as is מישהו), whereas the quite similar German (et)was is mostly PRON (where it isn't ADV) in most TBs, except PUD, where it is NOUN.

amir-zeldes commented 2 years ago

The only real reason to keep NOUN is backwards compatibility with HTB, but if we change it in the corrected HTB, then I would be OK with PRON.
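
If the change to PRON were adopted, the retagging could be sketched roughly as follows. This is a hypothetical helper, not an existing script in this repository: the lemma set `INDEFINITE_LEMMAS` contains only מישהו because that is the item discussed here, and the assumption that the LEMMA column holds that exact string may not match the actual treebank data. XPOS and all other columns are left untouched.

```python
# Hypothetical closed-class list; extend it only after agreeing on the other items.
INDEFINITE_LEMMAS = {"מישהו"}


def retag_line(line, lemmas=INDEFINITE_LEMMAS):
    """Retag a NOUN token whose lemma is in `lemmas` as PRON with PronType=Ind."""
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comments, blank lines, and malformed lines pass through
    if cols[2] in lemmas and cols[3] == "NOUN":
        cols[3] = "PRON"
        feats = {}
        if cols[5] != "_":
            feats = dict(item.split("=", 1) for item in cols[5].split("|"))
        feats["PronType"] = "Ind"
        # CoNLL-U expects features sorted alphabetically, ignoring case.
        ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
        cols[5] = "|".join(f"{name}={value}" for name, value in ordered)
    return "\t".join(cols)


# Hypothetical token line, for illustration only.
before = "1\tמישהו\tמישהו\tNOUN\t_\tGender=Masc|Number=Sing\t2\tnsubj\t_\t_"
print(retag_line(before))
```

The output would still need to go through the validator, since other features already on the token may not be permitted on PRON.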

IsraelLand commented 2 years ago

Thank you!