UniversalDependencies / UD_English-PUD

Parallel Universal Dependencies.

mistake on annotation of ranges #2

Open vcvpaiva opened 4 years ago

vcvpaiva commented 4 years ago

In sentences like:

  1. That 3% rate also applies to Nectar cardholders looking to borrow from £15,001-£19,999 over a period of between two and three years.

  2. Meanwhile, Bank of Scotland customers earn 3% on balances of £3,000-£5,000 when they add the free Vantage option to their account.

I believe the hyphen should be considered punctuation, not a symbol. At least, this is what happens in a sentence like:

  1. Theoretically, a couple could open four Tesco accounts and earn 3% on £12,000 – £360.

Yes, these are different kinds of "hyphen", but do you really want to make this distinction in terms of POS?

dan-zeman commented 4 years ago

The last one is also a different type of construction. It is not a range. It is an apposition that explains how much 3% of 12,000 is.

On the other hand, in ranges the hyphen can be read aloud as "to", which is the distinction between symbols and punctuation (although for me, hyphen feels borderline).

vcvpaiva commented 4 years ago

Yes, I realized it's a different type of construction; this is why I asked "do you really want to make this distinction in terms of POS?" It seems to me very much against the "easy to annotate" principle.

amir-zeldes commented 4 years ago

In both English-EWT and English-GUM, these hyphens in number ranges which are read as 'to' are analyzed as case, so that the second number can be e.g. nmod. In EWT they are tagged SYM and in GUM as ADP, though we should probably choose one and unify:

EWT: http://match.grew.fr/?corpus=UD_English-EWT@2.6&custom=5f831a41eabfd

GUM: http://match.grew.fr/?corpus=UD_English-GUM@2.6&custom=5f831b4c18290

I would find it very odd for a word with a proper grammatical function such as case to have upos=PUNCT. The xpos is already not the equivalent tag (:), so assuming we don't want to change the English xpos guidelines, using upos=PUNCT would also create a mapping discrepancy. Also note that, at least for GUM, the corpus passes through udapi's fix-punct block, meaning that if we tagged it as PUNCT, it could get re-attached in all sorts of bad ways.
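For anyone who wants to check a treebank for this combination, here is a minimal sketch in plain Python (no UD tooling assumed). It scans CoNLL-U text for hyphen-like tokens and flags those attached as case but tagged PUNCT. The three-token fragment and the function names `range_hyphens` and `suspicious` are made up for illustration; the fragment is not copied from EWT, GUM, or PUD.

```python
# Hypothetical fragment for "3,000-5,000"; columns follow the CoNLL-U order:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
ROWS = [
    ["1", "3,000", "3,000", "NUM", "CD", "_", "0", "root", "_", "_"],
    ["2", "-", "-", "SYM", "SYM", "_", "3", "case", "_", "_"],
    ["3", "5,000", "5,000", "NUM", "CD", "_", "1", "nmod", "_", "_"],
]
CONLLU = "\n".join("\t".join(row) for row in ROWS) + "\n"

HYPHENS = {"-", "\u2013"}  # ASCII hyphen-minus and en dash


def range_hyphens(conllu_text):
    """Yield (form, upos, deprel) for every hyphen-like token."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        form, upos, deprel = cols[1], cols[3], cols[7]
        if form in HYPHENS:
            yield form, upos, deprel


def suspicious(conllu_text):
    """Return hyphens that are attached as case but tagged PUNCT."""
    return [
        tok for tok in range_hyphens(conllu_text)
        if tok[2] == "case" and tok[1] == "PUNCT"
    ]


print(list(range_hyphens(CONLLU)))
print(suspicious(CONLLU))
```

Running `suspicious` over a real `.conllu` file (read in as a string) would list the tokens that combine a functional relation with a punctuation tag, which is the combination argued against above.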

nschneid commented 4 years ago

See discussion at UniversalDependencies/docs/issues/649

amir-zeldes commented 4 years ago

Right, no surprise we've already talked about this :)

So what's the verdict - which corpus should we change? Opinions @nschneid @sebschu / others? I don't feel strongly about it, except that it shouldn't be PUNCT.

vcvpaiva commented 4 years ago

Cool that you have already discussed the issue (and sorry for not having read it beforehand). However, the issue persists. Do you want the different kinds of hyphens to have different POS? That seems perverse. Do you want all of them to become the preposition "to"? That seems wrong (to me, at any rate). As usual, the question is: which is the least evil?

nschneid commented 4 years ago
amir-zeldes commented 4 years ago

OK, GUM source repo now has SYM/SYM and I unified the lemma of en-dash and hyphen to be hyphen, matching EWT:

amir-zeldes/gum@d4fd9d2974a3a9db6469e21a52d32fe800ed0651

Parenthetical dashes are (and were already) :/PUNCT, so no problems there.
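As a rough illustration of the lemma unification described above (not the code from the linked commit), the sketch below rewrites one CoNLL-U token line so that an en dash tagged SYM gets the hyphen as its lemma, while parenthetical dashes tagged PUNCT are left alone. The function name `unify_range_dash_lemma` is hypothetical.

```python
EN_DASH = "\u2013"


def unify_range_dash_lemma(line):
    """Set the lemma of an en dash tagged SYM to "-"; leave other lines as is."""
    cols = line.split("\t")
    # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    if len(cols) == 10 and cols[1] == EN_DASH and cols[3] == "SYM":
        cols[2] = "-"  # the surface form (column 2) keeps the en dash
    return "\t".join(cols)
```

Applied line by line to a `.conllu` file, this changes only the LEMMA column of range dashes; comment lines, blank lines, and PUNCT dashes pass through unchanged.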