UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Tokenization of hyphenated forms in English #1002

Open rhdunn opened 10 months ago

rhdunn commented 10 months ago

English treebanks annotate hyphenated compounds in several different ways, sometimes inconsistently both within a single treebank and across treebanks.

I'm basing this on https://universaldependencies.org/u/feat/Hyph.html.

Indo-Sri Lanka

EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:

12  Indo   Indo   X      AFX   _            15  compound  15:compound  SpaceAfter=No
13  -      -      PUNCT  HYPH  _            12  punct     12:punct     SpaceAfter=No
14  Sri    Sri    PROPN  NNP   Number=Sing  15  compound  15:compound  _
15  Lanka  Lanka  PROPN  NNP   Number=Sing  17  compound  17:compound  _

My understanding is that this should be:

12  Indo-  Indo-  X      AFX  Hyph=Yes     14  compound  14:compound  SpaceAfter=No
13  Sri    Sri    PROPN  NNP  Number=Sing  14  compound  14:compound  _
14  Lanka  Lanka  PROPN  NNP  Number=Sing  16  compound  16:compound  _

This should also apply to Anglo-Saxon, etc.
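
The retokenization proposed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not an official UD or EWT conversion script: the helper names `add_feat` and `merge_prefix_hyphen` are made up here, token IDs are assumed to be plain integers (no multiword ranges or empty nodes), and nothing is assumed to depend on the removed hyphen token. The example rows are split on whitespace for readability; real CoNLL-U files are tab-separated.

```python
def add_feat(feats, feat):
    """Add one Name=Value pair to a FEATS string, keeping it sorted."""
    items = set() if feats == "_" else set(feats.split("|"))
    items.add(feat)
    return "|".join(sorted(items, key=str.lower))


def merge_prefix_hyphen(rows):
    """Fold each HYPH token into the AFX token it attaches to, then renumber.

    rows: one list of 10 CoNLL-U column strings per token.
    Assumption: integer IDs only, and no token has the hyphen as its head.
    """
    kept, removed = [], []
    for row in rows:
        prev = kept[-1] if kept else None
        if prev and row[4] == "HYPH" and prev[4] == "AFX" and row[6] == prev[0]:
            prev[1] += row[1]                        # FORM:  Indo -> Indo-
            prev[2] += row[2]                        # LEMMA: Indo -> Indo-
            prev[5] = add_feat(prev[5], "Hyph=Yes")  # FEATS
            prev[9] = row[9]                         # MISC of the last piece
            removed.append(int(row[0]))
        else:
            kept.append(list(row))

    def shift(token_id):
        # Each removed token lowers every later ID (and head reference) by one.
        n = int(token_id)
        return str(n - sum(1 for r in removed if r < n))

    for row in kept:
        row[0] = shift(row[0])
        row[6] = shift(row[6])
        if row[8] != "_":
            deps = [d.split(":", 1) for d in row[8].split("|")]
            row[8] = "|".join(f"{shift(head)}:{rel}" for head, rel in deps)
    return kept


fragment = """\
12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No
13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No
14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _
15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _
"""
merged = merge_prefix_hyphen([line.split() for line in fragment.splitlines()])
for row in merged:
    print("\t".join(row))
```

On the EWT fragment above this yields three tokens, 12 to 14, with Indo- carrying Hyph=Yes and the head of Lanka renumbered from 17 to 16.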

Proto-Indo-European

GENTLE sent_id GENTLE_dictionary_school-8:

65  Proto-Indo-European Proto-Indo-European PROPN   NNP Number=Sing 66  compound    66:compound Entity=(33-abstract-new-cf19-2-sgl(34-abstract-new-cf23-1-coref-Proto%2DIndo%2DEuropean_language)|XML=<ref target:::"https://en.wikipedia.org/wiki/Proto-Indo-European_language"></ref>

My understanding is that this should be:

65  Proto-    proto-    X      AFX  Hyph=Yes     67  compound  67:compound  SpaceAfter=No
66  Indo-     Indo-     X      AFX  Hyph=Yes     67  compound  67:compound  SpaceAfter=No
67  European  European  PROPN  NNP  Number=Sing  68  compound  68:compound  _

This should also apply to pro-Muslim, anti-Semite, etc. with the pro-, anti-, etc. modifiers being their own AFX tokens.

dan-zeman commented 10 months ago

Hyph=Yes is indeed meant for the first part of such compounds when that part is a separate token and its form differs from that of the independent word. But the feature does not say anything about tokenization, that is, whether the hyphen should be part of the form or a separate token. We use Hyph=Yes in Czech, but we do not include the hyphen in the prefix token that carries the feature.
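
Since the feature leaves tokenization open, one way to see which convention a given treebank follows is to tally Hyph=Yes tokens by whether their FORM ends in a hyphen. The sketch below assumes tab-separated CoNLL-U input; the function name `hyph_conventions` and the two sample rows are made up for illustration and are not taken from any treebank.

```python
from collections import Counter


def hyph_conventions(lines):
    """Count Hyph=Yes tokens with and without a trailing hyphen in FORM."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines and malformed rows
        if "Hyph=Yes" in cols[5].split("|"):
            key = "hyphen_in_form" if cols[1].endswith("-") else "hyphen_not_in_form"
            counts[key] += 1
    return counts


# Illustrative rows only: one per convention discussed in this thread.
sample = [
    "\t".join(["1", "Indo-", "Indo-", "X", "AFX", "Hyph=Yes", "2", "compound", "_", "SpaceAfter=No"]),
    "\t".join(["1", "Indo", "Indo", "X", "AFX", "Hyph=Yes", "3", "compound", "_", "SpaceAfter=No"]),
]
print(hyph_conventions(sample))
```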

amir-zeldes commented 10 months ago

AFAIK, the actual convention for AFX in LDC corpora is not like in EWT - in OntoNotes, it is only used for the same situations that Dan is referring to, where the affix 'word' is a separate token due to spacing, e.g.:

As the second noun demonstrates, the standard has been to not separate prefixes like anti- when they are spelled together, and GENTLE (and the other GU corpora) follows this standard.

nschneid commented 10 months ago

Keeping the hyphen within the AFX token makes logical sense to me. I checked the EWT source trees from LDC and they do have the separated HYPH tokens, so either they changed their standard or didn't apply it consistently. There are very few AFX tokens with hyphens in EWT; I only see about 5.
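
A count like this can be reproduced with a short script. The sketch below is an assumption about how one might do it, not the query that was actually run: the function name `afx_hyphen_patterns` is hypothetical, input is assumed to be tab-separated CoNLL-U lines, and the sample rows reuse only the forms quoted earlier in this thread.

```python
def afx_hyphen_patterns(lines):
    """Return (merged, split) lists of AFX forms.

    merged: AFX tokens whose FORM already ends in a hyphen.
    split:  AFX tokens immediately followed by a separate HYPH token.
    """
    merged, split = [], []
    prev = None
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 10:
            prev = None  # comment or blank line: do not pair across sentences
            continue
        if cols[4] == "AFX" and cols[1].endswith("-"):
            merged.append(cols[1])
        if prev is not None and prev[4] == "AFX" and cols[4] == "HYPH":
            split.append(prev[1])
        prev = cols
    return merged, split


rows = [
    ["12", "Indo", "Indo", "X", "AFX", "_", "15", "compound", "15:compound", "SpaceAfter=No"],
    ["13", "-", "-", "PUNCT", "HYPH", "_", "12", "punct", "12:punct", "SpaceAfter=No"],
    ["", ],  # stands in for the blank line between sentences
    ["65", "Proto-", "proto-", "X", "AFX", "Hyph=Yes", "67", "compound", "_", "SpaceAfter=No"],
]
merged_forms, split_forms = afx_hyphen_patterns("\t".join(r) for r in rows)
print(len(merged_forms), "AFX tokens with the hyphen in the form:", merged_forms)
print(len(split_forms), "AFX tokens followed by a HYPH token:", split_forms)
```

To run it over a treebank, pass an open file object instead of the sample rows, for example `afx_hyphen_patterns(open(path, encoding="utf-8"))`, where `path` points at a local `.conllu` file.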