Closed rhdunn closed 1 week ago
EWT sent_id weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033:
12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No
13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No
14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _
15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _
my understanding is that this should be:
12 Indo- Indo- X AFX Hyph=Yes 14 compound 14:compound SpaceAfter=No
13 Sri Sri PROPN NNP Number=Sing 14 compound 14:compound _
14 Lanka Lanka PROPN NNP Number=Sing 16 compound 16:compound _
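The proposed change can be sketched in Python as follows. This is only an illustration, not an existing UD tool: the names `merge_affix_hyphen`, `add_feature`, and `remap` are hypothetical, and the sketch assumes plain integer token IDs (no multiword-token ranges or empty nodes). It folds a HYPH token into the preceding AFX token, adds Hyph=Yes, and renumbers ID, HEAD, and DEPS so that they stay consistent with each other.

```python
# Column positions of the ten CoNLL-U fields.
ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10)


def add_feature(feats, feat):
    """Add one feature to a FEATS value, keeping the usual alphabetical order."""
    if feats == "_":
        return feat
    return "|".join(sorted(set(feats.split("|")) | {feat}, key=str.lower))


def merge_affix_hyphen(rows):
    """Fold a HYPH token into the AFX token directly before it.

    Assumption: IDs are plain integers. The rows may be a fragment of a
    sentence, so heads pointing outside the fragment are renumbered too.
    """
    kept, removed = [], set()
    i = 0
    while i < len(rows):
        row = list(rows[i])
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if (nxt is not None and row[XPOS] == "AFX" and nxt[XPOS] == "HYPH"
                and nxt[HEAD] == row[ID]
                and "SpaceAfter=No" in row[MISC].split("|")):
            row[FORM] += nxt[FORM]
            row[LEMMA] += nxt[LEMMA]
            row[FEATS] = add_feature(row[FEATS], "Hyph=Yes")
            row[MISC] = nxt[MISC]          # spacing after the hyphen wins
            removed.add(int(nxt[ID]))
            i += 1                         # skip the absorbed hyphen
        kept.append(row)
        i += 1

    def remap(old):
        """Map an old token ID to its ID after the hyphens are removed."""
        n = int(old)
        if n in removed:                   # a pointer to the hyphen itself
            n -= 1                         # now points to the merged affix
        return str(n - sum(1 for r in removed if r < n))

    for row in kept:
        row[ID] = remap(row[ID])
        if row[HEAD] != "_":
            row[HEAD] = remap(row[HEAD])
        if row[DEPS] != "_":
            row[DEPS] = "|".join(
                remap(head) + ":" + rel
                for head, rel in (d.split(":", 1) for d in row[DEPS].split("|"))
            )
    return kept


# The EWT fragment quoted above (whitespace-separated here; real CoNLL-U uses tabs).
fragment = [line.split() for line in [
    "12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No",
    "13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No",
    "14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _",
    "15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _",
]]
result = merge_affix_hyphen(fragment)
for row in result:
    print("\t".join(row))
```

The merged token keeps the hyphen's MISC value, since the spacing that matters after the merge is the spacing that followed the hyphen.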
Hyph=Yes is indeed meant for the first part of such compounds when they are separate tokens and their form differs from the independent word. But it does not specify what should be done with tokenization, that is, whether the hyphen should be part of the form or a separate token. We use Hyph=Yes in Czech, but we don't include the hyphen in the token that contains the prefix and gets the feature.
AFAIK, the actual convention for AFX in LDC corpora is not like in EWT - in OntoNotes, it is only used for the same situations that Dan is referring to, where the affix 'word' is a separate token due to spacing, e.g.:
As the second noun demonstrates, the standard has been to not separate prefixes like anti- when they are spelled together, and GENTLE (and the other GU corpora) follows this standard.
Keeping the hyphen within the AFX token makes logical sense to me. I checked the EWT source trees from LDC and they do have the separated HYPH tokens, so either they changed their standard or didn't apply it consistently. There are very few AFX tokens with hyphens in EWT—I only see about 5.
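For anyone who wants to reproduce a count like this, here is a minimal sketch. The function name `find_afx_hyph_pairs` is hypothetical, and the sketch assumes standard tab-separated CoNLL-U with `# sent_id = ...` comments; it skips multiword-token and empty-node lines. It lists every AFX token that is immediately followed by a HYPH token attached to it.

```python
def find_afx_hyph_pairs(conllu_text):
    """Return (sent_id, merged_form) for each AFX token followed by its HYPH."""
    hits = []
    sent_id = None
    prev = None  # previous token row within the current sentence
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            prev = None
        elif not line.strip():
            prev = None                      # blank line ends the sentence
        elif not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue                     # skip ranges like 3-4 and nodes like 5.1
            # XPOS is column 5 (index 4), HEAD is column 7 (index 6).
            if (prev is not None and prev[4] == "AFX"
                    and cols[4] == "HYPH" and cols[6] == prev[0]):
                hits.append((sent_id, prev[1] + cols[1]))
            prev = cols
    return hits


# The EWT fragment from this thread, converted to tab-separated form.
rows = [
    "12 Indo Indo X AFX _ 15 compound 15:compound SpaceAfter=No",
    "13 - - PUNCT HYPH _ 12 punct 12:punct SpaceAfter=No",
    "14 Sri Sri PROPN NNP Number=Sing 15 compound 15:compound _",
    "15 Lanka Lanka PROPN NNP Number=Sing 17 compound 17:compound _",
]
sample = (
    "# sent_id = weblog-blogspot.com_dakbangla_20041119231111_ENG_20041119_231111-0033\n"
    + "\n".join("\t".join(r.split()) for r in rows)
    + "\n"
)
hits = find_afx_hyph_pairs(sample)
print(len(hits), hits)
```

To run it over a treebank, read each .conllu file as text and pass it to the function.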
Looking at hyphenated compounds, there are several ways that English treebanks annotate these, sometimes inconsistently within the same treebank and across treebanks.
I'm basing this on https://universaldependencies.org/u/feat/Hyph.html.
Indo-Sri Lanka
This should also apply to Anglo-Saxon, Proto-Indo-European, etc.
GENTLE sent_id GENTLE_dictionary_school-8
my understanding is that this should be:
This should also apply to pro-Muslim, anti-Semite, etc., with the pro-, anti-, etc. modifiers being their own AFX tokens.
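At the level of surface forms only, the proposed tokenization might look like the sketch below. The `PREFIXES` set and the `split_affixes` function are hypothetical: the set just collects the prefixes mentioned in this issue and is not a list taken from any treebank or guideline. Assigning UPOS, XPOS, Hyph=Yes, and dependencies would still have to happen separately.

```python
# Hypothetical prefix list, built from the examples in this issue.
PREFIXES = {"pro", "anti", "indo", "anglo", "proto"}


def split_affixes(form):
    """Split leading hyphenated prefixes off a form, keeping the hyphen on each prefix."""
    parts = form.split("-")
    tokens = []
    i = 0
    # Peel off known prefixes, but always leave the last part as the base word.
    while i < len(parts) - 1 and parts[i].lower() in PREFIXES:
        tokens.append(parts[i] + "-")
        i += 1
    rest = "-".join(parts[i:])
    if rest:
        tokens.append(rest)
    return tokens


for word in ["pro-Muslim", "anti-Semite", "Proto-Indo-European", "well-known"]:
    print(word, split_affixes(word))
```

Forms whose first part is not in the list, such as well-known, are left as a single token.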