tokenization of compounds

arademaker commented 3 years ago

Japanese newspaper Nihon Keizai Shimbun reported that the three giants plan to integrate their cargo computers and ground - cargo and air - cargo systems.

This is a sentence from LDC95T7, later expanded with SRL in LDC2012T04. The two datasets differ in tokenization. What is the current guideline in EWT for the tokenization of ground-cargo and air-cargo?

The conversion code from the Stanford team preserves the tokenization from the PTB analysis:

1       Japanese        japanese        ADJ     JJ      _       5       amod    _       _
2       newspaper       newspaper       NOUN    NN      _       5       compound        _       _
3       Nihon   Nihon   PROPN   NNP     _       5       compound        _       _
4       Keizai  Keizai  PROPN   NNP     _       5       compound        _       _
5       Shimbun Shimbun PROPN   NNP     _       6       nsubj   _       _
6       reported        report  VERB    VBD     _       0       root    _       _
7       that    that    SCONJ   IN      _       11      mark    _       _
8       the     the     DET     DT      _       10      det     _       _
9       three   three   NUM     CD      _       10      nummod  _       _
10      giants  giant   NOUN    NNS     _       11      nsubj   _       _
11      plan    plan    VERB    VBP     _       6       ccomp   _       _
12      to      to      PART    TO      _       13      mark    _       _
13      integrate       integrate       VERB    VB      _       11      xcomp   _       _
14      their   they    PRON    PRP$    _       16      nmod:poss       _       _
15      cargo   cargo   NOUN    NN      _       16      compound        _       _
16      computers       computer        NOUN    NNS     _       13      obj     _       _
17      and     and     CCONJ   CC      _       21      cc      _       _
18      ground-cargo    ground-cargo    NOUN    NN      _       21      amod    _       _
19      and     and     CCONJ   CC      _       20      cc      _       _
20      air-cargo       air-cargo       NOUN    NN      _       18      conj    _       _
21      systems system  NOUN    NNS     _       16      conj    _       _
22      .       .       PUNCT   .       _       6       punct   _       _
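
To see how many tokens of this kind a converted treebank contains, one can scan the FORM column for internal hyphens. This is a minimal sketch, assuming tab-separated CoNLL-U input; the function name `hyphenated_forms` is hypothetical and not part of any converter.

```python
import re

# A word character on each side of a hyphen: matches "air-cargo", not a bare "-".
INTERNAL_HYPHEN = re.compile(r"\w-\w")

def hyphenated_forms(conllu_text):
    """Return (ID, FORM, DEPREL) for word lines whose FORM has an internal hyphen."""
    hits = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword-token ranges (1-2), empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if INTERNAL_HYPHEN.search(cols[1]):
            hits.append((cols[0], cols[1], cols[7]))
    return hits

# Three rows from the analysis above, re-typed with tab separators.
sample = "\n".join([
    "17\tand\tand\tCCONJ\tCC\t_\t21\tcc\t_\t_",
    "18\tground-cargo\tground-cargo\tNOUN\tNN\t_\t21\tamod\t_\t_",
    "20\tair-cargo\tair-cargo\tNOUN\tNN\t_\t18\tconj\t_\t_",
])
for hit in hyphenated_forms(sample):
    print(hit)
```

Reporting the DEPREL alongside the form makes it easy to spot unsplit tokens that carry an unexpected relation, such as the `amod` on a NOUN here.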

If not a single token, I would take these two cases as compound, right? But I found many more examples like so-called, buy-outs, re-election, etc. It is not clear whether all of them should be compounds. If we split the last one into two tokens, for instance, we would have a free-standing morpheme that is not a syntactic word.

manning commented 3 years ago

They should both be split up. But, yes, the Stanford conversion code just works with whatever tokens are in the source constituency treebank; it does not adjust tokenization.

For EWT, we follow the "new LDC treebank" tokenization used in the original (LDC2012T13) and other LDC Treebanks released in the 21st century. The policy for hyphenation there is that most hyphenated expressions are broken up but not those for various short prefixes (and a few suffixes), so you still get e-mail or co-operation.
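
The policy described here can be sketched as a small splitting rule. This is only an illustration under stated assumptions: the prefix set contains just the two prefixes visible in the examples above ("e", "co"), the suffix set is left empty because no suffixes are named here, and `split_hyphens` is a hypothetical helper. The actual lists are defined by the LDC guidelines, not by this code.

```python
# ASSUMPTION: placeholder affix lists; the real ones come from the LDC guidelines.
KEEP_PREFIXES = {"e", "co"}   # taken from the examples "e-mail" and "co-operation"
KEEP_SUFFIXES = set()         # no suffixes are named in this discussion

def split_hyphens(word):
    """Split a word at hyphens unless it starts or ends with a protected affix."""
    parts = word.split("-")
    # Leave bare hyphens, dashes, and words with leading/trailing hyphens alone.
    if len(parts) < 2 or any(part == "" for part in parts):
        return [word]
    if parts[0].lower() in KEEP_PREFIXES or parts[-1].lower() in KEEP_SUFFIXES:
        return [word]
    tokens = []
    for index, part in enumerate(parts):
        if index > 0:
            tokens.append("-")  # the hyphen becomes its own token
        tokens.append(part)
    return tokens

print(split_hyphens("ground-cargo"))
print(split_hyphens("e-mail"))
```

With these placeholder lists, "re-election" would be split, which is exactly the kind of case where the real prefix list matters.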

Note that hyphenation is at present not consistent across UD_English treebanks. E.g., GUM does not split up hyphenated terms.

If you have it available, you can find the same sentences in new LDC treebank tokenization in OntoNotes (LDC2013T19) - or sometimes it's necessary to look in LDC2015T13. These later versions also have the advantage of correctly showing noun compound modification structure. For example, this sentence becomes:

(TOP
  (S
    (NP-SBJ
      (NML (JJ Japanese) (NN newspaper))
      (NNP Nihon) (NNP Keizai) (NNP Shimbun))
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP-SBJ-1 (DT the) (CD three) (NNS giants))
          (VP (VBP plan)
            (S
              (NP-SBJ (-NONE- *PRO*-1))
              (VP (TO to)
                (VP (VB integrate)
                  (NP
                    (NP (PRP$ their) (NN cargo) (NNS computers))
                    (CC and)
                    (NP
                      (NML
                        (NML (NN ground) (HYPH -) (NN cargo))
                        (CC and)
                        (NML (NN air) (HYPH -) (NN cargo)))
                      (NNS systems))))))))))
    (. .)))
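
To compare tokenizations across releases, the terminals of a bracketed tree like this one can be read off with a short script. This is a minimal sketch that assumes every terminal has the form `(TAG word)` with no spaces inside the word; `leaves` is a hypothetical helper, not an LDC or Stanford tool.

```python
import re

# A terminal is "(TAG word)" where neither part contains whitespace or parentheses.
TERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def leaves(tree):
    """Return (tag, word) pairs for terminals, skipping -NONE- empty elements."""
    return [(tag, word) for tag, word in TERMINAL.findall(tree) if tag != "-NONE-"]

fragment = """
(NML
  (NML (NN ground) (HYPH -) (NN cargo))
  (CC and)
  (NML (NN air) (HYPH -) (NN cargo)))
"""
print([word for _tag, word in leaves(fragment)])
```

Filtering out `-NONE-` keeps traces such as `*PRO*-1` from being counted as tokens.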

amir-zeldes commented 3 years ago

GUM does not split up hyphenated terms

Yes, just chiming in to say this is correct: GUM has PTB-style tokenization, where hyphenated words whose parts do not constitute separate words are not tokenized apart. This means that "Bill Clinton-Al Gore relations" would be 6 tokens (since "Clinton-Al" is not a single word, and must be tokenized), but most hyphenated words which can appear in isolation, such as "left-handed" or "All-District", are kept together. Basically, things that have a sensible POS and valency role by themselves generally stay as one token.

One of the reasons we haven't looked into changing this is that the current analysis of hyphenated modifiers in EWT is sometimes a little odd, mainly for deprels but also POS. For things that are just "coincidentally" written with hyphen but could be written apart, it usually makes sense to me, and is often just compound (such as "air-cargo"). But consider these cases:

AK-47 - compound(47,AK)
Sector-37 - nummod(Sector,37)
ill-fated - ill/ADJ amod(fated, ill)
99-tonne cache - nummod(tonne,99) but tonne has Number=Sing
wrong/ADJ-doers - nmod:npmod(doers,wrong)
US-trained - obl:npmod(trained,US)

Some of them also vary:

"Al-" in Arabic names is sometimes compound modifier, sometimes head and flat (the latter seems better to me)
independent-minded - independent/ADJ amod(minded,independent) (this is the same as ill-fated) BUT: strong/ADJ-arm/VERB - advmod(arm,strong) - not sure how this passes validation with ADJ+advmod?
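
Variation of this kind can be surveyed by tallying which relation links the two sides of each split hyphen. This is a rough sketch, assuming each sentence is a list of dicts with integer `id` and `head` fields and that the hyphen carries xpos `HYPH`; the function name and the sample heads outside the hyphenated pair are hypothetical.

```python
from collections import Counter

def hyphen_pair_relations(sentence):
    """Count (deprel, head side) for every X - Y sequence joined by a HYPH token."""
    counts = Counter()
    for left, hyph, right in zip(sentence, sentence[1:], sentence[2:]):
        if hyph["xpos"] != "HYPH":
            continue
        if left["head"] == right["id"]:
            counts[(left["deprel"], "head=right")] += 1
        elif right["head"] == left["id"]:
            counts[(right["deprel"], "head=left")] += 1
        else:
            counts[("unrelated", "-")] += 1
    return counts

# amod(fated, ill) as listed above; the root attachment of "fated" is assumed.
ill_fated = [
    {"id": 1, "form": "ill", "xpos": "JJ", "head": 3, "deprel": "amod"},
    {"id": 2, "form": "-", "xpos": "HYPH", "head": 3, "deprel": "punct"},
    {"id": 3, "form": "fated", "xpos": "JJ", "head": 0, "deprel": "root"},
]
# nummod(Sector, 37) as listed above; the root attachment of "Sector" is assumed.
sector_37 = [
    {"id": 1, "form": "Sector", "xpos": "NNP", "head": 0, "deprel": "root"},
    {"id": 2, "form": "-", "xpos": "HYPH", "head": 3, "deprel": "punct"},
    {"id": 3, "form": "37", "xpos": "CD", "head": 1, "deprel": "nummod"},
]
print(hyphen_pair_relations(ill_fated) + hyphen_pair_relations(sector_37))
```

Summing the counters over a whole treebank gives a quick overview of which relations are in use and on which side the head falls.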

I think we'd consider changing the tokenization in GUM if I had a better understanding of what the target guidelines are. It seems like sometimes ideas from a paraphrased phrasal version of what is realized morphologically as a compound are taken into the hyphenated word analysis, and other times it's a more straightforward morphological analysis (often just "compound", which seems fine to me TBH).

For some of the suggested relations in EWT, such as amod, nummod and :npmods, I think the "syntax below zero" way of thinking leads to problems, since the agreement morphology isn't there and kind of shows that we are dealing with something different syntactically (or simply that we are dealing with morphology here, and not syntax at all). The same issue extends to POS: I agree it's odd to tag "ill" or "independent" as ADV, but at the same time "independent" is how one is minded, and it is modifying something adjectival, so it should be advmod if we treat this as a syntactic modification...

manning commented 3 years ago

Yes, I agree that the current LDC rules end up an awkward compromise. Though, actually, I think dividing hyphenations is mostly better (even for most of the examples you cite above) but it becomes ridiculous when it splits things like model names, abbreviations, or certain people's names where the parts just shouldn't be separated, like "AK-47", "T-34", "F-16" or Korean names like "Hye-won" or "Jae-hoon". But it is what it is.

amir-zeldes commented 3 years ago

OK, so there are two issues here:

1. What should be split - this is currently non-great, but may be better than not splitting anything, so I'm willing to consider mainly just "doing what EWT does" in GUM (though, should we avoid splitting things like AK-47? Or should we split them on purpose to be "like EWT"?)
2. What the deprels should be - this is currently not clear enough to me in order to implement in GUM, so it prevents us from applying 1.

Since we are still actively expanding the corpus, it would be nice to consolidate tokenization with EWT before this semester's GUM expansion (and its tokenization) happen, but that would require an authoritative analysis of the right thing to do, since we definitely don't want to do this twice and I need to put it into explicit guidelines for the teaching materials.

Are there guidelines for doing deprels for EWT-style hyphenated words somewhere? And also, I'm willing to pick this up for GUM, but is there someone willing to look into consistency within EWT?

nschneid commented 3 years ago

I think there is some virtue in the simplest possible tokenization rule, which would be to always tokenize hyphens. This also has the benefit of OntoNotes/recent LDC compatibility.

We can always use flat for rare cases like AK-47 where the hyphen is a fixed part of a name.

As for the internal deprels of other hyphenated expressions, my inclination is to annotate transparent syntax (such as adjective and PP modifiers) but default to compound for any compound-specific word order or morphology, e.g. "cost-specific", "man-eating". Cf. UniversalDependencies/docs#753

nschneid commented 3 years ago

Also, EWT currently uses amod for many ADJ-ADJ compounds ("narrow-minded", "American-Islamic"), which strikes me as odd—compounding of two adjectives is a distinct construction from normal adjectival modification, so why not use compound for these?

amir-zeldes commented 3 years ago

We can always use flat for rare cases like AK-47 where the hyphen is a fixed part of a name.

I mean, the problem is that it isn't always there ("AK47"), so you're getting different amounts of tokens with/without the hyphen, but I guess I can live with that. Does the hyphen get "flat" or "punct"? Is it :/HYPH/SYM or something else?

EWT currently uses amod for many ADJ-ADJ compounds ("narrow-minded", "American-Islamic")

Yes, I agree this seems wrong - I think our options are compound like you say, in which case we should keep tagging ADJ-ADJ, or advmod (=narrowly minded) but then tag ADV. Overall I'm with you in preferring ADJ+compound, but note that "strong-arm" above seems to be going against that policy, while ill-fated does have ADJ, but the odd amod relation you pointed out above.

nschneid commented 3 years ago

PUNCT and punct for the hyphen, I think. That way "AK 47" and "AK-47" are the same modulo the hyphen.
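
The "same modulo the hyphen" idea can be checked mechanically: drop the punct tokens, renumber, and compare. This is a minimal sketch under assumptions: tokens are `(id, form, head, deprel)` tuples, the two parts are joined by `flat` as suggested earlier, the root attachment is made up for the example, and `without_punct` is a hypothetical helper.

```python
def without_punct(tokens):
    """Drop punct tokens and renumber ids and heads.

    Assumes no remaining token is headed by a punct token.
    """
    kept = [tok for tok in tokens if tok[3] != "punct"]
    new_id = {tok[0]: position for position, tok in enumerate(kept, start=1)}
    new_id[0] = 0  # the artificial root keeps index 0
    return [(new_id[i], form, new_id[head], deprel) for i, form, head, deprel in kept]

hyphenated = [
    (1, "AK", 0, "root"),
    (2, "-", 3, "punct"),
    (3, "47", 1, "flat"),
]
spaced = [
    (1, "AK", 0, "root"),
    (2, "47", 1, "flat"),
]
print(without_punct(hyphenated) == spaced)
```

Where the hyphen attaches does not affect the comparison, since punct tokens are removed before renumbering.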

nschneid commented 3 years ago

"Strong-arm" is a tricky example of a compound verb where neither of the constituents is natively a verb. A more proper treatment of the MWE structure might be to have "arm" as a NOUN modified by "strong" and then use ExtPos=VERB on "arm". But I can live with the current analysis that "arm" itself has been coerced into a VERB. Would prefer compound over the current advmod, though.

nschneid commented 3 years ago

Another issue with hyphenation is whether for hyphenated prefixes the hyphen attaches to the prefix or the stem. EWT has left-attaching punct in "al-Sadr" and "F-102", right-attaching punct in "al-Qaeda". What should be the policy? Attach punctuation to the right by default, with an exception for clitic-like prefixes such as "anti" and Arabic "al"?
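
A quick way to measure how mixed the current attachments are is to count, for every hyphen, whether its head precedes or follows it. This is a sketch only: tokens are assumed to be `(id, form, head, deprel)` tuples with integer ids, "left-attaching" is read here as "the head is the token to the left", and the non-hyphen attachments in the samples are made up.

```python
from collections import Counter

def hyphen_attachment(sentence):
    """Count hyphens whose head precedes ("left") or follows ("right") them."""
    counts = Counter()
    for tok_id, form, head, deprel in sentence:
        if form == "-" and deprel == "punct":
            counts["left" if head < tok_id else "right"] += 1
    return counts

# Hypothetical analyses: only the direction of the punct attachment matters here.
al_sadr = [(1, "al", 3, "compound"), (2, "-", 1, "punct"), (3, "Sadr", 0, "root")]
al_qaeda = [(1, "al", 3, "compound"), (2, "-", 3, "punct"), (3, "Qaeda", 0, "root")]
print(hyphen_attachment(al_sadr) + hyphen_attachment(al_qaeda))
```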

amir-zeldes commented 3 years ago

I would tend to agree ideally, but xpos=":" for the hyphen complicates automatic assignment of HYPH for the newly created tokens in GUM, and it is actually not the case in EWT (AK-47 has "-" as xpos="HYPH").

"strong arm" as ADJ+compound, VERB+xyz sounds good to me

Punctuation attachment is ostensibly meaningless IMO (and GUM assigns it via udapi as a last postprocessing step BTW), so I don't much care about that. I'm a little queasy about breaking up things like "F-102" in GUM which are currently actually correct from my perspective and we would be 'breaking' on purpose... Very unsure "al-" should be broken up too, esp. since it's sometimes in non-Arabic languages (names of places in Pakistan, people from Bangladesh, etc.) where it is not analyzable as an article anymore.

Also, would we break English proper names with hyphens? Or Chinese ones like Hui Chi-fung? What about COVID-19? How about these:

Jorvik (pronounced " Yor-vik " )
Mary-Kate and Ashley Olsen
The villages of Mof-Ávvi
i-Phone
Magi-Fest (name of a magic festival)

The more I look at examples from GUM, the more unsure I'm becoming of the idea of just splitting ALL hyphens...

nschneid commented 3 years ago

I agree that if tokenization were done manually based on intuition I wouldn't split up hyphenated names like "al-Qaeda" or "AK-47". But erring on the small token side doesn't seem too terrible to me because we have relations like flat to glue things together if we don't want to call either part the head. (Multiword tokens would be an option as well.)

Further, many hyphenations resemble constructions that can occur without hyphens ("Mary-Kate" is not all that different from "Mary Kate" except in spelling; pronunciation cues could be spelled with spaces instead of hyphens; and there are product names with a non-hyphenated number). "Al Qaeda" is often not hyphenated. And there are bound to be plenty of cases where tokenization is non-obvious. ("The pig goes oink-oink-oink": are the hyphens strictly necessary? "OH-03" for Ohio's third Congressional district. "Type Ctrl-Alt-Del". Etc.)

So I guess the question is whether you would find it more appealing to develop and enforce a new UD-specific policy for which hyphens not to tokenize. :)

P.S. "i-Phone" is arguably not spelled correctly and could be goeswith(i, Phone).

amir-zeldes commented 3 years ago

I see what you're saying about flat, but I think some of them are not equivalent. I think "Mary-Kate" is not two names, it's a single name, and it's pronounced differently from "Mary Kate" (middle name Kate), which would appear the same under "flat" annotation. And for things like "Yor-vik" it seems straight up wrong to separate them - that hyphen glyph is just giving the syllabification of how to pronounce something which is clearly a single word in English (I suppose the vik must be Old Norse for village or the like, but still...)

Maybe we could say that they should be separated only if the relation would not be flat?

nschneid commented 3 years ago

I see what you're saying about flat, but I think some of them are not equivalent. I think "Mary-Kate" is not two names, it's a single name, and it's pronounced differently from "Mary Kate" (middle name Kate), which would appear the same under "flat" annotation.

I wouldn't know to pronounce "Mary-Kate" differently from "Mary Kate". Hyphenated surnames are a similar example. I can understand there's a semantic difference but it does not bother me to call it two words joined by flat.

And for things like "Yor-vik" it seems straight up wrong to separate them - that hyphen glyph is just giving the syllabification of how to pronounce something which is clearly a single word in English (I suppose the vik must be Old Norse for village or the like, but still...)

Metalinguistic stuff like a pronunciation guide is an explicit departure from normal orthographic conventions of English. I would treat it like a foreign phrase.

Maybe we could say that they should be separated only if the relation would not be flat?

Then you're bringing syntactic considerations ("is there a head?") into your tokenization decisions, which could be dicey...

nschneid commented 3 years ago

There is also the issue of hyphens used as dashes or something akin to dashes. If I speak of the Bush-Cheney ticket or the Smoot-Hawley Tariff, it would seem important to use multiple tokens as two separate entities can be identified. Those are (at least plausibly) annotated as flat, though.

amir-zeldes commented 3 years ago

Then you're bringing syntactic considerations ("is there a head?") into your tokenization decisions, which could be dicey...

Yes, you're right, that's probably a bad idea. But from your response to Yor-vik it sounds like you're OK with there being some rare exceptions to tokenizing hyphens (which I guess has to be true anyway, since in URLs or filenames they shouldn't be split either, right?). So I think we're on the same page here, let me look into this some more for GUM.
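
An "always split, with a few exceptions" rule of this kind could look roughly like the following. The URL and filename patterns are crude placeholders chosen for illustration, not taken from the EWT or GUM guidelines, and `tokenize_hyphens` is a hypothetical name.

```python
import re

# ASSUMPTION: deliberately simple patterns; real guidelines would need more care.
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
FILENAME = re.compile(r"^\S+\.[A-Za-z0-9]{1,4}$")  # something ending in ".ext"

def tokenize_hyphens(word):
    """Split at every hyphen unless the word looks like a URL or a filename."""
    if URL.match(word) or FILENAME.match(word):
        return [word]
    # The capturing group keeps each hyphen as its own token; drop empty pieces.
    return [piece for piece in re.split(r"(-)", word) if piece]

print(tokenize_hyphens("Bush-Cheney"))
print(tokenize_hyphens("https://example.org/a-b"))
print(tokenize_hyphens("my-file.txt"))
```

Keeping the exceptions purely form-based avoids bringing syntactic considerations into the tokenization decision.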

Bush-Cheney ticket

Things like that are already tokenized apart in GUM (see Clinton/Gore discussion above), though we went with compound instead of flat, in the tradition of copulative compound analysis. I think it's more linguistically correct (at least from a traditional morphology perspective), and using flat can create the impression that "Bush-Cheney" is a double-barrelled name (I know flat does not mean name, but still)

nschneid commented 3 years ago

Right, the problem is that the compound relation works best for endocentric compounds, where we assert that the second element is the head. That doesn't ring true for Bush-Cheney. In principle I suppose this could justify a new subtype of compound, or even consider conj. But that's another discussion for another issue.

arademaker commented 3 years ago

Similar to this last example, names of teams combined to describe a match:

Flamengo-Palmeiras last Saturday...

I would consider this as conj.

arademaker commented 3 years ago

@manning , thank you for your explanation above, but I would like to clarify the relation between https://catalog.ldc.upenn.edu/LDC2013T19 (OntoNotes 5.0) and https://catalog.ldc.upenn.edu/LDC2012T04 (2009 CoNLL Shared Task). The documentation of LDC2012T04 says that the texts came from https://catalog.ldc.upenn.edu/LDC95T7. I didn't know that LDC95T7 is part of LDC2013T19 (OntoNotes). I just found ontonotes-release 5.0/data/files/data/english/annotations/nw/wsj/ with the same WSJ subdirectories as LDC95T7/combined/wsj/!

So LDC95T7 is part of LDC2013T19! My bad! I didn't know that. I also didn't know about https://catalog.ldc.upenn.edu/LDC2015T13, it is a revised version of https://catalog.ldc.upenn.edu/LDC95T7 but sadly, there is no reference to this new version in the https://catalog.ldc.upenn.edu/LDC95T7 page.

Since I am projecting the SRL into UD annotations obtained from LDC95T7, and I have already done that for the https://github.com/propbank/propbank-release (EWT and OntoNotes), I already have the data I was trying to obtain from LDC2012T04. I suspect that the SRL annotations from LDC2012T04 also differ from the OntoNotes annotations in https://github.com/propbank/propbank-release, the latter probably being the most recent. Am I right? Maybe @MarthaSPalmer has something to add here and can confirm my understanding!

amir-zeldes commented 3 years ago

OK, this is now done in GUM and GUMReddit:

UniversalDependencies/UD_English-GUM@b46616da75bd5e4e50fca80ad645b405eb78412b UniversalDependencies/UD_English-GUMReddit@d422ab3e5f15e33401942b0cefdab48887fdd6d8

Tokenization should now be compatible between EWT and GUM, and I would feel comfortable training a tokenizer jointly on the two datasets. We've also added the xpos HYPH as needed, but not NFP.

Thanks for your help @logan-siyao-peng !

UniversalDependencies / UD_English-EWT

tokenization of compounds #204