UniversalDependencies / UD_English-EWT

English data
Creative Commons Attribution Share Alike 4.0 International

abbreviation + punctuation #296

Open kanayamah opened 2 years ago

kanayamah commented 2 years ago

When an abbreviation with period (e.g. U.S.A.) appears at the end of a sentence, the last period is regarded as the sentence-end punctuation.

# sent_id = email-enronsent17_01-0016
# text = 5. General Colin Powell, former Chairman, Joint Chiefs of Staff, U.S.A.
15  U.S.A   U.S.A   PROPN   NNP Number=Sing 4   list    4:list  SpaceAfter=No
16  .   .   PUNCT   .   _   4   punct   4:punct _

But the word U.S.A looks awkward. Shouldn't it be tokenized as U.S.A., on the assumption that the sentence-final punctuation is omitted?

Similar cases are p.m, C.V, L.L.C, i.e and U.S. http://match.grew.fr/?corpus=UD_English-EWT@2.9&custom=61eaeda1b353d
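To make the pattern concrete, here is a minimal Python sketch that flags a sentence-final abbreviation whose last period was split off as the PUNCT token. It assumes tab-separated 10-column CoNLL-U input; the regular expression and the function names are illustrative assumptions, not part of any UD tool.

```python
import re

# Abbreviation-like form: letters separated by periods, with no final period
# (e.g. "U.S.A", "p.m", "i.e"). This pattern is an illustrative assumption.
ABBREV_NO_FINAL_PERIOD = re.compile(r"^(?:[A-Za-z]+\.)+[A-Za-z]+$")


def parse_tokens(conllu_sentence):
    """Return a list of 10-column rows, skipping comments and blank lines."""
    rows = []
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(line.split("\t"))
    return rows


def find_split_abbreviations(rows):
    """Yield (ID, FORM) for an abbreviation directly followed by the final "." token."""
    for tok, nxt in zip(rows, rows[1:]):
        if (
            nxt is rows[-1]                              # next token ends the sentence
            and nxt[1] == "."
            and nxt[3] == "PUNCT"
            and "SpaceAfter=No" in tok[9].split("|")     # no space before the period
            and ABBREV_NO_FINAL_PERIOD.match(tok[1])
        ):
            yield tok[0], tok[1]


# The two relevant rows of email-enronsent17_01-0016, rebuilt with tab separators.
example = "\n".join(
    [
        "# sent_id = email-enronsent17_01-0016",
        "\t".join(["15", "U.S.A", "U.S.A", "PROPN", "NNP", "Number=Sing", "4", "list", "4:list", "SpaceAfter=No"]),
        "\t".join(["16", ".", ".", "PUNCT", ".", "_", "4", "punct", "4:punct", "_"]),
    ]
)

print(list(find_split_abbreviations(parse_tokens(example))))
```

The check is deliberately narrow: it only looks at the token immediately before a sentence-final period, so abbreviations followed by "?" or "!" would need a separate rule.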

nschneid commented 2 years ago

End-of-sentence punctuation has an effect on the interpretation of the full sentence, whereas chopping off a period from one word is a minor inconvenience to that one word (which in this case could also be spelled without periods: USA). But I think it is worth ensuring that the lemma isn't missing a final period. It may also be worth signalling the token anomaly in a MISC feature, like we signal SpaceAfter=No. Maybe SharedPunctAfter=Yes?
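As a rough sketch of what that could look like, the Python below restores the final period in the lemma and adds the feature to MISC. SharedPunctAfter=Yes is only the name floated above, not an existing UD feature, and the helper names are hypothetical. Sorting the MISC entries is a choice made here for stable output.

```python
def add_misc_feature(misc, feature):
    """Add a Key=Value entry to a CoNLL-U MISC string ("_" means empty)."""
    feats = [] if misc == "_" else misc.split("|")
    if feature not in feats:
        feats.append(feature)
    return "|".join(sorted(feats))


def mark_shared_period(abbrev_row):
    """Return a copy of a 10-column row with the lemma's final period restored
    and the hypothetical SharedPunctAfter=Yes feature added to MISC."""
    row = list(abbrev_row)
    if not row[2].endswith("."):
        row[2] = row[2] + "."  # LEMMA: U.S.A -> U.S.A.
    row[9] = add_misc_feature(row[9], "SharedPunctAfter=Yes")
    return row


row = ["15", "U.S.A", "U.S.A", "PROPN", "NNP", "Number=Sing", "4", "list", "4:list", "SpaceAfter=No"]
print("\t".join(mark_shared_period(row)))
```

The FORM column is left untouched, so the tokens still concatenate to the original text.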

kanayamah commented 2 years ago

Thank you, that's the point. Explicitly annotating "shared punctuation" is a good idea.

But splitting does not look good in this case... (though this sentence itself is nonsense)

# sent_id = weblog-blogspot.com_marketview_20060625150800_ENG_20060625_150800-0011
# text = i.e.
1   i.e i.e X   FW  _   0   root    0:root  SpaceAfter=No
2   .   .   PUNCT   .   _   1   punct   1:punct _

amir-zeldes commented 2 years ago

That's interesting, I never noticed that about EWT! I should say this is not the case in UD_English-GUM - if a sentence token ends in a period and it is sentence final, then there's simply no period token:

https://corpling.uis.georgetown.edu/annis/#_q=c190eXBlIF9yXyAvLipbQS1aYS16XS4qXC4v&_c=R1VN&cl=0&cr=0&s=0&l=10&o=random

If a sentence ends in "U.S.A.", and we all agree that the standard spelling of the acronym should include all three periods, then I interpret that to mean that in English, when a sentence ends with a word ending in '.', we basically have a convention that says that no period is added to such a sentence.

I guess my intuition is therefore that there is no shared period, since there is no token to assign that period to - we just don't place periods after words that end with that character.

nschneid commented 2 years ago

Agree that if "i.e." is the entire utterance, it's not really a sentence deserving a separate end-of-sentence period.

If we're dealing with well-edited text, though, it seems odd to treebank a sentence as not having any final punctuation just because it is orthographically shared with an abbreviation. If I read

I flew to the U.S.A. Then I flew back home.

I would interpret the first part prosodically the same as if it ended in a period token. It's just an orthographic exception that ".." is avoided (as are ".?" and ".!"). Whereas

I flew to the USA Then I flew back home.

would be considered poorly edited because of the lack of end punctuation before "Then", and

I flew to the USA then I flew back home.

could be considered a single sentence.

I don't know if this problem is frequent enough to justify a whole new feature though. Especially since many genres of text do not abide by the normative orthographic rules.

nschneid commented 2 years ago

OK I thought of a solution that doesn't require a new feature: a multi-word token!

4   the the
5-6 U.S.A.  _
5   U.S.A.  U.S.A.
6   .   .

Though we usually use multi-word tokens for systematically merged elements due to grammar (e.g. possessive clitics), I wonder if they should also be used for systematically merged elements due to orthography when SpaceAfter=No is insufficient because there is sharing of characters.

kanayamah commented 2 years ago

@nschneid thank you for providing a solution. Let me confirm: would you really represent it as

5-6 U.S.A.  _
5   U.S.A.  U.S.A.
6   .   .

rather than below? (see the FORM of U.S.A).

5-6 U.S.A.  _
5   U.S.A   U.S.A.
6   .   .

My concerns are:

nschneid commented 2 years ago

The concatenation requirement does not hold WITHIN a multiword token. The forms of the subparts can be whatever we want them to be. Hence my proposal.

Correct, no need for SpaceAfter=No within an MWT because it contains no internal spaces by definition.
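For illustration, a minimal Python sketch of how a reader could rebuild the sentence text under this proposal: the FORM of the range line (5-6) is used and the words it covers are skipped, so the two periods in the word-level forms never both reach the output. The function name and the simplified (ID, FORM, MISC) rows are assumptions, and empty nodes (IDs like 8.1) are not handled.

```python
def surface_text(rows):
    """Rebuild sentence text from (ID, FORM, MISC) rows, letting a multiword
    token range such as "5-6" stand in for the words it covers."""
    pieces = []
    skip_until = 0
    for tok_id, form, misc in rows:
        if "-" in tok_id:
            # Range line: use its FORM and remember which word IDs it covers.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            # Word inside the current multiword token: already represented.
            continue
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


rows = [
    ("1", "I", "_"),
    ("2", "flew", "_"),
    ("3", "to", "_"),
    ("4", "the", "_"),
    ("5-6", "U.S.A.", "_"),  # orthographic level: one shared period
    ("5", "U.S.A.", "_"),    # syntactic word, spelled with its own period
    ("6", ".", "_"),         # end-of-sentence punctuation
]

print(surface_text(rows))
```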

The MWT device was partially motivated by phenomena in other languages which are decidedly nonconcatenative contractions, e.g. French du = de + le. I'm not sure how tokenization algorithms handle such cases (perhaps they treat du as one token and ignore the subparts).

I don't know if other treebanks use MWT for characters that are shared between a word and non-word punctuation. I admit it may be a bit weird. OTOH it seems that different people have different intuitions regarding the best tokenization if formulated strictly as splitting the input sentence, and this is a way to capture both levels—the syntactic-functional level in which there is end punctuation as well as "U.S.A." being spelled with a period at the end, and the orthographic level in which there is just one "." character.

amir-zeldes commented 2 years ago

I agree with @kanayamah - for such a marginal phenomenon, I don't think it's worth it to ruin the neat concatenability of English tokenization as currently defined (this would have serious consequences for the kinds of tools that can be used to tokenize English).
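A small Python sketch of the concatenability property in question, under the assumption that a tool only sees the word-level forms and ignores MWT range lines (the function name is hypothetical):

```python
def is_concatenative(text, forms):
    """True if the forms appear in order in text with only whitespace between them."""
    pos = 0
    for form in forms:
        # Skip whitespace between tokens.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if not text.startswith(form, pos):
            return False
        pos += len(form)
    # Nothing but whitespace may remain after the last form.
    return text[pos:].strip() == ""


text = "I flew to the U.S.A."
prefix = ["I", "flew", "to", "the"]

print(is_concatenative(text, prefix + ["U.S.A", "."]))   # current EWT split
print(is_concatenative(text, prefix + ["U.S.A."]))       # GUM-style, no period token
print(is_concatenative(text, prefix + ["U.S.A.", "."]))  # word forms under the MWT proposal
```

Only the last call fails, because the word forms contain one more period than the text does.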

And TBH I am still not convinced there is really a period "there". What would happen if such an abbreviation occurs at the end of a sequence that may or may not have a period, such as a heading or a bullet point, which sometimes appears with one and sometimes without? How do we "know" that it's there? One of the nice things about dependencies is that there are no traces, non-terminals or other debatable structures; aside from the relations, it's largely WYSIWYG, and I wouldn't want to speculate too much about which periods exist that do not actually appear in the text.