Open kanayamah opened 2 years ago
End-of-sentence punctuation has an effect on the interpretation of the full sentence, whereas chopping off a period from one word is a minor inconvenience to that one word (which in this case could also be spelled without periods: USA). But I think it is worth ensuring that the lemma isn't missing a final period. It may also be worth signalling the token anomaly in a MISC feature, like we signal SpaceAfter=No. Maybe SharedPunctAfter=Yes?
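A minimal sketch of how such cases could be located, so that the lemma can be checked and a MISC feature added if one is adopted. It assumes whitespace-separated 10-column CoNLL-U token lines with no spaces inside fields; the function name shared_period_candidates is hypothetical, and the "abbreviation-like" test (an internal period but no final one) is only a heuristic.

```python
def shared_period_candidates(conllu_lines):
    """Return (id, form, lemma, lemma_has_final_period) for tokens that
    look like abbreviations whose final period was split off as PUNCT."""
    rows = []
    for line in conllu_lines:
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) == 10:  # assumption: well-formed 10-column token lines
            rows.append(cols)

    hits = []
    for cur, nxt in zip(rows, rows[1:]):
        tok_id, form, lemma, misc = cur[0], cur[1], cur[2], cur[9]
        # Heuristic: the form contains a period but does not end in one.
        abbreviation_like = "." in form and not form.endswith(".")
        glued = "SpaceAfter=No" in misc.split("|")
        next_is_period = nxt[1] == "." and nxt[3] == "PUNCT"
        if abbreviation_like and glued and next_is_period:
            hits.append((tok_id, form, lemma, lemma.endswith(".")))
    return hits
```

A hit whose last element is False is a token whose lemma is also missing the final period.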
Thank you, that's the point. Explicitly annotating "shared punctuation" is a good idea.
But splitting does not look good in this case... (though this sentence itself is nonsense)
# sent_id = weblog-blogspot.com_marketview_20060625150800_ENG_20060625_150800-0011
# text = i.e.
1 i.e i.e X FW _ 0 root 0:root SpaceAfter=No
2 . . PUNCT . _ 1 punct 1:punct _
That's interesting, I never noticed that about EWT! I should say this is not the case in UD_English-GUM - if a sentence token ends in a period and it is sentence final, then there's simply no period token:
If a sentence ends in "U.S.A.", and we all agree that the standard spelling of the acronym should include all three periods, then I interpret that to mean that in English, when a sentence ends with a word ending in '.', we basically have a convention that says that no period is added to such a sentence.
I guess my intuition is therefore that there is no shared period, since there is no token to assign that period to - we just don't place periods after words that end with that character.
Agree that if "i.e." is the entire utterance, it's not really a sentence deserving a separate end-of-sentence period.
If we're dealing with well-edited text, though, it seems odd to treebank a sentence as not having any final punctuation just because it is orthographically shared with an abbreviation. If I read
I flew to the U.S.A. Then I flew back home.
I would interpret the first part prosodically the same as if it ended in a period token. It's just an orthographic exception that ".." is avoided (as are ".?" and ".!"). Whereas
I flew to the USA Then I flew back home.
would be considered poorly edited because of the lack of end punctuation before "Then", and
I flew to the USA then I flew back home.
could be considered a single sentence.
I don't know if this problem is frequent enough to justify a whole new feature though. Especially since many genres of text do not abide by the normative orthographic rules.
OK I thought of a solution that doesn't require a new feature: a multi-word token!
4 the the
5-6 U.S.A. _
5 U.S.A. U.S.A.
6 . .
Though we usually use multi-word tokens for systematically merged elements due to grammar (e.g. possessive clitics), I wonder if they should also be used for systematically merged elements due to orthography when SpaceAfter=No is insufficient because there is sharing of characters.
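To make the two levels of this proposal concrete, here is a small sketch that reads (ID, FORM) pairs like the ones above and returns the surface tokens and the syntactic words separately. The function names are hypothetical, only the ID and FORM columns are modelled, and empty nodes (IDs such as 5.1) are assumed not to occur.

```python
def surface_tokens(rows):
    """Surface forms: use the range line of a multi-word token and skip
    the syntactic words it covers."""
    out = []
    covered_until = 0
    for tok_id, form in rows:
        if "-" in tok_id:
            # Range line such as "5-6": its FORM is what appears in the text.
            _, end = tok_id.split("-")
            out.append(form)
            covered_until = int(end)
        elif int(tok_id) > covered_until:
            out.append(form)
    return out


def syntactic_words(rows):
    """Syntactic words: every line except the range lines."""
    return [form for tok_id, form in rows if "-" not in tok_id]


# The rows from the proposal above.
rows = [("4", "the"), ("5-6", "U.S.A."), ("5", "U.S.A."), ("6", ".")]
print(surface_tokens(rows))   # ['the', 'U.S.A.']
print(syntactic_words(rows))  # ['the', 'U.S.A.', '.']
```

The text contains a single final "." while the word level has both the full abbreviation and a separate period, so the subparts of the multi-word token do not concatenate to its surface form.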
@nschneid thank you for providing a solution. Let me confirm: do you really represent it as
5-6 U.S.A. _
5 U.S.A. U.S.A.
6 . .
rather than below? (see the FORM of U.S.A).
5-6 U.S.A. _
5 U.S.A U.S.A.
6 . .
My concerns are:
DONt -> DO + Nt, gotta -> got + ta).
SpaceAfter=No cannot be used for constituent words in a MWT, can it?
U.S.A. is overestimated in benchmarking.
The concatenation requirement does not hold WITHIN a multiword token. The forms of the subparts can be whatever we want them to be. Hence my proposal.
Correct, no need for SpaceAfter=No within an MWT because it contains no internal spaces by definition.
The MWT device was partially motivated by phenomena in other languages which are decidedly nonconcatenative contractions, e.g. French du = de + le. I'm not sure how tokenization algorithms handle such cases (perhaps they treat du as one token and ignore the subparts).
I don't know if other treebanks use MWT for characters that are shared between a word and non-word punctuation. I admit it may be a bit weird. OTOH it seems that different people have different intuitions regarding the best tokenization if formulated strictly as splitting the input sentence, and this is a way to capture both levels—the syntactic-functional level in which there is end punctuation as well as "U.S.A." being spelled with a period at the end, and the orthographic level in which there is just one "." character.
I agree with @kanayamah - for such a marginal phenomenon, I don't think it's worth it to ruin the neat concatenability of English tokenization as currently defined (this would have serious consequences for the kinds of tools that can be used to tokenize English).
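The concatenability property mentioned here can be sketched as a simple check: joining the token forms, with a space after every token not marked SpaceAfter=No, should reproduce the sentence text. The function name detokenize is hypothetical, and MISC is reduced to a plain string for illustration.

```python
def detokenize(tokens):
    """Join (form, misc) pairs, omitting the space after any token whose
    MISC column contains SpaceAfter=No."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


# EWT-style split: the final period is its own token.
ewt_style = [("i.e", "SpaceAfter=No"), (".", "_")]
# GUM-style: the abbreviation keeps its period and no period token follows.
gum_style = [("i.e.", "_")]
# Full abbreviation form plus a separate period token.
doubled = [("U.S.A.", "SpaceAfter=No"), (".", "_")]

print(detokenize(ewt_style))  # i.e.
print(detokenize(gum_style))  # i.e.
print(detokenize(doubled))    # U.S.A..
```

The first two analyses reproduce the text by plain concatenation; the third does not, which is why giving the word its full form alongside a period token requires something beyond SpaceAfter=No.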
And TBH I am still not convinced there is really a period "there". What would happen if such an abbreviation occurs at the end of a sequence that may or may not have a period, such as a heading or a bullet point, which sometimes appear with a period and sometimes without? How do we "know" that it's there? One of the nice things about dependencies is that there are no traces, non-terminals or other debatable structures; aside from the relations, it's largely WYSIWYG, and I wouldn't want to speculate too much about which periods exist that do not actually appear in the text.
When an abbreviation with period (e.g. U.S.A.) appears at the end of a sentence, the last period is regarded as the sentence-end punctuation. But the word U.S.A looks awkward. Shouldn't it be tokenized as U.S.A. assuming the punctuation is omitted? Similar cases are p.m, C.V, L.L.C, i.e and U.S. http://match.grew.fr/?corpus=UD_English-EWT@2.9&custom=61eaeda1b353d