Long-distance attachment of punctuation

kanayamah commented 1 year ago

The attachment of commas (,) was changed in this commit. Some of the changes are good, but I found many confusing cases, and the learnability for parsers is reduced compared to UD v2.10.

For example, I think the comma (Word 3) should be attached to Word 2, as originally annotated.

# sent_id = train-s31
# text = 잉글우드에 묻혔고, 네 명의 자녀를 남겼다.
# translit = .ing.geul.u.deu.e .mud.hyeoss.go, .ne .myeong.yi .ja.nyeo.reul .nam.gyeoss.da.
1   잉글우드에   잉글우드+에  ADV NNP+JKB _   2   obl
2   묻혔고 묻히+었+고  VERB    VV+EP+EC    _   0   root
3   ,   ,   PUNCT   SP  _   7   punct
4   네   네   NUM MM  NumType=Card    5   nummod
5   명의  명+의 NOUN    NNB+JKG _   6   nmod:poss
6   자녀를 자녀+를    NOUN    NNG+JKO _   7   obj
7   남겼다 남기+었+다  VERB    VV+EP+EF    _   2   conj
8   .   .   PUNCT   SF  _   7   punct

Many other similar cases are found in train-s37, train-s38, etc.
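
To make "long-distance" concrete, here is a minimal Python sketch that lists punctuation tokens whose head is not adjacent. The tuples are copied by hand from train-s31 above; the function names and the `min_distance` threshold are my own assumptions, not part of any UD tool.

```python
# (id, form, head, deprel) for train-s31, as annotated after the change.
TOKENS = [
    (1, "잉글우드에", 2, "obl"),
    (2, "묻혔고", 0, "root"),
    (3, ",", 7, "punct"),
    (4, "네", 5, "nummod"),
    (5, "명의", 6, "nmod:poss"),
    (6, "자녀를", 7, "obj"),
    (7, "남겼다", 2, "conj"),
    (8, ".", 7, "punct"),
]


def punct_distances(tokens):
    """Return (id, form, head, distance) for every token with deprel punct."""
    return [
        (tid, form, head, abs(head - tid))
        for tid, form, head, deprel in tokens
        if deprel == "punct"
    ]


def long_distance_punct(tokens, min_distance=2):
    """Keep only punctuation whose head is at least min_distance words away.

    min_distance=2 is an arbitrary illustrative threshold.
    """
    return [p for p in punct_distances(tokens) if p[3] >= min_distance]


print(long_distance_punct(TOKENS))
```

For this sentence only the comma is reported: its head (word 7) is four words away, while the final period sits next to its head.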

martinpopel commented 1 year ago

The UD punctuation guidelines are very clear in this aspect:

A punctuation mark separating coordinated units is attached to the immediately following conjunct.

So the comma must be attached to word 7 (which is the head of the immediately following conjunct); there is no other option. What do you mean by "learnability of parser"? The UD punctuation guidelines can be converted to an algorithm that fixes the attachment of punctuation. If the training data consistently follows the UD punctuation guidelines (e.g. by fixing it by udapy ud.FixPunct < in.conllu > fixed.conllu), modern parsers will learn these rules easily. If the test data follows these guidelines as well, there should be no errors in punctuation in the parser output.
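
As a rough illustration of how the quoted coordination rule can be turned into an algorithm, here is a simplified Python sketch. It is not the implementation of Udapi's ud.FixPunct, which covers more cases; the function name and the tuple layout are assumptions made for this example, and nested coordination is not handled.

```python
# (id, form, head, deprel) for train-s31 with the comma attached to word 2,
# i.e. the attachment the issue describes as the original annotation.
TOKENS_BEFORE = [
    (1, "잉글우드에", 2, "obl"),
    (2, "묻혔고", 0, "root"),
    (3, ",", 2, "punct"),
    (4, "네", 5, "nummod"),
    (5, "명의", 6, "nmod:poss"),
    (6, "자녀를", 7, "obj"),
    (7, "남겼다", 2, "conj"),
    (8, ".", 7, "punct"),
]


def attach_coordination_commas(tokens):
    """Reattach each comma that separates coordinated units.

    Simplification: a comma is taken to separate coordinated units when a
    later token is a conj whose head precedes the comma. The comma is then
    attached to the nearest such conj. All other tokens are left unchanged.
    """
    fixed = []
    for tid, form, head, deprel in tokens:
        if form == "," and deprel == "punct":
            for later_id, _, later_head, later_deprel in tokens:
                if later_id > tid and later_deprel == "conj" and later_head < tid:
                    head = later_id  # head of the immediately following conjunct
                    break
        fixed.append((tid, form, head, deprel))
    return fixed


for token in attach_coordination_commas(TOKENS_BEFORE):
    print(*token, sep="\t")
```

On this sentence the sketch moves the comma from word 2 to word 7 and leaves everything else as it was.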

dan-zeman commented 1 year ago

I confirm that Udapi's ud.FixPunct was used when preparing the commit referred to above by @kanayamah.

kanayamah commented 1 year ago

@martinpopel @dan-zeman Thank you for your answers. I understand the principles of punctuation regarding coordination.

Particularly in head-final languages, these structures may be counterintuitive, since the comma is regarded as part of the preceding word. I think this is related to the coordination orientation discussed in this issue and our paper.

kanayamah commented 1 year ago

@martinpopel The relationship VERB+고 <-> VERB can be analyzed as either advcl or conj, and the distinction between them is subtle. Suppose verbs A and B are words 5 and 10, and a comma follows verb A. If A and B are in an advcl relationship, the structure looks like this:

5   A고  A+고 VERB    10  advcl
6   ,   ,   PUNCT   5   punct
...
10  B   B   VERB    0   root

On the other hand, if they are regarded as coordination, it forms a totally different structure due to the left-head coordination principle:

5   A고  A+고 VERB    0   root
6   ,   ,   PUNCT   10  punct
...
10  B   B   VERB    5   conj

and the recent change (in which the head of the punct was moved from 5 to 10) makes the comma's attachment harder to predict. That is why I mentioned learnability, which has already been discussed in the paper on coordination in head-final languages.
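
To spell out the argument, the small Python sketch below compares two attachment conventions on the A/B example. Both functions are hypothetical simplifications written for this comment: one always attaches the comma to the preceding word, the other attaches it to the following conjunct when there is one and to the preceding word otherwise.

```python
# (id, head, deprel) for the two analyses of "A고 , ... B".
# The comma's own head is left as None because it is what we want to predict.
ADVCL = [(5, 10, "advcl"), (6, None, "punct"), (10, 0, "root")]
CONJ = [(5, 0, "root"), (6, None, "punct"), (10, 5, "conj")]


def head_preceding_word(comma_id, tokens):
    """Hypothetical convention: the comma belongs to the word before it."""
    return comma_id - 1


def head_following_conjunct(comma_id, tokens):
    """Simplified coordination rule: prefer the nearest following conjunct.

    Falls back to the preceding word when no later conj has its head
    before the comma (the advcl case in this example).
    """
    for tid, head, deprel in tokens:
        if tid > comma_id and deprel == "conj" and head < comma_id:
            return tid
    return comma_id - 1


for name, tokens in (("advcl", ADVCL), ("conj", CONJ)):
    print(name, head_preceding_word(6, tokens), head_following_conjunct(6, tokens))
```

Under the first convention the comma's head is 5 in both analyses. Under the second it is 5 for advcl and 10 for conj, so an error in the advcl/conj decision also produces an error on the comma.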

dan-zeman commented 1 year ago

Relationship VERB+고 <-> VERB can be recognized as advcl or conj and the distinction between them is subtle.

This is interesting. Are there grammatical tests that would decide between advcl and conj?

UniversalDependencies / UD_Korean-GSD

Long-distance attachment of punctuation #5