apertium / apertium-kaz

Apertium linguistic data for Kazakh
https://apertium.github.io/apertium-kaz/
GNU General Public License v3.0

Ud kazakh ktb v2.7 #17

Open IlnarSelimcan opened 4 years ago

IlnarSelimcan commented 4 years ago

https://github.com/apertium/apertium-kaz/pull/16 conflates changes to the Constraint Grammar with corrections to the UD treebank. To make merging easier and faster, I decided to split the latter out into a separate PR.

I worked on the CoNLL-U file directly. The changes to it will have to be "backpropagated" into the tagged.txt files.
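To see which tagged.txt files a set of edited sentences maps back to, the sentence IDs can be grouped by their file-name prefix. This is a minimal sketch, assuming the `# sent_id` values have the layout `<file>:<number>:<number>`, as the validator messages quoted in this thread suggest; the meaning of the two numbers is not assumed, and `group_by_source` is a hypothetical helper name.

```python
from collections import defaultdict

PREFIX = "# sent_id = "

def group_by_source(conllu_text):
    """Map each source file name to the sent_ids that came from it."""
    groups = defaultdict(list)
    for line in conllu_text.splitlines():
        if line.startswith(PREFIX):
            sent_id = line[len(PREFIX):].strip()
            # Assumed layout: <file>:<number>:<number>; keep only the file part.
            source = sent_id.rsplit(":", 2)[0]
            groups[source].append(sent_id)
    return dict(groups)

# Sentence IDs taken from the validator messages quoted in this thread.
sample = """# sent_id = akorda-random.tagged.txt:8:120
# sent_id = udhr.tagged.txt:7:305
# sent_id = akorda-random.tagged.txt:164:2942
"""

for source, ids in group_by_source(sample).items():
    print(source, len(ids))
```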

A brief description of the changes.

  1. dependents of the predicate which were labeled nmod were changed to obl, as they should be according to version 2 of the annotation guidelines:

The nmod relation, which in v1 was used for nominals modifying either predicates or other nominals, is in v2 restricted to modifying nominals. A new relation obl (oblique) is introduced for oblique dependents of predicates. (https://universaldependencies.org/v2/summary.html)
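The relabeling rule can be illustrated with a small script. This is only a sketch under a simplifying assumption: a head counts as a predicate when its UPOS is VERB or AUX, which misses nominal predicates, so it is not a substitute for the manual review described here. The function name and the toy sentence are made up for illustration.

```python
PREDICATE_UPOS = {"VERB", "AUX"}  # assumption: crude stand-in for "is a predicate"

def relabel_nmod(lines):
    """Return CoNLL-U lines with nmod -> obl where the head looks like a predicate."""
    # First pass: remember the UPOS (column 4) of every token by its ID.
    upos_by_id = {}
    for line in lines:
        if line and not line.startswith("#"):
            cols = line.split("\t")
            upos_by_id[cols[0]] = cols[3]
    # Second pass: rewrite DEPREL (column 8) where HEAD (column 7) is a predicate.
    out = []
    for line in lines:
        if line and not line.startswith("#"):
            cols = line.split("\t")
            if cols[7] == "nmod" and upos_by_id.get(cols[6]) in PREDICATE_UPOS:
                cols[7] = "obl"
            line = "\t".join(cols)
        out.append(line)
    return out

# Made-up three-token sentence with placeholder forms, not from the treebank.
sample = [
    "# text = A B C",
    "1\tA\ta\tNOUN\t_\t_\t2\tnmod\t_\t_",
    "2\tB\tb\tNOUN\t_\t_\t3\tnmod\t_\t_",
    "3\tC\tc\tVERB\t_\t_\t0\troot\t_\t_",
]
for row in relabel_nmod(sample):
    print(row)
```

Token 1 modifies a noun and keeps nmod; token 2 depends on the verb and becomes obl. Subtyped relations such as nmod:poss are left alone because only the exact label nmod is matched.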

  2. punctuation was re-attached projectively:

Coordinating conjunctions (cc) and punctuation (punct) inside coordinated structures are in v2 attached to the immediately succeeding conjunct (instead of the first conjunct as in v1). (https://universaldependencies.org/v2/summary.html)
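One way to find punctuation that still needs re-attaching is to test each punct arc for projectivity: an arc is projective when every token between the head and the dependent is dominated by the head. The sketch below implements that textbook definition on a bare list of (id, head, deprel) triples; it is an illustration, not the check that validate.py performs, and the function names and toy trees are hypothetical.

```python
def dominated_by(node, ancestor, heads):
    """True if `ancestor` is `node` itself or lies on the path from `node` to the root."""
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        if node == ancestor:
            return True
        seen.add(node)
        node = heads[node]
    return False

def nonprojective_punct(tokens):
    """tokens: list of (id, head, deprel). Return ids of punct tokens
    whose arc to their head is non-projective."""
    heads = {tid: head for tid, head, _ in tokens}
    flagged = []
    for tid, head, rel in tokens:
        if rel != "punct" or head == 0:
            continue
        lo, hi = sorted((tid, head))
        # Every token strictly between head and dependent must be under the head.
        if any(not dominated_by(k, head, heads) for k in range(lo + 1, hi)):
            flagged.append(tid)
    return flagged

# Toy tree: token 2 is punctuation attached to token 4, across the root (token 3).
before = [(1, 3, "nsubj"), (2, 4, "punct"), (3, 0, "root"), (4, 3, "conj")]
# Same tree with the punctuation re-attached to the root.
after = [(1, 3, "nsubj"), (2, 3, "punct"), (3, 0, "root"), (4, 3, "conj")]

print(nonprojective_punct(before))
print(nonprojective_punct(after))
```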

[Line 1935 Sent akorda-random.tagged.txt:164:2942 Node 5]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'

The majority of the validation errors were like the following:

[Line 115 Sent akorda-random.tagged.txt:8:120 Node 11]: [L5 Morpho aux-lemma] 'отыр' is not an auxiliary verb in language [kk]

which was due to the incompleteness or outdatedness of the language-specific documentation rather than to issues with the treebank itself. (It turns out that these language-specific lists of auxiliaries and copulas are kept in the validation script itself; a pull request has been made to it, see below.)

Note that the treebank does not yet fully validate: about 20 issues remain.
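To see which kinds of errors dominate, the validator output can be tallied by test ID. A minimal sketch, assuming every message follows the layout of the two examples quoted above; the regular expression is inferred from those examples, not taken from the validator's source.

```python
import re
from collections import Counter

# Pattern inferred from the messages quoted in this thread (an assumption).
MESSAGE = re.compile(
    r"^\[Line (?P<line>\d+) Sent (?P<sent>\S+) Node (?P<node>\d+)\]: "
    r"\[L(?P<level>\d+) (?P<cls>\w+) (?P<test>[\w-]+)\] (?P<text>.*)$"
)

def tally(messages):
    """Count validator messages per test ID, skipping lines that do not match."""
    counts = Counter()
    for msg in messages:
        match = MESSAGE.match(msg.strip())
        if match:
            counts[match.group("test")] += 1
    return counts

sample = [
    "[Line 1935 Sent akorda-random.tagged.txt:164:2942 Node 5]: "
    "[L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'",
    "[Line 115 Sent akorda-random.tagged.txt:8:120 Node 11]: "
    "[L5 Morpho aux-lemma] 'отыр' is not an auxiliary verb in language [kk]",
]

for test_id, count in tally(sample).most_common():
    print(test_id, count)
```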

I've double-checked my own changes by going over https://github.com/apertium/apertium-kaz/pull/17/files with the before and after versions of the treebank open in UD-Annotatrix. Please let me know if you think that I've made things worse, especially if you notice that I made an error consistently.

Reviewing this PR carefully is likely to take three to four full working days; that is how long double-checking my own changes took me.

I really hope that the next release of UD (scheduled for November 15, data freeze is on November 1) will include this new version. What remains to be done:

  1. Making sure that the treebank validates against validate.py.
  2. Updating the documentation of the treebank.
    • at http://taruen.com/apertium-kaz/ (Section 7.1. Open questions about Kazakh UD) I tried to keep track of the issues which need to be discussed, but keep in mind that those notes were taken "in the heat of annotating" and that most of them at the moment are probably too brief to be discussable and thus will have to be re-checked by me first and turned into a more tangible form.
    • note that the paper "An Assessment of Universal Dependency Annotation Guidelines for Turkic Languages" (2017) contains some more info / specific tests on cases marked as being discussed in the language-specific documentation
  3. There seemed to be some more sentences in the UD_KTB repo, those should be validated too.
  4. Auxiliaries need to be added to validate.py.
    • https://github.com/UniversalDependencies/tools/pull/69 does just that.
    • Note that the verb "digging" in the sentence "The boss said to start digging" in the guidelines is labeled as xcomp of "start", whereas in Kazakh we treat баста in -A.<gna_impf> баста- as an aux. Also, unlike all other verbs in the above pull request, баста is not listed among the auxiliary verbs in "Kazakh: A Comprehensive Grammar", although it probably should be handled as one, as we currently do.
    • [Screenshot from 2020-10-13 01-01-14] Source: Оразбаева, Ф.Ш., Г. Сағидолда, Б. Қасым, А. Қобыланова, Қ. Есенова, Ұ. Исабекова, Қ. Қасабек, Ж. Балтабаев, Қ. Мұхамади, Р. Рахметова & Ж. Көпбаева. 2012. Қазіргі қазақ тілі. Алматы: Нур-Принт.
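The two auxiliary-related checks mentioned in this thread (aux-lemma and rel-upos-aux) can be approximated as follows. This is a sketch only: `AUX_LEMMAS` is an illustrative two-item subset, not the list submitted in the pull request, the sample tokens are made up, and the exact conditions under which validate.py fires each test are not reproduced here.

```python
# Illustrative subset only; the real list lives in the validation script.
AUX_LEMMAS = {"отыр", "баста"}

def aux_issues(tokens, allowed=AUX_LEMMAS):
    """tokens: list of dicts with keys id, lemma, upos, deprel.
    Return (id, test_id) pairs for tokens attached as aux that look suspicious."""
    issues = []
    for tok in tokens:
        if tok["deprel"] != "aux":
            continue
        if tok["upos"] != "AUX":
            issues.append((tok["id"], "rel-upos-aux"))
        if tok["lemma"] not in allowed:
            issues.append((tok["id"], "aux-lemma"))
    return issues

# Made-up tokens for illustration; not copied from the treebank.
sample = [
    {"id": 4, "lemma": "баста", "upos": "AUX", "deprel": "aux"},   # passes both checks
    {"id": 5, "lemma": "атан", "upos": "AUX", "deprel": "aux"},    # lemma not in list
    {"id": 12, "lemma": "отыр", "upos": "VERB", "deprel": "aux"},  # wrong UPOS
]

print(aux_issues(sample))
```

Keeping the lemma list as a parameter makes it easy to rerun the check with a longer list once the language-specific documentation is updated.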

IlnarSelimcan commented 4 years ago

https://github.com/taruen/ud-tools/commit/d0819a0295304bb1d7e69c2489e5e37c1fe2f206 solves most of the validation issues related to auxiliaries except for the following:

[Line 7883 Sent udhr.tagged.txt:7:305 Node 8]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (8:болса:cop --> 9:да:advmod)
[Line 10375 Sent Иран.tagged.txt:52:1481 Node 12]: [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
[Line 12859 Sent wikipedia.tagged.txt:71:1163 Node 5]: [L5 Morpho aux-lemma] 'атан' is not an auxiliary verb in language [kk]
[Line 13815 Sent Шымкент.tagged.txt:6:156 Node 2]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (2:болғанда:cop --> 3:да:advmod)

атан at line 12859 looks like a main verb.
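The remaining leaf-aux-cop cases can be located by listing every token whose head is itself attached as aux or cop. A rough sketch: it flags any child at all, whereas the actual validator may exempt some relations, and the head of болса in the example is a placeholder because the message does not show it.

```python
def cop_aux_with_children(tokens):
    """tokens: list of (id, form, head, deprel).
    Return (parent_id, parent_form, parent_rel, child_id, child_form, child_rel)
    for every token whose head is attached as aux or cop."""
    by_id = {tok[0]: tok for tok in tokens}
    found = []
    for tid, form, head, rel in tokens:
        parent = by_id.get(head)
        if parent is not None and parent[3] in {"aux", "cop"}:
            found.append((parent[0], parent[1], parent[3], tid, form, rel))
    return found

# Tokens 8 and 9 follow the udhr message above; token 7 is a placeholder head.
sample = [
    (7, "_", 0, "root"),
    (8, "болса", 7, "cop"),
    (9, "да", 8, "advmod"),
]

for hit in cop_aux_with_children(sample):
    print(hit)
```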