apertium / apertium-kaz

Apertium linguistic data for Kazakh
https://apertium.github.io/apertium-kaz/
GNU General Public License v3.0

[puupankki.conllu] мың миллион миллиард млн млрд трлн are inconsistent #24

Open IlnarSelimcan opened 3 years ago

IlnarSelimcan commented 3 years ago

General context: https://github.com/apertium/apertium-kaz/pull/17

Actually several related issues:

  1. мың and миллиард are NUM num everywhere, while миллион in some cases is NUM num, and in others NOUN n.

  2. млрд. and трлн. are NOUN abbr everywhere, while млн. is in some cases tagged as NUM num, in others as NOUN abbr.

  3. (a)

    4   2   2   NUM num NumType=Card    5   compound    _   _
    5   миллиард    миллиард    NUM num NumType=Card    6   compound    _   _
    6   300 300 NUM num NumType=Card    7   compound    _   _
    7   миллион миллион NUM num NumType=Card    8   nummod  _   _
    8   теңгеден    теңге   NOUN    n   Case=Abl    10  nmod    _   _
    9   астам   астам   ADJ adj _   10  amod    _   _
    10  қаржы   қаржы   NOUN    n   Case=Nom    11  obj _   _

vs (b)

    3   4,3 4,3 NUM num NumType=Card    4   nummod  _   _
    4   мыңнан  мың NUM num Case=Abl|NumType=Card,Ord   6   nmod    _   _
    5   астам   астам   ADJ adj _   6   amod    _   _
    6   шақырымды   шақырым NOUN    n   Case=Acc    7   obj _   _
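
To see how widespread the tag inconsistencies in items 1 and 2 are, one could count the (UPOS, XPOS) pairs per lemma. The following is only a minimal sketch: the function names `tag_counts` and `inconsistent` are made up for illustration, and it assumes the treebank file is standard tab-separated CoNLL-U and that the abbreviations may appear with or without the trailing dot in the lemma column.

```python
from collections import Counter, defaultdict

# Lemmas discussed in this issue. The trailing dot of abbreviations is
# stripped before lookup (an assumption about how lemmas are stored).
TARGETS = {"мың", "миллион", "миллиард", "млн", "млрд", "трлн"}


def tag_counts(conllu_text, targets=TARGETS):
    """Return {lemma: Counter({(UPOS, XPOS): n})} for the target lemmas."""
    counts = defaultdict(Counter)
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only ordinary token lines: 10 columns and a plain integer ID
        # (this skips multiword-token ranges such as 1-2 and empty nodes).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma = cols[2].lower().rstrip(".")
        if lemma in targets:
            counts[lemma][(cols[3], cols[4])] += 1
    return counts


def inconsistent(counts):
    """Keep only the lemmas that occur with more than one (UPOS, XPOS) pair."""
    return {lemma: dict(c) for lemma, c in counts.items() if len(c) > 1}
```

Applied to the text of puupankki.conllu, read with `encoding="utf-8"`, `inconsistent(tag_counts(text))` would list the lemmas that need attention together with how often each tag pair occurs.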

Hereby I suggest:

[quote https://universaldependencies.org/u/pos/all.html#sym-symbol]

Strings that consists entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10. Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometer), Dr (Doctor) should be tagged nouns. Acronyms for proper names such as UN and NATO should be tagged as proper nouns.

[unquote]

Beyond that, knowing the POS of the unabbreviated form is generally considered helpful for applications.

UPDATE: note that in UD there is the Abbr feature: https://universaldependencies.org/u/feat/Abbr.html
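
If the abbreviations were to receive the POS of their full form, the Abbr feature could keep the information that the token is abbreviated. A rough sketch of such a rewrite for a single CoNLL-U token line follows. The helper name `retag_abbreviation` is hypothetical, the default target UPOS of NUM is an assumption about the outcome of this discussion, and XPOS is left untouched unless a value is passed explicitly.

```python
def retag_abbreviation(line, upos="NUM", xpos=None):
    """Set UPOS (and optionally XPOS) and add Abbr=Yes to one token line."""
    cols = line.split("\t")
    if len(cols) != 10:
        raise ValueError("expected a 10-column CoNLL-U token line")
    # Parse the FEATS column; "_" means there are no features.
    feats = {}
    if cols[5] != "_":
        feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
    feats["Abbr"] = "Yes"
    cols[3] = upos
    if xpos is not None:
        cols[4] = xpos
    # CoNLL-U expects features sorted alphabetically, ignoring case.
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    cols[5] = "|".join(f"{name}={value}" for name, value in ordered)
    return "\t".join(cols)
```

Whether XPOS should stay abbr or become num is a separate decision for the treebank maintainers, which is why the sketch does not change it by default.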