apertium / apertium-turkic

For code, data and issues relating to all Turkic monolingual packages or Turkic-X translators.

how to handle да in Kyrgyz #17

Open jonorthwash opened 11 months ago

jonorthwash commented 11 months ago

In Kyrgyz, there are at least two uses of the stand-alone word да:

  1. a post-predicate "modal particle", meaning something like "the speaker is making a statement whose truth they believe to be evident to the interlocutor, but which needs to be stated in order to explain something else"
  2. a "postadverb" (except it comes after pretty much any non-verbal phrase) meaning something like "also"
  3. it can also translate to "even", but the distribution is about the same, so I'm not sure this is a different meaning from (2)
  4. it can mean "both ... and" when repeated, but that could be analysed as two postadverbs (2/3) rather than as correlative conjunctions, despite the semantic equivalence with the English correlative conjunction. Perhaps they are correlative postadverbs, with a literal gloss of roughly "even X even Y"? Then again, it does (semantically) conjoin parallel structures, so perhaps it is (syntactically) essentially a conjunction too?

So, currently the transducer has the following mapping from the above uses to tags:

  1. <mod_tru>
  2. <cnjadv>
  3. <postadv>
  4. <cnjcoo>

This situation is problem 1. I'd propose unifying these tags: simply <mod> for (1) (see #15), and <postadv> (or maybe <cnjadv>) for (2, 3, 4).
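To make the proposed unification concrete, here is a minimal sketch of the retagging as a string rewrite over analyses. The tag names come from the list above; the function name `retag_da`, the choice of <postadv> over <cnjadv> for uses (2, 3, 4), and the assumption that analyses are separated by `+`, `/` or `^` are illustrative only, not how the transducer itself would be changed.

```python
import re

# Proposed mapping from the current tags to the unified ones.
# Assumption: <postadv> is picked for uses (2, 3, 4); <cnjadv> is the alternative.
PROPOSED = {
    "mod_tru": "mod",
    "cnjadv": "postadv",
    "postadv": "postadv",
    "cnjcoo": "postadv",
}

# Match the lemma "да" only at the start of the string or right after a
# morpheme/analysis boundary, so a longer lemma ending in "да" is left alone.
_DA_TAG = re.compile(r"(^|[+/^])да<(mod_tru|cnjadv|postadv|cnjcoo)>")


def retag_da(analysis: str) -> str:
    """Rewrite the tag on stand-alone да according to PROPOSED (hypothetical helper)."""
    return _DA_TAG.sub(
        lambda m: f"{m.group(1)}да<{PROPOSED[m.group(2)]}>", analysis
    )
```

For example, `retag_da("да<cnjcoo>")` gives `да<postadv>`, and the <mod_tru> reading attached to a preceding constituent comes out with <mod> instead.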

Note that all of these are always "да", and the consonant and vowel don't interact phonologically with the preceding element as they do in Kazakh, Tatar, Turkish, etc.

Problem 2 is tokenisation.

Currently <cnjadv>, <cnjcoo>, and <postadv> are tokenised separately, while <mod_tru> is tokenised together with the previous constituent. This is intended to keep things parallel to Kazakh, where tokenising that way is necessary for morphophonological reasons. Note that there are a few other postadverbs too, some with a fairly different distribution.

However, because of greedy tokenisation, we usually only get one of the readings: бала да/бала<n><nom>+э<cop><aor><p3><sg>+да<mod_tru>
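The effect can be illustrated with a toy left-to-right longest-match tokeniser. This is only a sketch of the general idea, not the actual analyser: the lexicon entries for "бала" and "да" on their own are assumptions built from the tags listed above, and `greedy_analyse` is a hypothetical name.

```python
# Toy lexicon: surface string -> list of analyses.
# Only the "бала да" entry is taken from the example above; the other two
# entries are illustrative assumptions.
TOY_LEXICON = {
    "бала да": ["бала<n><nom>+э<cop><aor><p3><sg>+да<mod_tru>"],
    "бала": ["бала<n><nom>"],
    "да": ["да<cnjadv>", "да<postadv>", "да<cnjcoo>"],
}


def greedy_analyse(text, lexicon):
    """Left-to-right longest match over whitespace-separated words."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first and stop at the first hit.
        for j in range(len(words), i, -1):
            surface = " ".join(words[i:j])
            if surface in lexicon:
                out.append((surface, lexicon[surface]))
                i = j
                break
        else:
            # No span starting here is known: emit the word with no analyses.
            out.append((words[i], []))
            i += 1
    return out


result = greedy_analyse("бала да", TOY_LEXICON)
```

Because the two-word span matches first, `result` holds a single token carrying only the <mod_tru> reading; the separately tokenised readings of да are never consulted.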

Should (a) all uses be tokenised separately from the word before, or should (b) the syntax be encoded in the morphotactics, as in other Turkic transducers? Either would ideally make all analyses available; the difference is that (a) would need more difficult disambiguation, while (b) would need more difficult morphotactics.