ftyers / docs

Universal Dependencies online documentation
http://universaldependencies.github.io/docs/
Apache License 2.0

[ud] Release v1.3: pos mapping: cnjadv => SCONJ #20

Open makazhan opened 8 years ago

makazhan commented 8 years ago

Francis states that cnjadv maps to SCONJ in UD. I agree with this. Jonathan?

Tokenization: It seems to me that multi-word cnjadv are currently treated as a single token regardless of phonology, e.g. осы себептен (the same goes for conjunctions, e.g. не болмаса).

I really don't like this. I think that at least the phonology-insensitive cases should be tokenized separately. I understand that the generation difficulties may not be due to phonology alone, but isn't there any way around this?

I suggest multi-word expressions like this:

1   Осы  осы  _   prn dem|nom 3   advmod  _   _
2   себептен    себеп  _   n   abl 1   mwe _   _
3   бардым    бар  _   v   iv|past|p1|sg   0   root    _   _

or:

1   мен  мен  _   prn pers|nom    0   root    _   _
2   не    не    _   n   abl 1   cc  _   _
3   болмаса  болмаса  _   cnjcoo  _   2   mwe _   _
4   сен  сен  _   prn pers|nom    1   conj    _   _

Note everything is head initial: getting used to UD:)
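A minimal sketch of how such a split could be applied mechanically, assuming each row is a list of the ten fields used in the examples above (ID, form, lemma, an unused column, tag, features, head, relation, two unused columns). The name `split_multiword` and the `head_final` switch are made up for illustration, the unsplit input row is reconstructed from the first example, and the lemma and tags of each part still have to be supplied by hand.

```python
def split_multiword(rows, token_id, parts, head_final=False):
    """Split one multi-word token into several tokens joined by `mwe`.

    rows:     list of 10-field lists (all fields are strings)
    token_id: integer ID of the token to split
    parts:    list of (form, lemma, tag, feats) tuples, in surface order
    """
    extra = len(parts) - 1

    def shift(old_id):
        # IDs to the right of the split token move by the number of added tokens
        return old_id + extra if old_id > token_id else old_id

    # New ID of the part that keeps the original head and relation
    head_pos = token_id + extra if head_final else token_id

    out = []
    for row in rows:
        rid, head = int(row[0]), int(row[6])
        if rid != token_id:
            # Dependents of the split token now point at the carrying part
            new_head = head_pos if head == token_id else shift(head)
            out.append([str(shift(rid))] + row[1:6] + [str(new_head)] + row[7:])
            continue
        for offset, (form, lemma, tag, feats) in enumerate(parts):
            new_id = token_id + offset
            if new_id == head_pos:
                h, rel = shift(head), row[7]
            else:
                h, rel = head_pos, "mwe"
            out.append([str(new_id), form, lemma, "_", tag, feats, str(h), rel, "_", "_"])
    return out


# Unsplit version of the first example (an assumption, reconstructed by hand)
rows = [
    ["1", "Осы себептен", "осы себептен", "_", "cnjadv", "_", "2", "advmod", "_", "_"],
    ["2", "бардым", "бар", "_", "v", "iv|past|p1|sg", "0", "root", "_", "_"],
]
parts = [("Осы", "осы", "prn", "dem|nom"), ("себептен", "себеп", "n", "abl")]

for r in split_multiword(rows, 1, parts):
    print(r[0], r[1], r[6], r[7])
# 1 Осы 3 advmod
# 2 себептен 1 mwe
# 3 бардым 0 root
```

Passing `head_final=True` attaches Осы to себептен instead, which is the direction a head-final analysis would want.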

Dependency: The current treebank treats this in two ways:

1) advmod:

1   Себебі    себебі    _   cnjadv  _   8   advmod  _   _
2   :   :   _   sent    _   8   punct   _   _
3   ата-анасы   ата-ана   _   n   px3sp|nom   5   conj    _   _
4   ,   ,   _   cm  _   5   punct   _   _
5   ағайын-туғаны   ағайын-туғаны   _   n   px3sp|nom   8   subj    _   _
6   бір  бір  _   num _   7   nummod  _   _
7   жағынан  жақ  _   n   px3sp|abl   8   nmod    _   _
8   бұзып  бұз  _   v   tv|prc_perf 0   root    _   _
9   жатыр  жат  _   vaux    pres|p3|sg  8   aux _   _
10  .   .   _   sent    _   8   punct   _   _

2) cc:

1   Осы себептен осы себептен _   cnjadv  _   4   cc  _   _
2   олар    олар    _   prn pers|p3|pl|nom  4   subj    _   _
3   далада    дала    _   n   loc 4   nmod    _   _
4   ойнай  ойна    _   v   tv|prc_impf 0   root    _   _
5   алмады    ал    _   vaux    neg|ifi|p3|pl   4   aux _   _
6   .   .   _   sent    _   4   punct   _   _

According to UD, whenever a SCONJ introduces an adverbial clause (which seems to be its main function anyway), it should depend on the predicate of that clause, and the relation used is mark.

These are the frequencies of relations used with SCONJ in the English treebank:

   4462 mark
     92 case
     58 mwe
     10 nmod
      6 advmod
      2 conj
      1 reparandum

Note it's never cc.
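Counts like these can be reproduced with a few lines of Python. This is only a sketch: it assumes a tab-separated 10-column CoNLL-U file with the UD tag in the fourth column and the relation in the eighth, as in the English treebank. `deprel_counts` is an illustrative name, not part of any UD tool, and the three sample rows are taken from English sentences quoted in this thread.

```python
from collections import Counter


def deprel_counts(lines, pos="SCONJ", pos_col=3, deprel_col=7):
    """Count the relations carried by tokens with the given POS tag."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip anything that is not a plain 10-column token row
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        if fields[pos_col] == pos:
            counts[fields[deprel_col]] += 1
    return counts


sample = [
    "\t".join(["5", "after", "after", "SCONJ", "IN", "_", "7", "advmod", "_", "_"]),
    "\t".join(["1", "Because", "because", "SCONJ", "IN", "_", "4", "mark", "_", "_"]),
    "\t".join(["3", "Fries", "fries", "NOUN", "NNS", "Number=Plur", "4", "nsubj", "_", "_"]),
]

for rel, n in deprel_counts(sample).most_common():
    print(n, rel)
```

For a real treebank, the same function can be fed an open file object, since iterating over a file yields its lines.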

Here's one of the sentences where it's advmod:

1       A       a       DET     DT      Definite=Ind|PronType=Art       2       det     _       _
2       couple  couple  NOUN    NN      Number=Sing     5       nmod:npmod      _       _
3       of      of      ADP     IN      _       4       case    _       _
4       months  month   NOUN    NNS     Number=Plur     2       nmod    _       _
5       after   after   SCONJ   IN      _       7       advmod  _       _
6       I       I       PRON    PRP     Case=Nom|Number=Sing|Person=1|PronType=Prs      7       nsubj   _       _
7       got     get     VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        0       root    _       _
8       parakeets       parakeet        NOUN    NNS     Number=Plur     10      compound        _       SpaceAfter=No
9       /       /       PUNCT   ,       _       10      punct   _       SpaceAfter=No
10      budgies budgie  NOUN    NNS     Number=Plur     7       dobj    _       SpaceAfter=No
11      .       .       PUNCT   .       _       7       punct   _       _

Could be mark in my opinion.

Here's one of the sentences where there is no subordinate clause and mark is still used:

1   Because because SCONJ   IN  _   4   mark    _   _
2   Large   large   ADJ JJ  Degree=Pos  3   amod    _   _
3   Fries   fries   NOUN    NNS Number=Plur 4   nsubj   _   _
4   give    give    VERB    VBP Mood=Ind|Tense=Pres|VerbForm=Fin    0   root    _   _
5   you you PRON    PRP Case=Acc|Person=2|PronType=Prs  4   iobj    _   _
6   FOUR    four    NUM CD  NumType=Card    7   nummod  _   _
7   PIECES  piece   NOUN    NNS Number=Plur 4   dobj    _   SpaceAfter=No
8   !   !   PUNCT   .   _   4   punct   _   _

Looks similar to the first example for Kazakh (1 advmod)

So given all this, I suggest using the mark relation.
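For the existing treebank rows, that suggestion amounts to a small rewriting pass. A rough sketch, assuming rows are lists of ten fields with the Apertium tag in the fifth column, and assuming the UD tag goes into the otherwise empty fourth column as it does in CoNLL-U. `retag_cnjadv` is a hypothetical helper; it leaves tokenization alone and only touches the two relations (advmod and cc) that the current treebank uses for cnjadv.

```python
def retag_cnjadv(row):
    """Return a copy of the row with cnjadv mapped to SCONJ and attached via mark."""
    row = list(row)
    if row[4] == "cnjadv":
        row[3] = "SCONJ"  # assumption: the UD tag belongs in the fourth column
        if row[7] in ("advmod", "cc"):
            row[7] = "mark"
    return row


before = ["1", "Себебі", "себебі", "_", "cnjadv", "_", "8", "advmod", "_", "_"]
after = retag_cnjadv(before)
print(after[3], after[7])
# SCONJ mark
```

Rows with any other tag pass through unchanged, so the function can be mapped over a whole sentence.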

TL;DR:

  1. cnjadv => SCONJ;
  2. tokenize separately whenever possible;
  3. the mark relation should be preferred.

makazhan commented 8 years ago

BTW, shouldn't бірақ here be SCONJ:

1   Бірақ  бірақ  _   cnjcoo  _   4   cc  _   _
2   кеше    кеше    _   adv _   4   advmod  _   _
3   өте  өте  _   adv _   4   advmod  _   _
4   суық    суық    _   adj _   0   root    _   _
5   еді  е  _   cop ifi|p3|sg   4   cop _   _
6   !   !   _   sent    _   4   punct   _   _

jonorthwash commented 8 years ago

In terms of tokenisation, the mwe relationship is fine, but I really don't like the head-initial aspect of it. Kazakh just is not head initial, and I don't understand why the UD devs keep insisting that all languages need to be treated that way. :(

jonorthwash commented 8 years ago

I'm also okay with SCONJ for cnjadv. I'd like to understand mark better and your argument for using it, though.

ftyers commented 8 years ago

Yeah, I'm fine with SCONJ for cnjadv too. And I'm fine with mark:

A marker is the word introducing a finite clause subordinate to another clause. For a complement clause, this is words like [en] that or whether. For an adverbial clause, the marker is typically a subordinating conjunction like [en] while or although. The mark is a dependent of the subordinate clause head. In a relative clause, it is a normally uninflected word, which simply introduces a relative clause, such as [he] še. (In this last use, one needs to distinguish between relative clause markers, which are mark, from relative pronouns, which fill a regular verbal argument or modifier grammatical relation.)

We have cnjsub for complement clauses and cnjadv for adverbial clauses.

jonorthwash commented 8 years ago

Wait, what are we proposing using mark for? That description doesn't sound like anything that Kazakh has...

jonorthwash commented 8 years ago

Себебі can introduce a subordinate clause, as can things like өйткені, but осы себептен, не болмаса and the like cannot.

ftyers commented 8 years ago

The reason given in Uppsala was basically that a lot of these things aren't "really" dependency relations. So e.g. co-ordination is not a dependency relation, nor is the mwe or name relation. And as they are not real dependency relations, they advocate a technical solution, which is to make everything attach to the left. I don't agree with this, but I can see their reasoning, and the best we can do is just keep arguing against it. Note that for the first version I don't think it matters: there are going to be plenty of treebanks that don't follow that rule yet. And perhaps it will change later.

jonorthwash commented 8 years ago

Okay, so let's do head-final stuff? And I still don't agree about mark for words like осы себептен and не болмаса.

makazhan commented 8 years ago

I was kidding about getting used to UD head-initial stuff :) I'm pro-head-final! :)

My intuition on subordinating conjunction is that it is a function word that connects clauses.

Wikipedia says:

Subordinating conjunctions, also called subordinators, are conjunctions that join an independent clause and a dependent clause, and also introduce adverb clauses. The most common subordinating conjunctions in the English language include after, although, as, as far as, as if, as long as, as soon as, as though, because, before, even if, even though, every time, if, in order that, since, so, so that, than, though, unless, until, when, whenever, where, whereas, wherever, and while. Complementizers can be considered to be special subordinating conjunctions that introduce complement clauses: e.g. "I wonder whether he'll be late. I hope that he'll be on time". Some subordinating conjunctions (until and while), when used to introduce a phrase instead of a full clause, become prepositions with identical meanings.

As far as I know, there are no complementizers in Kazakh, i.e. everything works through verbs. That leaves us with conjunctions that introduce adverbial clauses (by and large).

For this purpose UD has a special relation mark, which to me seems appropriate. What SCONJ does doesn't seem like coordination to me anyway. It's one of those "synthetic" dependencies, which might as well be mark.

On the other hand, this example from English seems to be guided by the formal use of because as SCONJ, and it uses the mark relation when there's no subordinate clause whatsoever (or when one is implied by a question to which this sentence seems to be an answer):

1   Because because SCONJ   IN  _   4   mark    _   _
2   Large   large   ADJ JJ  Degree=Pos  3   amod    _   _
3   Fries   fries   NOUN    NNS Number=Plur 4   nsubj   _   _
4   give    give    VERB    VBP Mood=Ind|Tense=Pres|VerbForm=Fin    0   root    _   _
5   you you PRON    PRP Case=Acc|Person=2|PronType=Prs  4   iobj    _   _
6   FOUR    four    NUM CD  NumType=Card    7   nummod  _   _
7   PIECES  piece   NOUN    NNS Number=Plur 4   dobj    _   SpaceAfter=No
8   !   !   PUNCT   .   _   4   punct   _   _

Which is fine by me, and I am inclined to treat бірақ accordingly in the following:

1   Бірақ  бірақ  _   cnjcoo  _   4   cc  _   _
2   кеше    кеше    _   adv _   4   advmod  _   _
3   өте  өте  _   adv _   4   advmod  _   _
4   суық    суық    _   adj _   0   root    _   _
5   еді  е  _   cop ifi|p3|sg   4   cop _   _
6   !   !   _   sent    _   4   punct   _   _

What I cannot understand is why you guys have a special treatment for осы себептен.

How is it different from сол/бұл/сондай/белгілі себептен(терден)?

To me it's just a simple noun phrase that happens to be an adjunct of a verb and should be subordinated as nmod.

jonorthwash commented 8 years ago

I think I follow your argument now. Would егер be a mark too then?

"To me it's just a simple noun phrase that happens to be an adjunct of a verb and should be subordinated as nmod"

That's probably right. The reason is that it's in our transducer as a multiword, mainly for translation purposes. In theory, it shouldn't be in the vanilla transducer (which is probably what we should be using), except that it probably is in there because the whole transducer is a mess.