Open makazhan opened 8 years ago
BTW, shouldn't бірақ here be SCONJ:
1 Бірақ бірақ _ cnjcoo _ 4 cc _ _
2 кеше кеше _ adv _ 4 advmod _ _
3 өте өте _ adv _ 4 advmod _ _
4 суық суық _ adj _ 0 root _ _
5 еді е _ cop ifi|p3|sg 4 cop _ _
6 ! ! _ sent _ 4 punct _ _
In terms of tokenisation, the mwe relationship is fine, but I really don't like the head-initial aspect of it. Kazakh just is not head initial, and I don't understand why the UD devs keep insisting that all languages need to be treated that way. :(
I'm also okay with SCONJ for cnjadv. I'd like to understand mark better and your argument for using it, though.
Yeah, I'm fine with SCONJ for cnjadv too. And I'm fine with mark:
A marker is the word introducing a finite clause subordinate to another clause. For a complement clause, this is words like [en] that or whether. For an adverbial clause, the marker is typically a subordinating conjunction like [en] while or although. The mark is a dependent of the subordinate clause head. In a relative clause, it is a normally uninflected word, which simply introduces a relative clause, such as [he] še. (In this last use, one needs to distinguish between relative clause markers, which are mark, from relative pronouns, which fill a regular verbal argument or modifier grammatical relation.)
We have cnjsub for complement clauses and cnjadv for adverbial clauses.
Wait, what are we proposing using mark for? That description doesn't sound like anything that Kazakh has...
Себебі can introduce a subordinate clause, and things like өйткені, but not осы себептен and не болмаса and the like.
The reason given in Uppsala was basically that a lot of these things aren't "really" dependency relations. So e.g. co-ordination is not a dependency relation, nor is the mwe or name relation. And as they are not real dependency relations, they advocate a technical solution, which is to make everything attach to the left. I don't agree with this, but I can see their reasoning, and the best we can do is just keep arguing against it. Note that for the first version, I don't think it matters; there are going to be plenty of treebanks that don't follow that rule yet. And perhaps it will change later.
Okay, so let's do head-final stuff? And I still don't agree about mark for words like осы себептен and не болмаса.
I was kidding about getting used to UD head-initial stuff :) I'm pro-head-final! :)
My intuition on subordinating conjunction is that it is a function word that connects clauses.
Wikipedia says:
Subordinating conjunctions, also called subordinators, are conjunctions that join an independent clause and a dependent clause, and also introduce adverb clauses. The most common subordinating conjunctions in the English language include after, although, as, as far as, as if, as long as, as soon as, as though, because, before, even if, even though, every time, if, in order that, since, so, so that, than, though, unless, until, when, whenever, where, whereas, wherever, and while. Complementizers can be considered to be special subordinating conjunctions that introduce complement clauses: e.g. "I wonder whether he'll be late. I hope that he'll be on time". Some subordinating conjunctions (until and while), when used to introduce a phrase instead of a full clause, become prepositions with identical meanings.
tmk, there are no complementizers in Kazakh, i.e. everything works through verbs. That leaves us with conjunctions that introduce adverbial clauses (by and large).
For this purpose UD has a special relation mark, which to me seems appropriate. What SCONJ does doesn't seem like coordination to me anyway. It's one of those "synthetic" dependencies, which might as well be mark.
On the other hand, this example from English seems to be guided by the formal use of because as SCONJ and uses the mark relation when there's no subordinating clause whatsoever (or it is implied in a question to which this sentence seems to be an answer):
1 Because because SCONJ IN _ 4 mark _ _
2 Large large ADJ JJ Degree=Pos 3 amod _ _
3 Fries fries NOUN NNS Number=Plur 4 nsubj _ _
4 give give VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 0 root _ _
5 you you PRON PRP Case=Acc|Person=2|PronType=Prs 4 iobj _ _
6 FOUR four NUM CD NumType=Card 7 nummod _ _
7 PIECES piece NOUN NNS Number=Plur 4 dobj _ SpaceAfter=No
8 ! ! PUNCT . _ 4 punct _ _
Which is fine by me, and I am inclined to treat бірақ accordingly in the following:
1 Бірақ бірақ _ cnjcoo _ 4 cc _ _
2 кеше кеше _ adv _ 4 advmod _ _
3 өте өте _ adv _ 4 advmod _ _
4 суық суық _ adj _ 0 root _ _
5 еді е _ cop ifi|p3|sg 4 cop _ _
6 ! ! _ sent _ 4 punct _ _
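For concreteness, here is a rough sketch of what that treatment would do to the rows above. It assumes the ten whitespace-separated columns follow the CoNLL-U order (ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC); `treat_as_mark` is a hypothetical helper made up for this illustration, not part of any existing tool.

```python
# The six rows quoted above, whitespace-separated as in this thread.
# Assumed column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
ROWS = [
    "1 Бірақ бірақ _ cnjcoo _ 4 cc _ _",
    "2 кеше кеше _ adv _ 4 advmod _ _",
    "3 өте өте _ adv _ 4 advmod _ _",
    "4 суық суық _ adj _ 0 root _ _",
    "5 еді е _ cop ifi|p3|sg 4 cop _ _",
    "6 ! ! _ sent _ 4 punct _ _",
]


def treat_as_mark(rows, lemmas):
    """Set UPOS to SCONJ and DEPREL to mark for tokens whose lemma is listed.

    Hypothetical helper: it only illustrates the proposal discussed here.
    """
    out = []
    for row in rows:
        cols = row.split()
        if cols[2] in lemmas:
            cols[3] = "SCONJ"  # UPOS column
            cols[7] = "mark"   # DEPREL column; HEAD (cols[6]) is left unchanged
        out.append(" ".join(cols))
    return out


new_rows = treat_as_mark(ROWS, {"бірақ"})
print(new_rows[0])  # 1 Бірақ бірақ SCONJ cnjcoo _ 4 mark _ _
```

The head stays at token 4 (суық), so бірақ still attaches to the predicate of its clause, which is what mark requires.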
What I cannot understand is why you guys have a special treatment for осы себептен.
How is it different from сол/бұл/сондай/белгілі себептен(терден)?
To me it's just a simple noun phrase that happens to be an adjunct of a verb and should be subordinated as nmod.
I think I follow your argument now. Would егер be a mark too then?
To me it's just a simple noun phrase that happens to be an adjunct of a verb and should be subordinated as nmod.
That's probably right. The reason is that it's in our transducer as a multiword, mainly for translation purposes. In theory, it shouldn't be in the vanilla transducer (which is probably what we should be using), except that it probably is in there because the whole transducer is a mess.
Francis states that cnjadv maps to SCONJ in UD. I agree with this. Jonathan?
Tokenization: It seems to me that currently multi-word cnjadv are treated as a single token, regardless of phonological stuff, e.g. осы себептен (same for conjuncts, e.g. не болмаса).
I really don't like this. I think that at least phonology-insensitive stuff should be tokenized separately. I understand that generation difficulties may not be only due to phonology, but isn't there any way around this?
I suggest multi-word expressions like this:
or:
Note everything is head-initial: getting used to UD :)
Dependency: Current treebank has a two-way treatment for this:
1) advmod:
2) cc:
According to UD, whenever SCONJ introduces an adverbial clause (which seems to be its main function anyway), it should depend on the predicate of that clause, and the relation used is mark.
These are the frequencies of relations used with SCONJ in the English treebank:
Note it's never cc.
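A tally like this can be reproduced for any treebank with a few lines of Python. This is only a minimal sketch: `deprel_counts` is a hypothetical helper, and the sample is just the first four rows of the English *Because Large Fries* sentence quoted in this thread, not the whole treebank, so the count it prints says nothing about the real frequencies.

```python
from collections import Counter

# First rows of the English example quoted in this thread. They are
# whitespace-separated here; real CoNLL-U files use tabs between columns.
SAMPLE = """\
1 Because because SCONJ IN _ 4 mark _ _
2 Large large ADJ JJ Degree=Pos 3 amod _ _
3 Fries fries NOUN NNS Number=Plur 4 nsubj _ _
4 give give VERB VBP Mood=Ind|Tense=Pres|VerbForm=Fin 0 root _ _
"""


def deprel_counts(lines, upos="SCONJ"):
    """Count the dependency relations of tokens carrying the given UPOS tag."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        # Prefer tab splitting (proper CoNLL-U); fall back to any whitespace.
        cols = line.split("\t") if "\t" in line else line.split()
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        if cols[3] == upos:
            counts[cols[7]] += 1  # column 8 is DEPREL
    return counts


counts = deprel_counts(SAMPLE.splitlines())
print(counts.most_common())  # [('mark', 1)]
```

To run it on a real file, pass an open file handle instead of `SAMPLE.splitlines()`.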
Here's one of the sentences where it's advmod:
Could be mark in my opinion.
Here's one of the sentences where there are no subordinate clauses and mark is still used:
Looks similar to the first example for Kazakh (1 advmod)
So, given all this, I suggest using the mark relation.
TL;DR: