UniversalDependencies / UD_Portuguese-PUD

Parallel Universal Dependencies.

root analysis wrong: `diferente' #14

Open vcvpaiva opened 3 years ago

vcvpaiva commented 3 years ago

sent_id = n01001013
text = Para aqueles que seguem as transições das redes sociais no Capitol Hill, esta será um pouco diferente.
text_en = For those who follow social media transitions on Capitol Hill, this will be a little different.

1 Para ADP IN 2 case
2 aqueles PRON PDEM Gender=Masc|Number=Plur 19 nmod ToDo=nmod
3 que PRON WP 4 nsubj
4 seguem VERB VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 2 acl:relcl
5 as DET DT Gender=Fem|Number=Plur 6 det
6 transições NOUN NN Gender=Fem|Number=Plur 4 obj
7-8 das
7 de de ADP INDT 9 case
8 as o DET Gender=Fem|Number=Plur 9 det
9 redes NOUN NN Gender=Fem|Number=Plur 6 nmod
10 sociais ADJ JJ Gender=Fem|Number=Plur 9 amod
11-12 no
11 em em ADP INDT 13 case
12 o o DET Gender=Masc|Number=Sing 13 det
13 Capitol PROPN NNP Gender=Masc|Number=Sing 4 obl
14 Hill PROPN NNP Foreign=Yes|Gender=Masc|Number=Sing 13 flat SpaceAfter=No
15 , PUNCT , 2 punct
16 esta PRON PDEM Gender=Fem|Number=Sing 19 nsubj
17 será AUX VBC Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Fut 19 cop
18 um DET DT Gender=Masc|Number=Sing 19 det
19 pouco NOUN NN Gender=Masc|Number=Sing 0 root
20 diferente ADJ JJ Gender=Masc|Number=Sing 19 amod SpaceAfter=No
21 . PUNCT . 19 punct

The root should be the adjective 'diferente', as in English ('different'), not the word 'pouco', which here modifies 'diferente'; it's NOT a noun.

vcvpaiva commented 3 years ago

another one: (root in EN is `average/media' not the verb "ESTAR")

sent_id = n01004017
text = Eles estão, na média nacional, na categoria 4 e, melhores que a média nacional, na categoria 8.
text_en = They are at the national average in grade 4 and better than national average in grade 8.

1 Eles PRON PRP Case=Nom|Gender=Masc|Number=Plur|Person=3 2 nsubj
2 estão VERB VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 0 root SpaceAfter=No
3 , PUNCT , 6 punct
4-5 na
4 em em ADP INDT 6 case
5 a o DET Gender=Fem|Number=Sing 6 det
6 média NOUN NN Gender=Fem|Number=Sing 2 obl
7 nacional ADJ JJ Gender=Fem|Number=Sing 6 amod SpaceAfter=No
8 , PUNCT , 11 punct
9-10 na ToDo=ex-adp-child
9 em em ADP INDT 11 case
10 a o DET Gender=Fem|Number=Sing 11 det
11 categoria NOUN NN Gender=Fem|Number=Sing 6 nmod
12 4 NUM CD 11 appos
13 e CCONJ CC 15 cc SpaceAfter=No|ToDo=ex-adp-child
14 , PUNCT , 13 punct
15 melhores ADJ JJR Gender=Masc|Number=Plur 6 conj ToDo=ex-adp-child
16 que ADP IN 18 case
17 a DET DT Gender=Fem|Number=Sing 18 det
18 média NOUN NN Gender=Fem|Number=Sing 15 nmod ToDo=nmod
19 nacional ADJ JJ Gender=Fem|Number=Sing 18 amod SpaceAfter=No
20 , PUNCT , 23 punct
21-22 na
21 em em ADP INDT 23 case
22 a o DET Gender=Fem|Number=Sing 23 det
23 categoria NOUN NN Gender=Fem|Number=Sing 15 nmod ToDo=nmod
24 8 NUM CD 23 appos SpaceAfter=No
25 . PUNCT . 2 punct

vcvpaiva commented 3 years ago

another one: root in EN is power, root in PT is the verb "to be"

newdoc id = n01018
sent_id = n01018024
text = É como um superpoder, às vezes.
text_en = It's like a super power sometimes.

1 É VERB VBC Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres 0 root
2 como ADP IN 4 case
3 um DET DT Gender=Masc|Number=Sing 4 det
4 superpoder NOUN NN Gender=Masc|Number=Sing 1 obl SpaceAfter=No
5 , PUNCT , 6 punct
6-7 às
6 a a ADP INDT 1 discourse
7 as o DET Gender=Fem|Number=Plur 6 fixed
8 vezes NOUN NN Gender=Fem|Number=Plur 6 fixed SpaceAfter=No
9 . PUNCT . 1 punct _
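
The check behind these reports (compare the root in EN with the root in PT for the same sent_id) could be automated. What follows is only a minimal sketch, assuming both treebanks are available as standard tab-separated, 10-column CoNLL-U. The function names `read_roots` and `root_mismatches` are hypothetical, the PT sample copies two tokens of n01018024 with unshown columns filled with `_`, and the EN sample is a toy one-line fragment that encodes only what is said above (root in EN is "power"); its token ID is illustrative.

```python
def read_roots(conllu_text):
    """Map each sent_id to the (FORM, UPOS) of its root token."""
    roots = {}
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges ("6-7") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[6] == "0":  # HEAD column
            roots[sent_id] = (cols[1], cols[3])
    return roots


def root_mismatches(text_a, text_b):
    """Return the sent_ids whose root UPOS differs between two parallel treebanks."""
    roots_a, roots_b = read_roots(text_a), read_roots(text_b)
    return {
        sid: (roots_a[sid], roots_b[sid])
        for sid in sorted(roots_a.keys() & roots_b.keys())
        if roots_a[sid][1] != roots_b[sid][1]
    }


# Two tokens of n01018024 as quoted above; unshown columns are filled with "_".
PT_SAMPLE = "\n".join([
    "# sent_id = n01018024",
    "\t".join(["1", "É", "_", "VERB", "VBC", "_", "0", "root", "_", "_"]),
    "\t".join(["4", "superpoder", "_", "NOUN", "NN", "_", "1", "obl", "_", "_"]),
])

# Toy EN fragment (assumption): only the root line, with an illustrative token ID.
EN_SAMPLE = "\n".join([
    "# sent_id = n01018024",
    "\t".join(["6", "power", "power", "NOUN", "NN", "_", "0", "root", "_", "_"]),
])

if __name__ == "__main__":
    for sid, (pt_root, en_root) in root_mismatches(PT_SAMPLE, EN_SAMPLE).items():
        print(sid, "PT root:", pt_root, "EN root:", en_root)
```

A mismatch in root UPOS is only a hint that a sentence deserves a look; translations can legitimately differ in structure, so each hit would still need manual review.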

vcvpaiva commented 3 years ago

This one seems wrong in EN; maybe add another issue there. The root seems to me to be 'attacks', not 'likely'!

sent_id = n01021011
text = The "recent events" are likely to be the attacks of 21 October that briefly took down popular websites such as Reddit, Twitter and Spotify as well as many others.

1 The the DET DT Definite=Def|PronType=Art 4 det 4:det
2 " " PUNCT `` 4 punct 4:punct SpaceAfter=No
3 recent recent ADJ JJ Degree=Pos 4 amod 4:amod
4 events event NOUN NNS Number=Plur 7 nsubj 7:nsubj|11:nsubj:xsubj SpaceAfter=No
5 " " PUNCT '' 4 punct 4:punct
6 are be AUX VBP Mood=Ind|Tense=Pres|VerbForm=Fin 7 cop 7:cop
7 likely likely ADJ JJ Degree=Pos 0 root 0:root
8 to to PART TO 11 mark 11:mark
9 be be AUX VB VerbForm=Inf 11 cop 11:cop
10 the the DET DT Definite=Def|PronType=Art 11 det 11:det
11 attacks attack NOUN NNS Number=Plur 7 xcomp 7:xcomp|17:nsubj
12 of of ADP IN 13 case 13:case
13 21 21 NUM CD NumType=Card 11 nmod 11:nmod:of
14 October October PROPN NNP Number=Sing 13 flat 13:flat
15 that that PRON WDT PronType=Rel 17 nsubj 11:ref
16 briefly briefly ADV RB 17 advmod 17:advmod
17 took take VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 11 acl:relcl 11:acl:relcl
18 down down ADP RP 17 compound:prt 17:compound:prt
19 popular popular ADJ JJ Degree=Pos 20 amod 20:amod
20 websites website NOUN NNS Number=Plur 17 obj 17:obj
21 such such ADJ JJ Degree=Pos 23 case 23:case
22 as as ADP IN 21 fixed 21:fixed
23 Reddit Reddit PROPN NNP Number=Sing 20 nmod 20:nmod:such_as SpaceAfter=No
24 , , PUNCT , 25 punct 25:punct _
25 Twitter Twitter PROPN NNP Number=Sing 23 conj 20:nmod:such_as|23:conj:as_well_as
26 and and CCONJ CC 27 cc 27:cc
27 Spotify Spotify PROPN NNP Number=Sing 23 conj 20:nmod:such_as|23:conj:and
28 as as ADV RB 32 cc 32:cc
29 well well ADV RB Degree=Pos 28 fixed 28:fixed
30 as as ADP IN 28 fixed 28:fixed
31 many many ADJ JJ Degree=Pos 32 amod 32:amod
32 others other NOUN NNS Number=Plur 23 conj 20:nmod:such_as|23:conj:and SpaceAfter=No
33 . . PUNCT . 7 punct 7:punct _

dan-zeman commented 3 years ago

The last one (English) is IMHO correct. The adjective likely is modified by the infinitival clause to be the attacks...

arademaker commented 3 years ago

We have 3 cases of estão as root:

% awk '$2 ~ /^estão/' pt_pud-ud-test.conllu | grep root
2   estão   _   VERB    VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 0   root    _   SpaceAfter=No
5   estão   _   VERB    VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 0   root    _   _
2   estão   _   VERB    VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 0   root    _   _

But I agree that estar is cop in Portuguese, where it is called a verbo de ligação (linking verb). So can we open another issue to consider all cases of estar/estão?
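
To review all cases of estar rather than only the roots, the awk one-liner could be extended to report the sent_id and the relation of every hit. A minimal Python sketch, assuming tab-separated 10-column CoNLL-U: since the LEMMA column is `_` in the output above, it matches on FORM. The function name `estar_verb_tokens` and the form list (present indicative only) are assumptions, and the sample reuses three tokens quoted in this thread with unshown columns filled with `_`.

```python
# Present-indicative forms of "estar" only; a real survey would need the full paradigm.
ESTAR_FORMS = {"estou", "estás", "está", "estamos", "estais", "estão"}


def estar_verb_tokens(conllu_text):
    """List (sent_id, token ID, FORM, DEPREL) for forms of 'estar' tagged VERB."""
    hits = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        # Skip comments, blank lines and multiword-token ranges such as "4-5".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[1].lower() in ESTAR_FORMS and cols[3] == "VERB":
            hits.append((sent_id, int(cols[0]), cols[1], cols[7]))
    return hits


def row(*cols):
    """Join ten column values into one CoNLL-U token line."""
    return "\t".join(cols)


# Three tokens quoted in this thread; unshown columns are filled with "_".
SAMPLE = "\n".join([
    "# sent_id = n01022027",
    row("1", "É", "_", "AUX", "VBC", "_", "2", "cop", "_", "_"),
    row("20", "estão", "_", "VERB", "VBC", "_", "2", "conj", "_", "_"),
    "",
    "# sent_id = n01070016",
    row("5", "estão", "_", "VERB", "VBC", "_", "0", "root", "_", "_"),
])

if __name__ == "__main__":
    for hit in estar_verb_tokens(SAMPLE):
        print(*hit)
```

Unlike the grep on `root`, this also surfaces non-root instances (for example `conj`), which is where a purely root-based search would miss candidates.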

Regarding um pouco X: in http://github.com/universaldependencies/UD_Portuguese-Bosque, this construction is analyzed as fixed. I don't like it. I would prefer the way I did it in the last commit: um/DET det> X and pouco/ADV advmod> X.
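
The preferred analysis (um/DET det> X and pouco/ADV advmod> X) amounts to swapping the head of the phrase from pouco to X. A rough sketch of that rewrite, under stated assumptions: tokens are held as a dict from token ID to a small record, the function name `reanalyse_um_pouco` is hypothetical, and the fragment is tokens 16 to 21 of n01001013 as quoted at the top of this issue. What the commit actually changed is not shown in this thread.

```python
def reanalyse_um_pouco(tokens):
    """Rewrite 'um pouco X' so that X heads the phrase.

    tokens maps token ID -> {"form", "upos", "head", "deprel"}.
    Returns a modified copy; the input is left untouched.
    """
    out = {i: dict(t) for i, t in tokens.items()}
    for i in sorted(tokens):
        um, pouco, x = tokens.get(i - 1), tokens[i], tokens.get(i + 1)
        if um is None or x is None:
            continue
        if (um["form"].lower(), pouco["form"].lower()) != ("um", "pouco"):
            continue
        if x["head"] != i:
            continue  # X does not hang on pouco, so leave this occurrence alone
        # X takes over pouco's place in the tree.
        out[i + 1]["head"] = pouco["head"]
        out[i + 1]["deprel"] = pouco["deprel"]
        # Everything that still hangs on pouco (subject, copula, punctuation...) moves to X.
        for t in out.values():
            if t["head"] == i:
                t["head"] = i + 1
        out[i].update(upos="ADV", head=i + 1, deprel="advmod")
        out[i - 1].update(head=i + 1, deprel="det")
    return out


def tok(form, upos, head, deprel):
    return {"form": form, "upos": upos, "head": head, "deprel": deprel}


# Tokens 16-21 of n01001013 as quoted in this issue (fragment only).
FRAGMENT = {
    16: tok("esta", "PRON", 19, "nsubj"),
    17: tok("será", "AUX", 19, "cop"),
    18: tok("um", "DET", 19, "det"),
    19: tok("pouco", "NOUN", 0, "root"),
    20: tok("diferente", "ADJ", 19, "amod"),
    21: tok(".", "PUNCT", 19, "punct"),
}

if __name__ == "__main__":
    for i, t in sorted(reanalyse_um_pouco(FRAGMENT).items()):
        print(i, t["form"], t["upos"], t["head"], t["deprel"])
```

On the full sentence the same loop would also move token 2 (aqueles), which currently points at 19, over to diferente.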

The comment about a possible error in English should be reported in the English repository.

arademaker commented 3 years ago

commit c16be39 updated stats.xml

vcvpaiva commented 3 years ago

@dan-zeman, my understanding of the sentence is that `likely' is not an adjective here, but an adverb. The sentence could be paraphrased as: Possibly, the "recent events" are the attacks of 21 October that did X (= briefly took down popular websites such as Reddit, Twitter and Spotify as well as many others). If so, "attacks" should be the root.

vcvpaiva commented 3 years ago

@arademaker 'estar' can be either an auxiliary or a full verb. I can't tell from the lines above whether this commit was a good change or not.

I know that it does not cover all the necessary modifications, because there are instances of "estar" treated as full verbs that are not "root", e.g.

sent_id = n01022027
text = É fantástico que eles tenham conseguido o Acordo de Paris, mas as suas contribuições no momento não estão nem perto do objectivo de 1.5 grau.
text_en = It's fantastic that they got the Paris Agreement but their contributions at the moment are nowhere near the 1.5-degree target.

1 É AUX VBC Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres 2 cop
2 fantástico ADJ JJ Gender=Masc|Number=Sing 0 root
3 que ADP IN 6 mark
4 eles PRON PRP Case=Nom|Gender=Masc|Number=Plur|Person=3 6 nsubj
5 tenham VERB VBC Aspect=Imp|Mood=Sub|Number=Plur|Person=3|Tense=Pres 6 aux
6 conseguido VERB VBN Aspect=Perf 2 csubj
7 o DET DT Gender=Masc|Number=Sing 8 det
8 Acordo NOUN NN Gender=Masc|Number=Sing 6 obj Proper=True
9 de ADP IN 10 case Proper=True
10 Paris PROPN NNP Gender=Fem|Number=Sing 8 nmod SpaceAfter=No
11 , PUNCT , 12 punct
12 mas CCONJ CC 20 cc
13 as DET PDT Gender=Fem|Number=Plur 15 det:predet
14 suas PRON DTP$ Gender=Fem|Number=Plur|Number[psor]=Plur|Person=3|PronType=Prs 15 det
15 contribuições NOUN NN Gender=Fem|Number=Plur 20 nsubj
16-17 no
16 em em ADP INDT 18 case
17 o o DET Gender=Masc|Number=Sing 18 det
18 momento NOUN NN Gender=Masc|Number=Sing 15 nmod
19 não ADV RB Polarity=Neg 20 advmod
20 estão VERB VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 2 conj
21 nem ADV RB Polarity=Neg 22 advmod
22 perto ADV RB 20 advmod
23-24 do
23 de de ADP INDT 25 case
24 o o DET Gender=Masc|Number=Sing 25 det
25 objectivo NOUN NN Gender=Masc|Number=Sing 22 obl
26 de ADP IN 28 case
27 1.5 NUM CD Gender=Masc 28 nummod
28 grau NOUN NN Gender=Masc|Number=Sing 25 nmod SpaceAfter=No
29 . PUNCT . 2 punct _

Token 20 (estão) is not a full verb, but an auxiliary and cop.

HOWEVER, if one of your commits above is

newdoc id = n01070
sent_id = n01070016
text = Mais de 330 tripulantes estão a bordo do navio.
text_en = More than 330 crew are onboard the ship.

1 Mais ADV RBR 5 nsubj
2 de ADP IN 4 case
3 330 NUM CD Gender=Masc 4 nummod
4 tripulantes NOUN NN Gender=Masc|Number=Plur 1 obl
5 estão VERB VBC Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres 0 root
6 a ADP IN 7 case
7 bordo NOUN NN Gender=Masc|Number=Sing 5 obl
8-9 do
8 de de ADP INDT 10 case
9 o o DET Gender=Masc|Number=Sing 10 det
10 navio NOUN NN Gender=Masc|Number=Sing 7 nmod SpaceAfter=No
11 . PUNCT . 5 punct _

I think it's WRONG, as this instance of 'estar' is root and full verb, in my opinion. ('estar' here can be paraphrased as "permanecem".)

vcvpaiva commented 3 years ago

@arademaker If the two other "estão" (that you corrected) were

newdoc id = n01122
sent_id = n01122024
text = Super-heróis estão fora da experiência humana e isto também está, então o tratei como um drama, diz Zimmer.
text_en = “Superheroes are outside of human experience and so is this, so I treated it like a drama,” Zimmer says.

sent_id = n01004017
text = Eles estão, na média nacional, na categoria 4 e, melhores que a média nacional, na categoria 8.
text_en = They are at the national average in grade 4 and better than national average in grade 8.

these indeed needed correction.

dan-zeman commented 3 years ago

tripulantes estão a bordo do navio

According to the UD v2 guidelines, estão is a copula here, too (and paraphrasing doesn't play a role). There were vigorous debates about this point when the v2 guidelines were being prepared, as it goes against the grammatical tradition in several languages. But since the guidelines were passed, they should be followed as much as possible, so that the intended cross-language parallelism is achieved.
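
In tree terms, the copula treatment described here means that the predicate becomes the root and the verb hangs on it as cop. A minimal sketch of that conversion, not a definitive implementation: the function name `promote_copula_predicate`, the short `COPULA_FORMS` list, and the choice of the first `obl` dependent as the predicate are all assumptions. The sample is n01070016 as quoted above (the multiword-token line 8-9 is left out).

```python
COPULA_FORMS = frozenset({"é", "são", "está", "estão"})  # partial, illustrative list


def promote_copula_predicate(tokens, pred_deprels=("obl",)):
    """Re-root a sentence on its predicate and demote the copula verb to cop.

    tokens maps token ID -> {"form", "upos", "head", "deprel"}.
    Returns a modified copy; the input is left untouched.
    """
    out = {i: dict(t) for i, t in tokens.items()}
    root = next((i for i, t in tokens.items() if t["head"] == 0), None)
    if root is None or tokens[root]["form"].lower() not in COPULA_FORMS:
        return out
    # Assumption: the first dependent of the root with one of these relations is the predicate.
    pred = next(
        (i for i in sorted(tokens)
         if tokens[i]["head"] == root and tokens[i]["deprel"] in pred_deprels),
        None,
    )
    if pred is None:
        return out
    # Subject, punctuation and other dependents of the old root move to the predicate.
    for i, t in out.items():
        if t["head"] == root and i != pred:
            t["head"] = pred
    out[pred].update(head=0, deprel="root")
    out[root].update(upos="AUX", head=pred, deprel="cop")
    return out


def tok(form, upos, head, deprel):
    return {"form": form, "upos": upos, "head": head, "deprel": deprel}


# n01070016 as quoted in this thread.
SENT = {
    1: tok("Mais", "ADV", 5, "nsubj"),
    2: tok("de", "ADP", 4, "case"),
    3: tok("330", "NUM", 4, "nummod"),
    4: tok("tripulantes", "NOUN", 1, "obl"),
    5: tok("estão", "VERB", 0, "root"),
    6: tok("a", "ADP", 7, "case"),
    7: tok("bordo", "NOUN", 5, "obl"),
    8: tok("de", "ADP", 10, "case"),
    9: tok("o", "DET", 10, "det"),
    10: tok("navio", "NOUN", 7, "nmod"),
    11: tok(".", "PUNCT", 5, "punct"),
}

if __name__ == "__main__":
    for i, t in sorted(promote_copula_predicate(SENT).items()):
        print(i, t["form"], t["upos"], t["head"], t["deprel"])
```

Picking the predicate automatically is the fragile part: a root with several `obl` dependents would need a human decision, so the output is better treated as a proposal to review than as a finished annotation.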

vcvpaiva commented 3 years ago

ok, my bad then. (this one I do not resent, as I do the "United States" one, because copulas are crazy in any case!)

vcvpaiva commented 3 years ago

@arademaker corrected sent_id = n01001013, thanks!

But sent_id = n01004017 is still wrong.

@dan-zeman can you confirm that your sad-face symbol means you agree that "average" is the root, not the verb "to be/estar", as this is a copula?

vcvpaiva commented 3 years ago

@dan-zeman we disagree on the English sentence

text = The "recent events" are likely to be the attacks of 21 October that briefly took down popular websites such as Reddit, Twitter and Spotify as well as many others.

I insist that "likely" in this sentence is an adverb and that the root is "attacks", but the Portuguese version agrees with me, so no worries.

vcvpaiva commented 3 years ago

my bad on sentence

sent_id = n01004017
text = Eles estão, na média nacional, na categoria 4 e, melhores que a média nacional, na categoria 8.
text_en = They are at the national average in grade 4 and better than national average in grade 8.

above! The translation is COMPLETELY WRONG!!! It gets its own issue.

dan-zeman commented 3 years ago

sent_id = n01004017
text = Eles estão, na média nacional, na categoria 4 e, melhores que a média nacional, na categoria 8.

@vcvpaiva : It was a "confused face", according to Github's explanation :-) But I agree that estão should be treated as a copula and média should be the root. I suppose that this will hold even in the corrected translation.

vcvpaiva commented 3 years ago

Thanks @dan-zeman, indeed: even with the new translation, média should still be the root and estão should be treated as a copula.