UniversalDependencies / UD_Portuguese-Bosque

This is the Universal Dependencies (UD) Portuguese treebank.

Bosque-UD release 2.5 #271

Closed alvelvis closed 4 years ago

alvelvis commented 4 years ago

Hello! The following people took part in the corrections for release 2.5: Cláudia Freitas, Elvis de Souza, Aline Silveira, Tatiana Cavalcanti and Wograine Evelyn. Is changing only README.org to include the authors enough?

The documentation on UD in Portuguese (and for the Portuguese language) is on its way, with the latest decisions already made in this release.

arademaker commented 4 years ago

@alvelvis and @claudiafreitas,

Before merging the data, I checked the files to look for basic validation errors. I used the validate.py script from https://github.com/universaldependencies/tools

$ grep -i FAILED report-bosque-master.log  | wc -l
    1034
$ grep -i FAILED report-bosque-alvelvis.log | wc -l
    1153

That is, the number of files in documents/ that have errors increased by 119. Counting the individual errors found in those files, we have 150 more:

$ grep -i FAILED report-bosque-master.log | awk 'BEGIN {sum=0} {sum=sum+$5} END{print sum}'
3381
$ grep -i FAILED report-bosque-alvelvis.log | awk 'BEGIN {sum=0} {sum=sum+$5} END{print sum}'
3531
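For readers who prefer not to chain grep and awk, the same two numbers can be computed with a minimal Python sketch. It assumes, as the awk command above implies, that each summary line containing FAILED carries the error count in its fifth whitespace-separated field (for example `*** FAILED *** with 3 errors`); the sample log below is made up for illustration.

```python
def summarize(log_text):
    """Return (number of FAILED lines, sum of their error counts)."""
    failed_files = 0
    total_errors = 0
    for line in log_text.splitlines():
        # Case-insensitive match, like `grep -i FAILED`.
        if "failed" not in line.lower():
            continue
        failed_files += 1
        fields = line.split()
        # Fifth field is assumed to be the error count, like awk's $5.
        if len(fields) >= 5 and fields[4].isdigit():
            total_errors += int(fields[4])
    return failed_files, total_errors


# Made-up sample in the assumed format, not taken from the real logs.
sample = """\
*** PASSED ***
*** FAILED *** with 3 errors
*** FAILED *** with 1 errors
"""

print(summarize(sample))
```

Running it on the text of report-bosque-master.log and report-bosque-alvelvis.log should reproduce the counts shown above, provided the logs follow the assumed format.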

Are you aware of that? Did you execute any validation before this PR?

arademaker commented 4 years ago

I am attaching the logs. In the current master, only 4 errors remain if we ignore errors like

[Line 68 Sent CF241-2 Node 13]: [L3 Syntax punct-causes-nonproj] Punctuation must not cause non-projectivity of nodes [15]

See:

$ grep -v PASSED report-bosque-master.log | grep -v Punctuation | grep -v FAILED | grep -v 'Syntax errors'
[Line 167 Sent CF391-4 Node 55]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'orphan'.
[Line 43 Sent CF735-4 Node 8]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'obj'.
[Line 122 Sent CF852-6 Node 21]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'nsubj'.
[Line 15 Sent CP746-1 Node 10]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'nsubj'.

In your log, I found 651 errors like:

[Line 167 Sent CF27-4 Node 2]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (2:tiveram:aux --> 3:que:compound)
[Line 13 Sent CF28-1 Node 7]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (7:começou:aux --> 8:a:compound)
[Line 7 Sent CF49-1 Node 2]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (2:volta:aux --> 3:a:compound)
[Line 19 Sent CF49-3 Node 3]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (3:volta:aux --> 4:a:compound)
[Line 105 Sent CF55-4 Node 9]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (9:começa:aux --> 10:a:compound)
[Line 57 Sent CF92-5 Node 4]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (4:começou:aux --> 5:a:compound)
[Line 54 Sent CF94-2 Node 12]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (12:deixará:aux --> 13:de:compound)
...
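A per-category breakdown makes this kind of comparison easier than a series of `grep -v` filters. The sketch below counts messages by their test identifier (`leaf-aux-cop`, `orphan-parent`, and so on). The regular expression is an assumption derived only from the message lines quoted in this comment, not from the validator's source.

```python
import re
from collections import Counter

# Assumed message shape, based on the lines quoted in this thread:
#   [Line 167 Sent CF27-4 Node 2]: [L3 Syntax leaf-aux-cop] 'aux' not expected ...
MESSAGE = re.compile(
    r"^\[Line \d+ Sent \S+(?: Node \d+)?\]: \[L\d+ \w+ ([\w-]+)\]"
)


def count_by_test_id(log_text):
    """Count validator messages per test identifier."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MESSAGE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Four message lines copied from this thread.
sample = """\
[Line 68 Sent CF241-2 Node 13]: [L3 Syntax punct-causes-nonproj] Punctuation must not cause non-projectivity of nodes [15]
[Line 167 Sent CF391-4 Node 55]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'orphan'.
[Line 167 Sent CF27-4 Node 2]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (2:tiveram:aux --> 3:que:compound)
[Line 13 Sent CF28-1 Node 7]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (7:começou:aux --> 8:a:compound)
"""

for test_id, n in count_by_test_id(sample).most_common():
    print(test_id, n)
```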

report-bosque-alvelvis.log report-bosque-master.log

Could you please check or explain?

arademaker commented 4 years ago

Please consider that we have only 10 days for the next UD release! See https://cl.lingfil.uu.se/pipermail/ud/2019-October/000624.html

Note that the log they provided was not produced from the workbench branch; they used the dev branch, so please ignore it.

alvelvis commented 4 years ago

Hi @arademaker. This time we ran the validation script only to fix fatal errors. Thank you for running it and comparing the total number of errors.

Concerning the 651 errors with aux compound: this is a decision based on extensive research on the description of the Portuguese language. The solution we found is similar to the one in the UD guidelines (https://universaldependencies.org/u/dep/compound.html), which label the word "up" in "put up" as a compound of "put". The problem is that, in our case, the word "a" in "começar a" is a compound of an AUXiliary verb, not of a VERB, so the validation script sees it as an error. We suggest that the script skip the check for AUX with compound children, since this is a phenomenon of our language that was already discussed (in Portuguese) at a Portuguese description conference: http://comcorhd.letras.puc-rio.br/recomecando-a-discutir-as-locucoes-verbais/

We think it is reasonable to change the validation script, but I'm not sure at which point. I am quite certain that the script already skips the check for fixed children, so it should skip it for compound children too.
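To make the pattern concrete, here is a minimal sketch that lists the cases the validator reports as leaf-aux-cop in this PR: a token attached as `aux` that has a child attached as `compound`. It reads plain CoNLL-U text. The function name and the example sentence are made up for illustration, and this is not how validate.py itself performs the check.

```python
def aux_with_compound_children(conllu_sentence):
    """Return (aux form, child form) pairs where a token attached as 'aux'
    has a child attached as 'compound'."""
    rows = [line.split("\t") for line in conllu_sentence.splitlines()
            if line and not line.startswith("#")]
    # Keep basic tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
    tokens = {row[0]: row for row in rows if row[0].isdigit()}
    pairs = []
    for row in tokens.values():
        head = tokens.get(row[6])  # None for the root, whose HEAD is 0
        if head is None:
            continue
        # 0-based CoNLL-U columns: 1 = FORM, 6 = HEAD, 7 = DEPREL.
        if row[7].split(":")[0] == "compound" and head[7].split(":")[0] == "aux":
            pairs.append((head[1], row[1]))
    return pairs


# Made-up sentence, annotated the way this PR annotates "começar a".
example = "\n".join("\t".join(cols) for cols in [
    ["1", "Ele", "ele", "PRON", "_", "_", "4", "nsubj", "_", "_"],
    ["2", "começou", "começar", "AUX", "_", "_", "4", "aux", "_", "_"],
    ["3", "a", "a", "ADP", "_", "_", "2", "compound", "_", "_"],
    ["4", "trabalhar", "trabalhar", "VERB", "_", "_", "0", "root", "_", "_"],
])

print(aux_with_compound_children(example))
```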

As for the remaining errors, I'll soon open another pull request that resolves as many of them as possible.

jnivre commented 4 years ago

A compound to an auxiliary sounds strange. With all due respect for the existing literature, working in UD often requires us to depart from the existing descriptive tradition in order to achieve cross-linguistic consistency. Can you please describe the construction? Is it similar to French "commencer à" (English "begin to")?

alvelvis commented 4 years ago

Hi @jnivre , thank you for the question.

Among grammarians of Portuguese there is a consensus that in constructions like "começar a", "acabar de", "voltar a" etc. the first verb is an auxiliary, because it has lost its lexical meaning and become a grammatical particle that adds an aspectual value to the following verb.

For instance, "acabei de fazer" (lit. "finished to do") means "fiz agora/já fiz" (just did) and cannot be paraphrased by "finalizei de fazer"/"encerrei de fazer" (lit. "finished to do"). This shows that these are different verbs: "finalizar" has lexical meaning, while "acabar de" has only grammatical meaning.

So, we could follow the English and French treebanks and tag the first verb as VERB, but we would be violating the Portuguese grammatical tradition and losing the auxiliary character that the first verb does have.

As for the preposition, we just noticed that in French-Sequoia it is ADP/mark, and in English-EWT, PART/mark. As to the deprel, we reject that the "a" or "de" could be "case" or "mark". Not "case", because the preposition is not introducing any complement/adjunct, and prepositions do not link verbs/clauses. Not "mark", because, assuming this is a verbal phrase, there are not two clauses but only one, the first verb being an AUX.

Besides, the word "a" is mandatory for the verb "começar" to become a grammatical particle, and the same holds for the "de" in "acabar de". This is the reason we analyze it as an auxiliary multi-word expression (compound). Some grammarians even list VERB + ADP combinations as if they were a single verb, which encouraged us to consider them a single unit, a kind of verbal MWE. In our treebank, avoiding this issue leads to uncertainty/inconsistency in xcomp, if we assume there are two VERBs.

dan-zeman commented 4 years ago

Hi, I fail to see what makes Portuguese "finalizei de fazer" so different from English "finished to do" that we would want to abandon cross-linguistic parallelism here. I understand that treating "finalizei" as a full VERB would go against traditional Portuguese grammar, but it seems to be just one of many places where the traditional approach differs from the UD approach.

I do not know what you mean by uncertainty or inconsistency in xcomp. But if you want to preserve information that can be used to reconstruct the traditional-grammar view of things, you can define a language-specific extension of xcomp, e.g. xcomp:aspect.

jnivre commented 4 years ago

The question is whether Portuguese is really so different from the neighbouring Romance languages. The UD treebanks for French, Catalan and Spanish all treat this construction as a case of main verb + xcomp with mark. That is:

Jean commence à manger
nsubj(commence, Jean)
xcomp(commence, manger)
mark(manger, à)

The whole point of UD is to make sure that similar constructions are annotated in the same way across languages, so that we can overcome the (more or less accidental) differences in descriptive traditions that have been such a serious obstacle to progress in cross-lingual research up until five years ago. This means we all have to give up some of our favourite analyses in the interest of cross-linguistic comparability, and simply referring to language-specific descriptions is not really a valid argument. The borderline between full verbs and auxiliary verbs is blurry in many languages, but for the other Romance languages (as well as several other language groups) the presence of an infinitive marker (like "a" or "de" in Romance, "to" in English, etc.) has been taken as evidence that the bleached verb should not be annotated as an auxiliary. In all of these languages, it is possible to find grammatical descriptions that draw the line elsewhere (and, for example, treat the bleached verb + infinitive marker as some kind of multiword expression). But in the interest of cross-linguistic consistency, which is the sole reason for UD in the first place, we must all strive to look beyond these language-specific descriptions. I therefore strongly urge you to remain consistent with what is done for other Romance languages.

arademaker commented 4 years ago

@claudiafreitas , do you want to add something?

claudiafreitas commented 4 years ago

Hi all, Dan, “finalizei de fazer” simply doesn’t exist in Portuguese, and this is not an issue of usage or collocational patterns. It sounds so weird (compared to “acabei de fazer”) that the two seem to have completely different senses. Maybe a better example is “chegar a” (lit. “reach”) + VERB, better translated as “even”:

chegou a recomendar um preço (lit. “reached to recommend a price”) --> even recommended a price
chegou a desistir da prova (lit. “reached to give up the competition”) --> even gave up
chegou a afirmar que não iria --> even claimed that he would not go

In these cases, it is difficult – for us – to argue in favor of VERB / VERB...

Anyway, we totally agree that the main point of UD is to keep consistency between treebanks, and this involves understanding the guidelines in the same way. Since there was room in the AUX section for language-specific issues, we went ahead with the proposal. We just want to note that this issue is in a gray zone/no-man’s land in the description of Portuguese. And, actually, the whole thing began with the European Portuguese construction “estar A V-INF”, which corresponds to the Brazilian “estar V-GER”, and with what to do with the “a”, which doesn’t behave like a PREP. I must admit we feel a little uncomfortable with the combination ADP/mark, which is used only in these cases (in Portuguese). What about SCONJ/mark? But there is no problem at all in reverting the decision across the release corpus, or in using xcomp:aspect as suggested by Dan.

PS: I really don’t believe it is possible to “naturally” overcome differences in descriptive traditions. I believe in negotiating differences. Personally, I’m very comfortable with the idea of UD as a second language, and our job is to find the best translation from our target language. And, as in any translation, there are losses and a good deal of interpretation : )

arademaker commented 4 years ago

The following argument is still not clear to me:

“In our treebank, avoiding this issue impacts in uncertainty/inconsistency in xcomp, if we assume there are two VERBs.”

From @claudiafreitas's reply, I didn’t understand why the fact that only some verbs work in these constructions is an argument against the way other treebanks annotate similar constructions.

Yes, we say “acabei de fazer” and not “finalizei de fazer”, but why is that an argument for making “acabei” an AUX?

dan-zeman commented 4 years ago

Dan, “finalizei de fazer” simply doesn’t exist in Portuguese

Oops, sorry, I should have double-checked the paragraph I was copying the example from. What I actually wanted to say is that I fail to see what makes Portuguese "acabei de fazer" so different from English "finished to do".

vcvpaiva commented 4 years ago

hi all,

I am also not clear on the reasoning here, but it seems to me that what Claudia and Elvis are saying is that verbs like "acabar" (finish) and "começar" (begin) should be considered auxiliaries, hence they need a verbal compound made of an auxiliary plus a preposition. I do not see why these verbs would be auxiliaries instead of normal verbs, as is done in other Romance languages. I am no linguist, but these verbal expressions certainly seem to me very similar to "put up", "get on" etc.

Thanks Valeria

alvelvis commented 4 years ago

Hi all. In order to get closer to the way other treebanks analyse this issue, we are going to send another pull request with VERB + VERB constructions. @arademaker, notice that this analysis also differs from the previous Bosque-UD 2.4 version, in which PALAVRAS analysed these constructions as verbal phrases (AUX + VERB).

We will leave a tag in the MISC field to mark that those constructions were once verbal phrases, according to the former parser and the descriptive tradition.

Thank you for the good discussion.
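As an illustration of the MISC tag mentioned above, the sketch below adds a key=value pair to a CoNLL-U MISC column. The key `FormerAnalysis` and the value `VerbalPhrase` are hypothetical placeholders; the thread does not say which tag name the authors will use.

```python
def add_misc_tag(misc, key, value):
    """Add or overwrite key=value in a CoNLL-U MISC column value."""
    # "_" marks an empty MISC column in CoNLL-U.
    items = [] if misc == "_" else misc.split("|")
    # Drop an existing entry for the same key so it is not duplicated.
    items = [item for item in items if not item.startswith(key + "=")]
    items.append(f"{key}={value}")
    return "|".join(items)


# Hypothetical tag name and value, for illustration only.
print(add_misc_tag("_", "FormerAnalysis", "VerbalPhrase"))
print(add_misc_tag("SpaceAfter=No", "FormerAnalysis", "VerbalPhrase"))
```

A tag of this kind would let anyone who needs the traditional AUX + VERB view select the affected tokens without changing the UD-conformant dependency relations.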