UniversalDependencies / UD_Portuguese-PUD

Parallel Universal Dependencies.

lemmas and features #19

Open arademaker opened 4 years ago

arademaker commented 4 years ago

For verbs with Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres such as

 120 é AUX
  23 pode AUX
  21 está AUX
  20 está VERB
  18 tem VERB
  16 É AUX
  15 é VERB
  15 diz VERB
  13 tem AUX
  13 há VERB

they are all cases where it should have Person=2 (see https://www.conjugacao.com.br/verbo-ter/)
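
A tally like the one above can be reproduced with a short script. What follows is a minimal sketch, not the command actually used in this issue: the function name `count_forms_with_feats` is made up for illustration, and it assumes a standard 10-column, tab-separated CoNLL-U file with FORM, UPOS and FEATS in columns 2, 4 and 6.

```python
from collections import Counter

TARGET_FEATS = "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres"


def count_forms_with_feats(conllu_text, feats=TARGET_FEATS):
    """Count (FORM, UPOS) pairs whose FEATS column equals `feats` exactly."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[5] == feats:
            counts[(cols[1], cols[3])] += 1
    return counts
```

Calling `count_forms_with_feats(text).most_common(10)` on the contents of `pt_pud-ud-test.conllu` should give a ranking comparable to the list above, as long as the file still carries these feature strings.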

arademaker commented 4 years ago

Commit abaac2b updates stats.xml.

arademaker commented 4 years ago

Commit f2b3d5a introduced lemmas using https://github.com/LFG-PTBR/MorphoBr. Commit 04988d1 adds lemmas for PROPN and PUNCT.

arademaker commented 4 years ago

the current status is:

  1. tokens with lemma:

    % awk '$0 ~ /^[0-9]/ && $3 !~ /_/' pt_pud-ud-test.conllu | wc -l
    13939

  2. tokens still missing lemmas:

    % awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/' pt_pud-ud-test.conllu | wc -l
    9431
    % awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/ {print $4}' pt_pud-ud-test.conllu | sort | uniq -c | sort -nr
    2346 ADP
    2070 DET
    1496 VERB
     908 PRON
     844 AUX
     578 CCONJ
     471 NUM
     334 NOUN
     227 SCONJ
      94 ADJ
      40 SYM
      14 ADV
       8 X
       1 INTJ

Many of the verbs have errors in their features.
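
For anyone who prefers a script to the awk one-liners, here is a rough Python equivalent of the per-UPOS count. It is only a sketch under stated assumptions: the name `missing_lemmas_by_upos` is invented here, columns are split on tabs only, and a lemma counts as missing when it is exactly `_` (the awk pattern `$3 ~ /_/` also matches lemmas that merely contain an underscore, and awk splits on any whitespace, so the numbers can differ slightly).

```python
from collections import Counter


def missing_lemmas_by_upos(lines):
    """Count tokens whose LEMMA column is '_' and group them by UPOS."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only regular token lines: 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[2] == "_":
            counts[cols[3]] += 1
    return counts


# Usage sketch (file name taken from the commands above):
# with open("pt_pud-ud-test.conllu", encoding="utf-8") as f:
#     for upos, n in missing_lemmas_by_upos(f).most_common():
#         print(f"{n:5d} {upos}")
```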

vcvpaiva commented 4 years ago

@arademaker

> they are all cases where it should have Person=2

Are you sure? person=2? (tu, vos??) it doesn't make sense to me!

dan-zeman commented 4 years ago

As you know, I'm no speaker of Portuguese, but in the conjugation table you sent a link to, tem is listed as 3rd person singular present indicative, so the features Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres seem correct to me.

arademaker commented 4 years ago

In the case of tem, this form can be 3rd person singular simple present OR 2nd person singular imperative. All 31 occurrences of tem in this corpus have the feats Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres

What caught my attention was the Aspect=Imp. In the example below, we could change it to Aspect=Hab or remove it:

Ao longo da história, o mercado internacional de cabelo tem tido sempre uma dimensão política, diz Tarlo.

Previously, I made a mistake and mixed up Aspect=Imp with Mood=Imp ...

dan-zeman commented 4 years ago

OK. You don't seem to be using the Aspect feature in Bosque, so unless you think it will be useful to add it there too, removal is probably better.
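
If removal is the route taken, dropping one feature from a FEATS string is a small string operation. The helper below is a hypothetical sketch (`remove_feature` is not part of any UD tool mentioned in this thread); it assumes the usual `Name=Value|Name=Value` encoding with `_` for an empty feature set.

```python
def remove_feature(feats, name):
    """Return the FEATS string without the feature called `name`."""
    if feats == "_":
        return feats
    kept = [pair for pair in feats.split("|") if pair.split("=", 1)[0] != name]
    # An empty feature set is written as '_' in CoNLL-U.
    return "|".join(kept) if kept else "_"


print(remove_feature("Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres", "Aspect"))
# prints: Mood=Ind|Number=Sing|Person=3|Tense=Pres
```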

arademaker commented 4 years ago

I am consulting @leoalenc about this. From the Wikipedia page, it seems we don't have aspects in Portuguese. Yes, we don't have them in Portuguese-Bosque nor in the Portuguese-GSD.

vcvpaiva commented 4 years ago

Alexandre, the 2nd person singular imperative should be 'Tens', as in 'Tens que ir agora!' The 2nd person and the 3rd person are different. We use the third person as if it were 2nd, but the grammar is always different, I think. Checking with Leonel is good, as he's the linguist.

arademaker commented 4 years ago

Not according to the https://www.conjugacao.com.br/verbo-ter/ and MorphoBr:

35123:tens  ter+V+PRS+2+SG
30781:tem   ter+V+IMP+2+SG
30782:tem   ter+V+PRS+3+SG
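
The three entries above show why the form alone cannot settle the analysis: `tem` has two readings. A minimal sketch of reading such lines follows. It assumes only the layout visible above (form, whitespace, then lemma and tags joined by `+`, with an optional `N:` line-number prefix as printed by a search tool); the function names are made up and nothing here is MorphoBr's own tooling.

```python
import re
from collections import defaultdict

# Optional "12345:" prefix, then FORM, whitespace, LEMMA+TAG+TAG...
ENTRY = re.compile(r"^(?:\d+:)?(\S+)\s+([^+\s]+)\+(\S+)$")


def parse_entry(line):
    """Split one entry into (form, lemma, [tags])."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    form, lemma, tags = match.groups()
    return form, lemma, tags.split("+")


def analyses_by_form(lines):
    """Group all analyses under their surface form."""
    table = defaultdict(list)
    for line in lines:
        form, lemma, tags = parse_entry(line)
        table[form].append((lemma, tags))
    return table


entries = [
    "35123:tens  ter+V+PRS+2+SG",
    "30781:tem   ter+V+IMP+2+SG",
    "30782:tem   ter+V+PRS+3+SG",
]
table = analyses_by_form(entries)
print(len(table["tem"]))  # prints: 2
```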

vcvpaiva commented 4 years ago

ok, my bad, then. this language of ours is totally crazy! the affirmative and negative imperatives are different? oh dear!

leoalenc commented 4 years ago

@arademaker and @vcvpaiva, the present tense, like other tenses, is usually used to make commands, as in the example cited by @vcvpaiva. In this case, the verb "ter" (have) in the second person singular present indicative ("tens") is used as a modal verb. The example is interpreted as a command because the modal verb expresses necessity. Personal trainers and gym instructors tend to use the past tense to give instructions to their clients:

Agora levantou! (Stand up now!)

This does not mean, however, that the form in question is imperative. Form and usage (function) shouldn't be mixed up. Imperative, like other grammatical categories, is just a label for a group of forms, not a description of language usage. For example, the present tense is usually used to express actions in the past in historical narratives. It can also refer to future events:

Amanhã eu faço isso. (I'll do it tomorrow.)

It seems that all languages have similar form and function mismatches, e.g., infinitive forms as commands in German. I strongly recommend that you logicians and mathematicians read Paul Grice's seminal paper "Logic and conversation":

https://www.ucl.ac.uk/ls/studypacks/Grice-Logic.pdf

vcvpaiva commented 4 years ago

@leoalenc perfectly happy to believe you about the imperative of 'ter' in PT. still shocked about positive and negative imperative having different forms though!

but c'mon no gym instructor in my world says things like 'Levantou' and then it would be "stood up", as 'stand up' is totally present. meanwhile using the present tense to talk about the future is commonplace both in English and in Portuguese. and yes, I have read my Grice, this was not lack of Gricean-fu, but lack of grammar for imperative modes in Portuguese. Thanks!

arademaker commented 4 years ago

> but c'mon no gym instructor in my world says things like 'Levantou' and then it would be "stood up", as 'stand up' is totally present.

I agree with @leoalenc. I hear people saying it. In written form it may be difficult to understand the use, but people do use it.

vcvpaiva commented 4 years ago

@leoalenc a bit of help on something else, please, if I may. we have all these issues with clitics in PT. the processing of PUD-PT thinks that we have 92 compound:prt (compound particles). these are all reflexive "se" particles as in:

newdoc id = n01013
sent_id = n01013005
text = O Sr. Osborne inscreveu-se na agência americana de oradores depois de ter sido despedido em julho.
text_en = Mr Osborne signed up with a US speakers agency after being sacked in July.

"sign up with" gets a reflexive pronoun in PT as in "inscrever-se na(o)". it seems a bit of a non-necessary reflexivity of the verb. Someone might sign themselves or someone else into the US speakers agency. Do you think this is a good way of separating some clitics? and then we don't call the "se" a particle, but a pronoun, as we know that Mr Osborne signed himself up to the US speakers agency, so "se" is a direct object of "sign".

BUT in any case, I believe these are the only real compounds that we have in Portuguese, vestigial reflexive pronouns.

dan-zeman commented 4 years ago

Ad reflexive se: In cases where they are not true reflexives (in which case they would be obj or iobj), they have to be attached using a subtype of expl according to the UD guidelines (here and here). Although compound looks like a possibility at first glance, it was rejected (as argued in Natalia Silveira's PhD thesis, page 136).

I have actually noticed the compound:prt in Portuguese PUD and I fixed them a couple of days ago, so they are now expl:pv (93 occurrences).
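
For illustration only, a relabeling of this kind could be sketched as below. This is not the change that was actually committed: the function name is hypothetical, the test on the lowercased form `se` is an assumption about how the clitic is tokenized, and the sketch relabels every match without checking whether a given `se` is a true reflexive (which would need obj or iobj instead).

```python
def relabel_reflexive_se(line):
    """Change DEPREL compound:prt to expl:pv on tokens whose form is 'se'."""
    cols = line.split("\t")
    # Leave comments, blank lines, multiword ranges and empty nodes untouched.
    if len(cols) != 10 or not cols[0].isdigit():
        return line
    if cols[7] == "compound:prt" and cols[1].lower() == "se":
        cols[7] = "expl:pv"
    return "\t".join(cols)


token = "\t".join(["5", "se", "se", "PRON", "_", "_", "4", "compound:prt", "_", "_"])
print(relabel_reflexive_se(token).split("\t")[7])  # prints: expl:pv
```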

leoalenc commented 4 years ago

> > but c'mon no gym instructor in my world says things like 'Levantou' and then it would be "stood up", as 'stand up' is totally present.
>
> I agree with @leoalenc. I hear people saying it. In the written form may be difficult to understand the use, but people do use it.

@arademaker @vcvpaiva this usage of the past tense occurs quite frequently in Brazilian Portuguese. Gym instructors use it all the time. I use it myself! it's quite natural to me in some situations. 😀 I said this to my daughter yesterday:

Parou! Parou! (Stop! Stop!)

Because she was driving the scooter to the street where a car was passing. I was not aware of this in my own usage; I only remembered the gym instructors. What I meant by referring to Paul Grice was that his article is the key to understanding these usages. I also recommend this book based on his ideas: Pragmatics, by Stephen C. Levinson, Cambridge Textbooks in Linguistics https://www.amazon.com/-/pt/dp/B00IE6MOZG

leoalenc commented 4 years ago

> @leoalenc a bit of help on something else, please, if I may. we have all these issues with clitics in PT. the processing of PUD-PT thinks that we have 92 compound:prt (compound particles). these are all reflexive "se" particles as in:
>
> newdoc id = n01013
> sent_id = n01013005
> text = O Sr. Osborne inscreveu-se na agência americana de oradores depois de ter sido despedido em julho.
> text_en = Mr Osborne signed up with a US speakers agency after being sacked in July.
>
> "sign up with" gets a reflexive pronoun in PT as in "inscrever-se na(o)". it seems a bit of a non-necessary reflexivity of the verb. Someone might sign themselves or someone else into the US speakers agency. Do you think this is a good way of separating some clitics?
>
> and then we don't call the "se" a particle, but a pronoun, as we know that Mr Osborne signed himself up to the US speakers agency, so "se" is a direct object of "sign".
>
> BUT in any case, I believe these are the only real compounds that we have in Portuguese, vestigial reflexive pronouns.

@vcvpaiva I agree with you in the case of "inscrever-se": the reflexive clitic is the direct object of the verb. There are cases, however, where the reflexive has been analyzed as a passive particle or as a sign of middle voice or inchoative/unaccusative/ergative usage:

  1. Reformaram-se as escolas.
  2. Reformou-se a escola.
  3. O vaso quebrou-se.

Gramática da língua portuguesa, Maria Elena Mira Mateus et al.

vcvpaiva commented 4 years ago

thanks! At least in this corpus all

92 compound:prt (compound particles)

are what we would call clitic pronouns "se", I think. it is difficult to decide whether the lexical resource should list them with or without the "se", and it's difficult to decide if the "se" comes before or after the main verb, but I think we don't call them "particles", but pronouns. is this correct?

I think perhaps only in horrible things like "Vendem-se casas", where "se" is used simply as a way of hiding the subject, maybe one could use 'particle' for that?

dan-zeman commented 4 years ago

I would be in favor of using always the PRON tag (with the feature Reflex=Yes), regardless of the fact that the word no longer functions as a true pronoun. The dependency relation label explains the function.
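
As a small illustration of this tagging convention, adding `Reflex=Yes` to an existing FEATS value could look like the sketch below. The helper `set_feature` is hypothetical, and the input feature string in the example is invented rather than taken from the treebank. The sketch keeps features sorted alphabetically without regard to case, which is how UD expects them to be ordered.

```python
def set_feature(feats, name, value):
    """Add or overwrite one feature and return a sorted FEATS string."""
    pairs = {}
    if feats != "_":
        pairs = dict(pair.split("=", 1) for pair in feats.split("|"))
    pairs[name] = value
    # Features are ordered alphabetically, ignoring case.
    return "|".join(f"{key}={pairs[key]}" for key in sorted(pairs, key=str.lower))


print(set_feature("Case=Acc|Person=3|PronType=Prs", "Reflex", "Yes"))
# prints: Case=Acc|Person=3|PronType=Prs|Reflex=Yes
```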

vcvpaiva commented 4 years ago

hey @dan-zeman did you notice the non-existing subordinate conjunctions in PUD-Spanish too? and the very small number of auxiliaries there? one third of the ones in Portuguese/French, seems wrong to me. thanks!

arademaker commented 4 years ago

Commit 4e95fe9 adds more lemmas. Current count, by UPOSTAG, of tokens still missing lemmas:

% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/ {print $4}' pt_pud-ud-test.conllu | sort | uniq -c | sort -nr
2345 ADP
2069 DET
1076 VERB
 902 PRON
 842 AUX
 578 CCONJ
 471 NUM
 329 NOUN
 227 SCONJ
  93 ADJ
  40 SYM
  14 ADV
   7 X
   1 INTJ

Note that many VERBS with incomplete analysis could not be matched to a MorphoBr entry.

Total:

% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/' pt_pud-ud-test.conllu | wc -l
    8994

dan-zeman commented 4 years ago

> did you notice the non-existing subordinate conjunctions in PUD-Spanish too

@vcvpaiva : I will look into it. Thanks for the report!

arademaker commented 4 years ago

Once some lemmas of AUX were introduced in c6ff22b, many Morpho and Syntax errors appeared.

Commit 8f686ac solves one such case.

dan-zeman commented 4 years ago

Commits 36bceb4da9412c7aed3c60d5ac2b3f29db217296 to de34317855810f285bf1081ba7b3d969493a30b4 solve the rest of the newly discovered errors. All of them were pseudo-copular verbs that should not be treated as copulas in UD.

arademaker commented 4 years ago

Hi @dan-zeman, can you share the code/query/rule you used to fix them? In my first fix I had an AUX linked to its HEAD as cop and needed to change it to VERB, change its HEAD, and change the deprel to acl. I wonder whether the remaining cases are easier and how you dealt with them.

arademaker commented 4 years ago

Oops! I have now read your comment carefully:

> All of them were pseudo-copular verbs that should not be treated as copulas in UD.

So the question is how you determined the new HEAD for the token.

dan-zeman commented 4 years ago

I wrote a new block for Udapi (see here). Then I called it with the lemma of the pseudocopula:

cat backup.conllu | udapy -s ud.FixPseudoCop lemma="tornar" > pt_pud-ud-test.conllu

The new parent is its original grandparent, while its original parent goes down as a secondary predicate. It is not always clear to the script which children should stay with the secondary predicate and which should be re-attached to the pseudocopula because they modify the clause, so the result may not always be accurate in this respect.
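
The re-attachment described above can be sketched in plain Python, without Udapi. This is not the code of `ud.FixPseudoCop`: the tree is a toy dictionary, the function name is invented, the `xcomp` label for the demoted predicate and the list of clause-level relations are assumptions made for the example, and the sentence is a made-up one.

```python
# Relations assumed to modify the clause rather than the predicate word.
# Illustrative guess only; not the list used by the real Udapi block.
CLAUSE_RELS = {"nsubj", "csubj", "obl", "advmod", "advcl", "aux", "mark", "cc", "punct"}


def promote_pseudocopula(nodes, cop_id):
    """Make the pseudocopula the head; its old parent becomes its dependent."""
    cop = nodes[cop_id]
    pred_id = cop["head"]
    pred = nodes[pred_id]
    # The pseudocopula takes over the old parent's attachment (its grandparent).
    cop["head"], cop["deprel"] = pred["head"], pred["deprel"]
    cop["upos"] = "VERB"  # it was tagged AUX while treated as a copula
    # The old parent goes down as a secondary predicate (label assumed: xcomp).
    pred["head"], pred["deprel"] = cop_id, "xcomp"
    # Clause-level children of the old parent move up to the pseudocopula.
    for node_id, node in nodes.items():
        if node_id == cop_id or node["head"] != pred_id:
            continue
        if node["deprel"].split(":")[0] in CLAUSE_RELS:
            node["head"] = cop_id
    return nodes


# Toy sentence "Ela ficou famosa", with the verb wrongly attached as cop.
sentence = {
    1: {"form": "Ela", "upos": "PRON", "head": 3, "deprel": "nsubj"},
    2: {"form": "ficou", "upos": "AUX", "head": 3, "deprel": "cop"},
    3: {"form": "famosa", "upos": "ADJ", "head": 0, "deprel": "root"},
}
promote_pseudocopula(sentence, 2)
```

After the call, `ficou` is the root and tagged VERB, `famosa` hangs from it as the secondary predicate, and the subject `Ela` has moved up to the verb. Children with relations outside `CLAUSE_RELS` stay where they were, which is where the inaccuracy mentioned above can creep in.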