Open arademaker opened 4 years ago
commit abaac2b update the stats.xml
commit f2b3d5a introduced lemmas using https://github.com/LFG-PTBR/MorphoBr. Commit 04988d1 adds lemmas for PROPN and PUNCT.
the current status is:
tokens with lemma:
% awk '$0 ~ /^[0-9]/ && $3 !~ /_/' pt_pud-ud-test.conllu | wc -l
13939
tokens still missing lemmas:
% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/' pt_pud-ud-test.conllu | wc -l
9431
% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/ {print $4}' pt_pud-ud-test.conllu | sort | uniq -c | sort -nr
2346 ADP
2070 DET
1496 VERB
908 PRON
844 AUX
578 CCONJ
471 NUM
334 NOUN
227 SCONJ
94 ADJ
40 SYM
14 ADV
8 X
1 INTJ
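For readers who prefer not to rely on awk's whitespace splitting, the per-UPOS count above can be sketched in Python. This is only an illustration, not the command used in this issue: the function name `missing_lemma_counts` and the sample sentence are made up, and it tests for a LEMMA that is exactly `_` on tab-separated columns, which is slightly stricter than the `$3 ~ /_/` pattern.

```python
from collections import Counter

def missing_lemma_counts(conllu_text):
    """Count tokens whose LEMMA column is '_' and group them by UPOS."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only plain word lines: an integer ID. This skips comments,
        # multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) < 4 or not cols[0].isdigit():
            continue
        if cols[2] == "_":
            counts[cols[3]] += 1
    return counts

# Hypothetical, truncated sample (only the first four columns are given).
sample = "\n".join([
    "# text = Ele tem casas.",
    "1\tEle\t_\tPRON",
    "2\ttem\tter\tVERB",
    "3\tcasas\t_\tNOUN",
    "4\t.\t.\tPUNCT",
])

for upos, n in missing_lemma_counts(sample).most_common():
    print(n, upos)
```

On the real file one would pass the contents of pt_pud-ud-test.conllu instead of `sample`.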
Many of the verbs have errors in their features.
@arademaker
they are all cases where it should have Person=2
Are you sure? person=2? (tu, vos??) it doesn't make sense to me!
As you know, I'm no speaker of Portuguese, but in the conjugation table you sent a link to, tem is listed as 3rd person singular present indicative, so the features Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres seem correct to me.
In the case of tem, this form can be 3rd person singular simple present OR 2nd person singular imperative. All 31 occurrences of tem in this corpus have feats Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres
What caught my attention was the Aspect=Imp. In the example below we can change it to Aspect=Hab or remove it:
Ao longo da história, o mercado internacional de cabelo tem tido sempre uma dimensão política, diz Tarlo.
Previously, I made a mistake and mixed up Aspect=Imp with Mood=Imp.
OK. You don't seem to be using the Aspect feature in Bosque, so unless you think it will be useful to add it there too, removal is probably better.
I am consulting @leoalenc about this. From the Wikipedia page, it seems we don't have aspects in Portuguese. Yes, we don't have them in Portuguese-Bosque nor in the Portuguese-GSD.
Alexandre, 2nd person singular imperative should be 'Tens' as in 'Tens que ir agora!' The 2nd person and the 3rd person are different. We use the third person as if it were the 2nd, but the grammar is always different, I think. Checking with Leonel is good, as he's the linguist.
Not according to the https://www.conjugacao.com.br/verbo-ter/ and MorphoBr:
35123:tens ter+V+PRS+2+SG
30781:tem ter+V+IMP+2+SG
30782:tem ter+V+PRS+3+SG
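To make the ambiguity of "tem" explicit, here is a minimal sketch that parses entries in the shape quoted above and groups the analyses by surface form. The `form lemma+TAG+...` layout is taken from these three lines only. The function name `parse_entry` is hypothetical, and the leading number is assumed to be a line-number prefix that can be dropped.

```python
from collections import defaultdict

def parse_entry(line):
    """Split 'tem ter+V+IMP+2+SG' into (form, lemma, tags)."""
    # Assumption: a leading 'NNN:' is a line-number prefix, not part of the entry.
    prefix, sep, rest = line.partition(":")
    if sep and prefix.isdigit():
        line = rest
    form, analysis = line.split()
    lemma, *tags = analysis.split("+")
    return form, lemma, tuple(tags)

entries = [
    "35123:tens ter+V+PRS+2+SG",
    "30781:tem ter+V+IMP+2+SG",
    "30782:tem ter+V+PRS+3+SG",
]

# Group every analysis under its surface form.
analyses = defaultdict(list)
for entry in entries:
    form, lemma, tags = parse_entry(entry)
    analyses[form].append((lemma, tags))

for form, options in analyses.items():
    print(form, options)
```

A form with more than one analysis, such as "tem" here, cannot be disambiguated from the lexicon alone, so the features in the treebank have to be decided in context.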
ok, my bad, then. this language of ours is totally crazy! positive and negative imperatives are different? oh dear
@arademaker and @vcvpaiva , the present tense as well as other tenses are usually used to make commands, as in the example cited by @vcvpaiva. In this case, the verb "ter" (have) in the second person singular present indicative ("tens") is used as a modal verb. The example is interpreted as a command because the modal verb expresses necessity. Personal trainers and gym instructors tend to use the past tense to make instructions to their clients:
Agora levantou! (Stand up now!)
This does not mean, however, that the form in question is imperative. Form and usage (function) shouldn't be mixed up. Imperative, like other grammatical categories, is just a label for a group of forms, not a description of language usage. For example, the present tense is often used to express actions in the past in historical narratives. It can also refer to future events:
Amanhã eu faço isso. (I'll do it tomorrow.)
It seems that all languages have similar form and function mismatches, e.g., infinitive forms as commands in German. I strongly recommend that you logicians and mathematicians read Paul Grice's seminal paper "Logic and conversation".
@leoalenc perfectly happy to believe you about the imperative of 'ter' in PT. still shocked about positive and negative imperative having different forms though!
but c'mon no gym instructor in my world says things like 'Levantou' and then it would be "stood up", as 'stand up' is totally present. meanwhile using the present tense to talk about the future is commonplace both in English and in Portuguese. and yes, I have read my Grice, this was not lack of Gricean-fu, but lack of grammar for imperative modes in Portuguese. Thanks!
but c'mon no gym instructor in my world says things like 'Levantou' and then it would be "stood up", as 'stand up' is totally present.
I agree with @leoalenc. I hear people saying it. In written form it may be difficult to understand this use, but people do use it.
@leoalenc a bit of help on something else, please, if I may. we have all these issues with clitics in PT. the processing of PUD-PT thinks that we have 92 compound:prt (compound particles). these are all reflexive "se" particles as in:
newdoc id = n01013
sent_id = n01013005
text = O Sr. Osborne inscreveu-se na agência americana de oradores depois de ter sido despedido em julho.
text_en = Mr Osborne signed up with a US speakers agency after being sacked in July.
"sign up with" gets a reflexive pronoun in PT as in "inscrever-se na(o)". it seems a bit of a non-necessary reflexivity of the verb. Someone might sign themselves or someone else into the US speakers agency. Do you think this is a good way of separating some clitics? and then we don't call the "se" a particle, but a pronoun, as we know that Mr Osborne signed himself up to the US speakers agency, so "se" is a direct object of "sign".
BUT in any case, I believe these are the only real compounds that we have in Portuguese, vestigial reflexive pronouns.
Ad reflexive se: In cases where they are not true reflexives (in which case they would be obj or iobj) they have to be attached using a subtype of expl according to the UD guidelines (here and here). Although compound looks like a possibility at first glance, it was rejected (as argued in Natalia Silveira's PhD thesis, page 136).
I have actually noticed the compound:prt in Portuguese PUD and I fixed them a couple of days ago, so they are now expl:pv (93 occurrences).
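A relabeling of this kind can be sketched as a small text filter. This is not the fix that was actually applied: the function name `relabel_se` is hypothetical, the sample tokens are invented, and the filter blindly assumes that every "se" attached as compound:prt is a non-reflexive use, so true reflexives (obj or iobj) would still need manual review.

```python
def relabel_se(conllu_text, old="compound:prt", new="expl:pv"):
    """Change the DEPREL of clitic 'se' tokens from `old` to `new`."""
    out = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        is_word = len(cols) == 10 and cols[0].isdigit()
        # Assumption: the clitic may be tokenized as 'se' or '-se'.
        if is_word and cols[1].lower().lstrip("-") == "se" and cols[7] == old:
            cols[7] = new
        out.append("\t".join(cols))
    return "\n".join(out)

# Hypothetical two-token fragment with the ten CoNLL-U columns.
sample = "\n".join([
    "# text = inscreveu-se",
    "1\tinscreveu\tinscrever\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tse\tse\tPRON\t_\t_\t1\tcompound:prt\t_\t_",
])

fixed = relabel_se(sample)
print(fixed)
```

Comment lines and all other tokens pass through unchanged, so the output can be diffed against the input to check each relabeled token.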
@arademaker @vcvpaiva this usage of the past tense occurs quite frequently in Brazilian Portuguese. Gym instructors use it all the time. I use it myself! it's quite natural to me in some situations. 😀 I said this to my daughter yesterday:
Parou! Parou! (Stop! Stop!)
Because she was driving the scooter into the street where a car was passing. I was not aware of this in my own usage. I only remembered the gym instructors. What I meant by referring to Paul Grice was that his article is the key to understanding these usages. I also recommend this book based on his ideas: Pragmatics, by Stephen C. Levinson, Cambridge Textbooks in Linguistics https://www.amazon.com/-/pt/dp/B00IE6MOZG
@vcvpaiva I agree with you in case of "inscrever-se": The reflexive clitic is direct object of the verb. there are cases, however, where the reflexive has been analyzed as a passive particle or as a sign of middle voice or inchoative/unaccusative/ergative usage:
1. Reformaram-se as escolas.
thanks! At least in this corpus all 92 compound:prt (compound particles) are what we would call clitic pronouns "se", I think. it is difficult to decide whether the lexical resource should list them with or without the "se", and it's difficult to decide if the "se" comes before or after the main verb, but I think we don't call them "particles", but pronouns. is this correct?
I think perhaps only in horrible things like "Vendem-se casas" where "se" is used simply as a way of making the subject occult, maybe one could use 'particle' for that?
I would be in favor of always using the PRON tag (with the feature Reflex=Yes), regardless of the fact that the word no longer functions as a true pronoun. The dependency relation label explains the function.
hey @dan-zeman did you notice the non-existing subordinate conjunctions in PUD-Spanish too? and the very small number of auxiliaries there? one third of the ones in Portuguese/French, seems wrong to me. thanks!
commit 4e95fe9 adds more lemmas. Current count by UPOSTAG of tokens missing lemmas:
% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/ {print $4}' pt_pud-ud-test.conllu | sort | uniq -c | sort -nr
2345 ADP
2069 DET
1076 VERB
902 PRON
842 AUX
578 CCONJ
471 NUM
329 NOUN
227 SCONJ
93 ADJ
40 SYM
14 ADV
7 X
1 INTJ
Note that many VERBS with incomplete analysis could not be matched to a MorphoBr entry.
Total of
% awk '$1 ~ /^[0-9]+$/ && $3 ~ /_/' pt_pud-ud-test.conllu | wc -l
8994
did you notice the non-existing subordinate conjunctions in PUD-Spanish too
@vcvpaiva : I will look into it. Thanks for the report!
Once some lemmas of AUX were introduced in c6ff22b, many Morpho and Syntax errors appeared.
Commit 8f686ac solves one such case.
Commits 36bceb4da9412c7aed3c60d5ac2b3f29db217296 to de34317855810f285bf1081ba7b3d969493a30b4 solve the rest of the newly discovered errors. All of them were pseudo-copular verbs that should not be treated as copulas in UD.
Hi @dan-zeman can you share the code/query/rule you used to fix them? In my first fix I had an AUX linked to its HEAD as cop and needed to change it to VERB, change its HEAD, and change the deprel to acl. I wonder if the remaining cases are easier and how you dealt with them.
Oops! I have now read your comment carefully:
All of them were pseudo-copular verbs that should not be treated as copulas in UD.
So the question is how did you determine the new HEAD for the token?
I wrote a new block for Udapi (see here). Then I called it with the lemma of the pseudocopula:
cat backup.conllu | udapy -s ud.FixPseudoCop lemma="tornar" > pt_pud-ud-test.conllu
The new parent is its original grandparent, while its original parent goes down as a secondary predicate. It is not always clear to the script which children should stay with the secondary predicate and which should be re-attached to the pseudocopula because they modify the clause. So the result may not always be accurate in this respect.
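The re-attachment described here can be illustrated on a toy tree. This is a sketch of the idea only, not the ud.FixPseudoCop block and not the Udapi API: nodes are plain dictionaries, the function name `promote_pseudocopula` is hypothetical, and the use of xcomp for the secondary predicate and of the parent's old deprel for the promoted verb are assumptions. Children are deliberately left where they are, which is exactly the part that may need manual correction.

```python
def promote_pseudocopula(nodes, cop_id, secondary_deprel="xcomp"):
    """Re-attach a pseudocopula above its former parent (in place)."""
    cop = nodes[cop_id]
    parent_id = cop["head"]
    parent = nodes[parent_id]
    # The pseudocopula takes over its parent's attachment point:
    # its new head is the original grandparent.
    cop["head"], cop["deprel"] = parent["head"], parent["deprel"]
    cop["upos"] = "VERB"
    # The former parent goes down as a secondary predicate.
    parent["head"], parent["deprel"] = cop_id, secondary_deprel
    return nodes

# Hypothetical copula-style analysis of "Ele tornou-se médico" (clitic omitted).
nodes = {
    1: {"form": "Ele", "upos": "PRON", "head": 3, "deprel": "nsubj"},
    2: {"form": "tornou", "upos": "AUX", "head": 3, "deprel": "cop"},
    3: {"form": "médico", "upos": "NOUN", "head": 0, "deprel": "root"},
}

promote_pseudocopula(nodes, cop_id=2)
for node_id, node in nodes.items():
    print(node_id, node["form"], node["upos"], node["head"], node["deprel"])
```

In this toy case the subject "Ele" still hangs on "médico" after the change. Whether it should move up to "tornou" is the kind of decision the comment above says the script cannot always make.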
For verbs with Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres such as they are all cases where it should have Person=2 (see https://www.conjugacao.com.br/verbo-ter/)