UniversalDependencies / UD_Portuguese-PUD

Parallel Universal Dependencies.

PRON or DET lacks the PronType feature #51

Open wellington36 opened 2 years ago

wellington36 commented 2 years ago

We have many DET and PRON tokens without the PronType feature (initially 4217 cases). The check is documented at https://universaldependencies.org/svalidation.html#pron-or-det-lacks-the-prontype-feature, and the cases were found with this command:

cat pt_pud-ud-test.conllu | udapy -TMA ud.MarkBugs tests='no-PronType' | less -R
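
For readers without udapy installed, the idea behind the check can be approximated in plain Python. This is only a sketch of the rule as described above (a DET or PRON token whose FEATS column has no PronType), not the actual ud.MarkBugs implementation; the function name and the sample sentences are made up for illustration.

```python
def count_missing_prontype(conllu_text):
    """Count DET/PRON tokens whose FEATS column has no PronType feature."""
    missing = 0
    for line in conllu_text.splitlines():
        # Skip comment lines and the blank lines that separate sentences.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # a CoNLL-U token line has exactly 10 columns
            continue
        upos, feats = cols[3], cols[5]
        if upos in ("DET", "PRON"):
            # FEATS is "_" when empty, otherwise "Name=Value|Name=Value".
            names = [] if feats == "_" else [f.split("=")[0] for f in feats.split("|")]
            if "PronType" not in names:
                missing += 1
    return missing


# Hypothetical two-sentence sample: "O" lacks PronType, "Este" has it.
sample = (
    "# text = O gato dorme.\n"
    "1\tO\to\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgato\tgato\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# text = Este gato dorme.\n"
    "1\tEste\teste\tDET\t_\tPronType=Dem\t2\tdet\t_\t_\n"
    "2\tgato\tgato\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_\n"
)

print(count_missing_prontype(sample))  # prints 1
```

Multiword-token lines have "_" in the UPOS column, so they are skipped without special handling.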

vcvpaiva commented 2 years ago

Great to see that you're investigating PUD now! Is this the first test you've run on PUD?

wellington36 commented 2 years ago

Great to see that you're investigating PUD now! Is this the first test you've run on PUD?

I ran cat pt_pud-ud-test.conllu | udapy -TMA ud.MarkBugs | less -R, which returned several errors. This one stands out because most of the cases are definite determiners and can be fixed by a script.

wellington36 commented 2 years ago

@arademaker, I can fix most of the 4000 cases today via script, but this repo doesn't have the workbench branch. Is it possible to create it?

arademaker commented 2 years ago

Make a PR to DEV from a branch (named after an issue)

wellington36 commented 2 years ago

The code mentioned is this:

from conllu import parse_incr

list_def_det = ['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os']

# Read the original file and write the result to a separate file:
# writing to the file that is still being read would corrupt it.
with open("pt_pud-ud-test.conllu", "r", encoding="utf-8") as data_file, \
     open("test.conllu", "w", encoding="utf-8") as f:
    for token_list in parse_incr(data_file):
        for token in token_list:
            # Definite-article form, tagged DET, attached as det to a NOUN or PROPN head.
            if (token['form'] in list_def_det and
                token['upos'] == 'DET' and
                token['deprel'] == 'det' and
                any(i['upos'] in ['NOUN', 'PROPN'] for i in token_list if i['id'] == token['head'])):

                # feats can be None when the FEATS column is empty ("_").
                if token['feats'] is None:
                    token['feats'] = {}
                token['feats']['PronType'] = 'Art'
        f.write(token_list.serialize())

It's important to discuss changes here. The script now adds PronType=Art to all determiners with form in ['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os']. There are 2820 of them and 7 already have the feature, so 2813 change.

vcvpaiva commented 2 years ago

it adds PronType=Art to all determinants with form=['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os'] which total 2820,

Guys, it's determiners, not determinants. Matrices have determinants; grammars have determiners, I think. Anyway, very thankful for the improvements.

wellington36 commented 2 years ago

@arademaker, I updated the code. It now corrects these cases:

$ cat pt_pud-ud-test.conllu | udapy -q util.Eval node='if (node.form in ["a", "as", "o", "os", "A", "As", "O", "Os"] and node.upos == "DET" and node.deprel == "det" and node.parent.upos in ["NOUN", "PROPN"]): print(node)' | wc -l

2727

The condition token['deprel'] == 'det' checks the token's relation to its parent, and any([i['upos'] in ['NOUN', 'PROPN'] for i in token_list if i['id'] == token['head']]) checks that the parent is a NOUN or PROPN.
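
The same four conditions can be tried out on a toy sentence without the conllu library. The sketch below uses plain dicts with the field names from the script (id, form, upos, head, deprel); the helper name and the example sentence are invented for illustration, and looking the parent up in a dict keyed by id is a substitute for the any(...) scan, not what the script does.

```python
DEFINITE_ARTICLE_FORMS = {"a", "as", "o", "os", "A", "As", "O", "Os"}


def is_article_candidate(token, sentence):
    """Return True if the token meets all four conditions of the fix."""
    # Index tokens by id so the parent can be found directly.
    by_id = {t["id"]: t for t in sentence}
    parent = by_id.get(token["head"])  # None for the root (head == 0)
    return (
        token["form"] in DEFINITE_ARTICLE_FORMS
        and token["upos"] == "DET"
        and token["deprel"] == "det"
        and parent is not None
        and parent["upos"] in ("NOUN", "PROPN")
    )


# "O gato foi a Lisboa": the first "O" is an article, the "a" is a preposition.
sentence = [
    {"id": 1, "form": "O", "upos": "DET", "head": 2, "deprel": "det"},
    {"id": 2, "form": "gato", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
    {"id": 3, "form": "foi", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 4, "form": "a", "upos": "ADP", "head": 5, "deprel": "case"},
    {"id": 5, "form": "Lisboa", "upos": "PROPN", "head": 3, "deprel": "obl"},
]

candidates = [t["id"] for t in sentence if is_article_candidate(t, sentence)]
print(candidates)  # prints [1]
```

The form list alone would also match the preposition "a"; the upos and deprel conditions are what keep it out.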

arademaker commented 2 years ago

Please update the last comment with the code you used. Does the last PR close this issue?

wellington36 commented 2 years ago

Please update the last comment with the code you used. Does the last PR close this issue?

I updated the code. We now have 1491 cases:

$ cat pt_pud-ud-test.conllu | udapy -TMA ud.MarkBugs tests='no-PronType' | less -R

        no-PronType       1491
              TOTAL       1491

alvelvis commented 11 months ago

I found these DET lemmas without PronType, which I'm fixing as follows:

lemma    count   solution
um       398     => PronType=Art|Definite=Ind
_        261     => it's another issue, words without lemmas, already noticed in #19
o        93      => Definite=Def|PronType=Art
este     39      => PronType=Dem
aquele   7       => PronType=Dem
esse     4       => PronType=Dem
outro    1       => PronType=Ind
seu      1       => Poss=Yes|PronType=Prs

Since it's a big change, I would love it if someone could review PR #53.
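
For reviewers, the table above can be read as a lemma-to-features mapping merged into the FEATS column. The following is a minimal sketch of that reading, not the code used in the PR: the names LEMMA_FEATS and add_feats are hypothetical, existing values are left untouched, and the output is sorted case-insensitively by feature name, which is the order UD expects as far as I know. The caller is assumed to apply it only to DET tokens that lack PronType.

```python
# Lemma -> features to add, copied from the table (the "_" row is a separate issue).
LEMMA_FEATS = {
    "um": {"PronType": "Art", "Definite": "Ind"},
    "o": {"PronType": "Art", "Definite": "Def"},
    "este": {"PronType": "Dem"},
    "aquele": {"PronType": "Dem"},
    "esse": {"PronType": "Dem"},
    "outro": {"PronType": "Ind"},
    "seu": {"PronType": "Prs", "Poss": "Yes"},
}


def add_feats(feats_column, lemma):
    """Return the FEATS column with the features for `lemma` merged in."""
    new = LEMMA_FEATS.get(lemma)
    if new is None:
        return feats_column  # unknown lemma: leave the token as it is
    feats = {}
    if feats_column != "_":
        for pair in feats_column.split("|"):
            name, value = pair.split("=", 1)
            feats[name] = value
    for name, value in new.items():
        feats.setdefault(name, value)  # never overwrite an existing value
    return "|".join(f"{name}={feats[name]}" for name in sorted(feats, key=str.lower))


print(add_feats("Gender=Masc|Number=Sing", "um"))
print(add_feats("_", "seu"))
```

With this reading, the first call yields Definite=Ind|Gender=Masc|Number=Sing|PronType=Art and the second yields Poss=Yes|PronType=Prs.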

arademaker commented 11 months ago

I accepted the PR

arademaker commented 11 months ago

Tomorrow I will check if we can close this issue