Open wellington36 opened 3 years ago
Great to see that you're investigating PUD now! Is this the first test you have run on PUD?
I ran
$ cat pt_pud-ud-test.conllu | udapy -TMA ud.MarkBugs | less -R
which returned some errors. For this one in particular, most of the cases are definite determiners and can be fixed by a script.
@arademaker, I can fix most of the 4000 cases today via script, but this repo doesn't have the workbench branch. Is it possible to create one?
Make a PR to DEV from a branch (named after an issue)
The code mentioned is this:
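from conllu import parse_incr

# Definite article forms (both cases) that should get PronType=Art.
list_def_det = ['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os']

# Read from the treebank and write to a separate file: writing back to the
# file that is still being read would corrupt it.
with open("pt_pud-ud-test.conllu", "r", encoding="utf-8") as data_file, \
        open("test.conllu", "w", encoding="utf-8") as f:
    for token_list in parse_incr(data_file):
        for token in token_list:
            if (token['form'] in list_def_det and
                    token['upos'] == 'DET' and
                    token['deprel'] == 'det' and
                    # The head must be a NOUN or PROPN.
                    any(i['upos'] in ['NOUN', 'PROPN']
                        for i in token_list if i['id'] == token['head'])):
                # FEATS is None when the column is "_".
                if token['feats'] is None:
                    token['feats'] = {}
                token['feats']['PronType'] = 'Art'
        # serialize() already ends each sentence with a blank line.
        f.write(token_list.serialize())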
It's important to discuss the changes here. The script now adds PronType=Art to all determiners whose form is in ['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os'], which total 2820; 7 already have it, so that makes 2813 changes.
"it adds PronType=Art to all determinants with form=['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os'] which total 2820"
Guys, it's determiners, not determinants. Matrices have determinants; grammars have determiners, I think. Anyway, very thankful for the improvements.
On Sat, Oct 30, 2021 at 1:26 PM Wellington José Leite da Silva < @.***> wrote:
The code mentioned is this:
from conllu import parse_incr
from io import open

list_def_det = ['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os']
data_file = open("pt_pud-ud-test.conllu", "r", encoding="utf-8")

with open("test.conllu", "w") as f:
    for token_list in parse_incr(data_file):
        for token in token_list:
            if (token['form'] in list_def_det and token['upos'] == 'DET'):
                token['feats']['PronType'] = 'Art'
        serialized = token_list.serialize()
        f.write(serialized)
        f.write('\n\n')
It's important to discuss changes here, now it adds PronType=Art to all determinants with form=['a', 'as', 'o', 'os', 'A', 'As', 'O', 'Os'] which total 2820, 7 already have it, so 2813 changes.
@arademaker, I updated the code. It now corrects these cases:
$ cat pt_pud-ud-test.conllu | udapy -q util.Eval node='if (node.form in ["a", "as", "o", "os", "A", "As", "O", "Os"] and node.upos == "DET" and node.deprel == "det" and node.parent.upos in ["NOUN", "PROPN"]): print(node)' | wc -l
2727
The condition token['deprel'] == 'det' checks the token's relation to its parent, and any([i['upos'] in ['NOUN', 'PROPN'] for i in token_list if i['id'] == token['head']]) checks that the parent is a NOUN or PROPN.
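The two conditions described above can be tried on a single sentence without any third-party package. This is only a sketch: the sample sentence, `parse_sentence` and `is_article_candidate` are made up for illustration and are not part of the script in this thread, and the parser assumes every HEAD column is an integer.

```python
# Sketch only: a stdlib stand-in for the conllu-based check, on a made-up sentence.
SAMPLE = """\
1\tO\to\tDET\t_\t_\t2\tdet\t_\t_
2\tgato\tgato\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_
"""

DEFINITE_FORMS = {"a", "as", "o", "os"}


def parse_sentence(block):
    """Split one CoNLL-U sentence into a list of dicts (word lines only)."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        tokens.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens


def is_article_candidate(token, by_id):
    """Apply the same conditions as the script: form, UPOS, deprel, parent UPOS."""
    parent = by_id.get(token["head"])  # None when the head is 0 (root)
    return (token["form"].lower() in DEFINITE_FORMS
            and token["upos"] == "DET"
            and token["deprel"] == "det"
            and parent is not None
            and parent["upos"] in ("NOUN", "PROPN"))


tokens = parse_sentence(SAMPLE)
by_id = {t["id"]: t for t in tokens}  # index tokens by ID for the head lookup
candidates = [t["form"] for t in tokens if is_article_candidate(t, by_id)]
print(candidates)
```

Indexing the tokens by ID once avoids rescanning the whole sentence for every token, which is what the any(...) expression does.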
Please update the last comment with the code you used. Does the last PR close this issue?
I updated the code. Now we have 1491 cases:
$ cat pt_pud-ud-test.conllu | udapy -TMA ud.MarkBugs tests='no-PronType' | less -R
no-PronType 1491
TOTAL 1491
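For a per-lemma breakdown of the remaining cases, something like the following could be used. It is a rough stdlib approximation of the no-PronType check, not udapi's implementation; the sample lines and the function name `count_missing_prontype` are assumptions for illustration.

```python
from collections import Counter

# Made-up sample; the real input would be the text of pt_pud-ud-test.conllu.
SAMPLE = (
    "1\tUm\tum\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgato\tgato\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tviu\tver\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\to\to\tDET\t_\tDefinite=Def|PronType=Art\t5\tdet\t_\t_\n"
    "5\trato\trato\tNOUN\t_\t_\t3\tobj\t_\t_\n"
)


def count_missing_prontype(conllu_text):
    """Count DET/PRON words without a PronType feature, grouped by lemma."""
    missing = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular word lines: 10 columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemma, upos, feats = cols[2], cols[3], cols[5]
        if upos not in ("DET", "PRON"):
            continue
        names = set() if feats == "_" else {f.split("=", 1)[0] for f in feats.split("|")}
        if "PronType" not in names:
            missing[lemma] += 1
    return missing


missing = count_missing_prontype(SAMPLE)
print(missing.most_common())
```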
I found these DET lemmas without PronType, which I'm fixing as follows:
lemma | count | solution |
---|---|---|
um | 398 | => PronType=Art\|Definite=Ind |
_ | 261 | => it's another issue, words without lemmas, already noticed in #19 |
o | 93 | => Definite=Def\|PronType=Art |
este | 39 | => PronType=Dem |
aquele | 7 | => PronType=Dem |
esse | 4 | => PronType=Dem |
outro | 1 | => PronType=Ind |
seu | 1 | => Poss=Yes\|PronType=Prs |
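The table above can be read as a lemma-to-features mapping. Here is a minimal sketch of how such a mapping might be applied to a FEATS string; `FIXES` and `fix_feats` are hypothetical names, not the code from the PR, and the caller is assumed to have already restricted the tokens to DET.

```python
# Hypothetical mapping built from the table; the "_" lemma row is left out
# on purpose because it is tracked as a separate issue.
FIXES = {
    "um": {"Definite": "Ind", "PronType": "Art"},
    "o": {"Definite": "Def", "PronType": "Art"},
    "este": {"PronType": "Dem"},
    "aquele": {"PronType": "Dem"},
    "esse": {"PronType": "Dem"},
    "outro": {"PronType": "Ind"},
    "seu": {"Poss": "Yes", "PronType": "Prs"},
}


def fix_feats(lemma, feats):
    """Return the FEATS string with the table's features merged in."""
    current = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    # Leave the token alone if it already has PronType or the lemma is unknown.
    if "PronType" in current or lemma not in FIXES:
        return feats
    current.update(FIXES[lemma])
    # UD expects features sorted alphabetically (case-insensitive) by name.
    ordered = sorted(current.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


print(fix_feats("um", "Gender=Masc|Number=Sing"))
```

Tokens that already carry PronType are returned unchanged, so running the function twice on the same file should not alter it further.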
Since it's a big change, I would love it if someone could review PR #53.
I accepted the PR
Tomorrow I will check if we can close this issue
We have many DET and PRON cases without the PronType feature (initially 4217 cases), documented in https://universaldependencies.org/svalidation.html#pron-or-det-lacks-the-prontype-feature, found by command: