IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
260 stars, 62 forks

Closes #41 | Create dataset loader for Identic #255

Closed VanillaMacchiato closed 2 years ago

VanillaMacchiato commented 2 years ago

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.


christianwbsn commented 2 years ago

/test dataset=identic

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3042564126

VanillaMacchiato commented 2 years ago

Hi @holylovenia, I've completed the requested changes. Thank you!

SamuelCahyawijaya commented 2 years ago

@VanillaMacchiato, thanks for contributing! We originally expected only the MT dataset from IDENTIC, but it seems we can get more out of it. Approving this PR!

VanillaMacchiato commented 2 years ago

Hi @SamuelCahyawijaya, it is true that some POS tags are not properly labeled in the source. I've extracted every tag that occurs and put them into the TAGSETS variable. For instance, ^ke+dua appears as a tag on line 436032 of the id.npp.conll file:

15 kedua ^ke+dua ^ke+dua ^ke+dua |||||1|15 0 _ _ _.

Here is one of the lines that has a proper tag (the tag is R--):

2 untuk untuk untuk<r>_R-- R-- r|R|-|-|untuk|0|- 0 _ _ _ .
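To make the difference between the two sample lines concrete, here is a minimal sketch of how the tags could be collected and split into well-formed and malformed ones. It is not the code from this PR. It assumes the POS tag is the fifth whitespace-separated field (inferred only from the two lines quoted above) and that a well-formed tag is three characters long, like R--. The names `TAG_COLUMN`, `WELL_FORMED`, and `collect_tags` are hypothetical.

```python
import re
from collections import Counter

# Assumption: the POS tag is the fifth whitespace-separated field (index 4),
# inferred from the two sample lines quoted in this thread.
TAG_COLUMN = 4

# Assumption: a well-formed tag has exactly three characters, like "R--".
WELL_FORMED = re.compile(r"^[A-Z][A-Z0-9-]{2}$")


def collect_tags(lines):
    """Count well-formed and malformed tags in separate counters."""
    good, bad = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= TAG_COLUMN:
            continue  # blank line or a line with too few columns
        tag = fields[TAG_COLUMN]
        if WELL_FORMED.match(tag):
            good[tag] += 1
        else:
            bad[tag] += 1
    return good, bad


# The two sample lines from this comment (trailing punctuation dropped).
sample = [
    "15 kedua ^ke+dua ^ke+dua ^ke+dua |||||1|15 0 _ _ _",
    "2 untuk untuk untuk<r>_R-- R-- r|R|-|-|untuk|0|- 0 _ _ _",
]
good, bad = collect_tags(sample)
print(sorted(good))  # well-formed tags
print(sorted(bad))   # malformed tags
```

Counting occurrences rather than just collecting a set makes it easier to see which malformed tags are frequent enough to be worth fixing first.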

The solution that comes to mind is to work out the meaning of every malformed tag by looking at the corresponding samples and then map each one to the most likely correct tag, which I'm working on. Does that sound suitable?

Thanks!
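The mapping approach proposed above could look roughly like the following sketch. It is an illustration under stated assumptions, not the dataloader's actual code: `normalize_tag`, `unmapped_tags`, the `fixes` table, and the `"UNK"` fallback are all hypothetical, and the replacement value used for ^ke+dua is a placeholder, since the correct tag would have to be decided by inspecting the samples.

```python
def normalize_tag(tag, valid_tags, fixes, fallback="UNK"):
    """Return the tag unchanged if valid, else its manual fix, else a fallback."""
    if tag in valid_tags:
        return tag
    return fixes.get(tag, fallback)


def unmapped_tags(tags, valid_tags, fixes):
    """List malformed tags that still have no entry in the manual fix table."""
    return sorted({t for t in tags if t not in valid_tags and t not in fixes})


# Hypothetical inputs: one valid tag from the thread and one malformed tag.
valid_tags = {"R--"}
# Placeholder target only; the real replacement must come from inspecting samples.
fixes = {"^ke+dua": "PLACEHOLDER"}

seen = ["R--", "^ke+dua", "^se+buah"]  # "^se+buah" is an invented example
print([normalize_tag(t, valid_tags, fixes) for t in seen])
print(unmapped_tags(seen, valid_tags, fixes))
```

Keeping the list of unmapped tags separate makes it easy to review what is still falling through to the fallback before the loader is finalized.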

SamuelCahyawijaya commented 2 years ago

Hi @VanillaMacchiato, is there any update on the dataset?

VanillaMacchiato commented 2 years ago

Hi @SamuelCahyawijaya, sorry for the late follow-up! I've updated the dataloader as requested. Thanks!

holylovenia commented 2 years ago

/test dataset=identic

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167594641

holylovenia commented 2 years ago

/test dataset=identic subset_id=identic_noclitic

github-actions[bot] commented 2 years ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167628441

VanillaMacchiato commented 2 years ago

Okay, removed it! Sorry for the force push; it was due to a typo in the commit message.