Closed VanillaMacchiato closed 2 years ago
/test dataset=identic
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3042564126
Hi @holylovenia , I've completed the requested changes. Thank you!
@VanillaMacchiato, thanks for contributing! We'd only expect the MT dataset from IDENTIC before, but it seems we can get more out of it. Approving this PR!
Hi @SamuelCahyawijaya, it is true that some POS tags are not properly labeled in the source. I've extracted every possible tag and fed them into the TAGSETS variable. For instance, the tag ^ke+dua appears in line 436032 of the id.npp.conll file:
15 kedua ^ke+dua ^ke+dua ^ke+dua |||||1|15 0 _ _ _
For comparison, here is one of the lines that has a proper tag (the tag is R--):
2 untuk untuk untuk<r>_R-- R-- r|R|-|-|untuk|0|- 0 _ _ _
The solution that comes to mind is to work out the meaning of every bugged tag by looking at the corresponding samples and then map it to the most likely correct tag, which I'm working on now. Is that suitable?
Thanks!
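For reference, the approach described in this comment could be sketched roughly as below. This is a minimal illustration, not the code in the PR: it assumes the tag is the fifth whitespace-separated column (as in the two sample lines), that a well-formed tag has the same three-character shape as R--, and it uses "UNK" as a hypothetical placeholder for the replacement tag. The names find_malformed_tags, normalize_tag and fixes are made up for this sketch.

```python
import re
from collections import defaultdict

# Assumption: the POS tag is the fifth whitespace-separated column,
# as in the two sample lines quoted in the comment.
TAG_COLUMN = 4
# Assumption: a well-formed tag looks like "R--"
# (an uppercase letter followed by two more tag characters).
WELL_FORMED = re.compile(r"^[A-Z][A-Z0-9-]{2}$")


def find_malformed_tags(lines):
    """Return {tag: [line numbers]} for tags that do not match WELL_FORMED."""
    malformed = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        columns = line.split()
        if len(columns) <= TAG_COLUMN:
            continue  # blank line or a line without a tag column
        tag = columns[TAG_COLUMN]
        if not WELL_FORMED.match(tag):
            malformed[tag].append(line_no)
    return dict(malformed)


def normalize_tag(tag, fixes):
    """Map a malformed tag to its replacement; leave other tags unchanged."""
    return fixes.get(tag, tag)


sample = [
    "15 kedua ^ke+dua ^ke+dua ^ke+dua |||||1|15 0 _ _ _",
    "2 untuk untuk untuk<r>_R-- R-- r|R|-|-|untuk|0|- 0 _ _ _",
]
# "UNK" is a hypothetical placeholder, not a tag taken from the source tagset.
fixes = {"^ke+dua": "UNK"}

print(find_malformed_tags(sample))
print([normalize_tag(line.split()[TAG_COLUMN], fixes) for line in sample])
```

Collecting line numbers alongside each malformed tag makes it easier to look up the corresponding samples before deciding on a replacement.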
Hi @VanillaMacchiato, is there any update on the dataset?
Hi @SamuelCahyawijaya, sorry for the late follow-up! I've updated the dataloader as requested. Thanks!
/test dataset=identic
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167594641
/test dataset=identic subset_id=identic_noclitic
Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167628441
Okay, removed it! Sorry for the force push; it was due to a typo in the commit message.
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Checkbox
- Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm that the dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
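Two of the checklist items (the required module-level variables and the BUILDER_CONFIGS rule) can be checked mechanically. The sketch below is only an illustration under stated assumptions: StandInConfig is a hypothetical stand-in for NusantaraConfig, the schema strings "source" and "nusantara_t2t" and the config names are assumed for the example, and checklist_problems is not part of the repository's test suite.

```python
from dataclasses import dataclass

# Variable names taken from the checklist above.
REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_NUSANTARA_VERSION",
]


@dataclass
class StandInConfig:
    """Hypothetical stand-in for NusantaraConfig; the real class may differ."""
    name: str
    schema: str


def checklist_problems(module_vars, builder_configs):
    """Return a list of problems; an empty list means both checks pass."""
    problems = [
        f"missing variable {name}"
        for name in REQUIRED_VARIABLES
        if name not in module_vars
    ]
    schemas = [config.schema for config in builder_configs]
    # Assumption: the source schema is labelled "source" and every
    # nusantara schema label starts with "nusantara".
    if "source" not in schemas:
        problems.append("no config for the source schema")
    if not any(schema.startswith("nusantara") for schema in schemas):
        problems.append("no config for a nusantara schema")
    return problems


# Hypothetical example values for a dataset called "my_dataset".
module_vars = {name: "..." for name in REQUIRED_VARIABLES}
configs = [
    StandInConfig(name="my_dataset_source", schema="source"),
    StandInConfig(name="my_dataset_nusantara_t2t", schema="nusantara_t2t"),
]
print(checklist_problems(module_vars, configs))
```

A check like this does not replace the test command in the last checklist item; it only catches the two structural omissions before the full test run.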