[X] Confirm that this PR is linked to the dataset issue.
[X] Create the dataloader script nusantara/nusa_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[X] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _NUSANTARA_VERSION variables.
[X] Implement _info(), _split_generators(), and _generate_examples() in the dataloader script.
[X] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one NusantaraConfig for the source schema and one for a nusantara schema.
[X] Confirm that the dataloader script works with the datasets.load_dataset function.
[X] Confirm that your dataloader script passes the test suite run with python -m tests.test_nusantara --path=nusantara/nusa_datasets/my_dataset/my_dataset.py.
[X] If my dataset is local, I have provided the output of the unit tests in the PR (please copy and paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
INFO:__main__:args: Namespace(path='nusantara/nusa_datasets/kamus_alay/kamus_alay.py', schema=None, subset_id=None, data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: nusantara/nusa_datasets/kamus_alay/kamus_alay.py
INFO:__main__:self.SUBSET_ID: kamus_alay
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module nusantara.nusa_datasets.kamus_alay.kamus_alay
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.PARAPHRASING: 'PARA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'T2T'}
INFO:__main__:schemas_to_check: {'T2T'}
INFO:__main__:Checking load_dataset with config name kamus_alay_source
Downloading and preparing dataset kamus_alay/kamus_alay_source to /Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_source/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc...
Dataset kamus_alay downloaded and prepared to /Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_source/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 622.95it/s]
INFO:__main__:Checking load_dataset with config name kamus_alay_nusantara_t2t
Downloading and preparing dataset kamus_alay/kamus_alay_nusantara_t2t to /Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_nusantara_t2t/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc...
Dataset kamus_alay downloaded and prepared to /Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_nusantara_t2t/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 717.22it/s]
WARNING:datasets.builder:Reusing dataset kamus_alay (/Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_source/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 813.80it/s]
INFO:__main__:Dataset sample [source]
{'slang': 'woww', 'formal': 'wow', 'is_in_dictionary': True, 'example': 'wow'}
WARNING:datasets.builder:Reusing dataset kamus_alay (/Users/christianwbsn/.cache/huggingface/datasets/kamus_alay/kamus_alay_nusantara_t2t/1.0.0/01ed3d791d194b2fe55784159b4db8d95499123a9b073b7cf5f8ae5610bbdccc)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 858.43it/s]
INFO:__main__:Dataset sample [nusantara_t2t]
{'id': '0', 'text_1': 'woww', 'text_2': 'wow', 'text_1_name': 'slang', 'text_2_name': 'formal'}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 15006 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
id: 15006
text_1: 15006
text_2: 15006
text_1_name: 15006
text_2_name: 15006
.
----------------------------------------------------------------------
Ran 1 test in 4.505s
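
For reviewers, the relationship between the two sample records in the log above can be sketched as follows. This is a minimal illustration only, not the code in kamus_alay.py: the helper names to_t2t and generate_t2t are hypothetical, and the field names are taken from the sample output in the log. The final check mirrors the "Checking global ID uniqueness" step in spirit, on a single row.

```python
from typing import Dict, Iterable, Iterator, Tuple


def to_t2t(idx: int, row: Dict[str, object]) -> Dict[str, str]:
    """Map one source-schema row to the nusantara_t2t layout seen in the log.

    Hypothetical helper: the real dataloader may build its records differently.
    """
    return {
        "id": str(idx),
        "text_1": str(row["slang"]),
        "text_2": str(row["formal"]),
        "text_1_name": "slang",
        "text_2_name": "formal",
    }


def generate_t2t(rows: Iterable[Dict[str, object]]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs, the shape a _generate_examples() method yields."""
    for idx, row in enumerate(rows):
        yield idx, to_t2t(idx, row)


# Source-schema sample copied from the test log above.
source_sample = {
    "slang": "woww",
    "formal": "wow",
    "is_in_dictionary": True,
    "example": "wow",
}

examples = list(generate_t2t([source_sample]))

# Every example must carry a distinct id.
ids = [example["id"] for _, example in examples]
assert len(ids) == len(set(ids))
```

The source-only fields is_in_dictionary and example are not carried into the T2T record in this sketch, which matches the nusantara_t2t sample shown in the log.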