Closes #531 | Add Dataloader Multilingual-ALPACA

akhdanfadh commented 6 months ago

Closes #531

Similar to #556, I use a third-party library, gdown (pip install gdown), to download the GDrive data because it is more reliable than the dl_manager. As in that PR, I also store the downloaded data in data/multilingual_alpaca/. I am aware that I should make a PR for those two things; I am just waiting for further instructions.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
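
As a rough illustration of the checklist items on required variables and methods, the sketch below statically checks a dataloader script's source for them. The helper name `missing_checklist_items` is hypothetical and is not part of the SEACrowd test suite; the variable and method names are copied from the checklist above, and only the first class-level and module-level definitions are inspected.

```python
import ast

# Names copied verbatim from the PR checklist above.
REQUIRED_VARIABLES = {
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_LOCAL", "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_SEACROWD_VERSION",
}
REQUIRED_METHODS = {"_info", "_split_generators", "_generate_examples"}


def missing_checklist_items(source: str) -> set:
    """Return the checklist names that the given dataloader source does not define.

    Hypothetical helper: it only looks at module-level assignments and at
    methods defined directly inside module-level classes.
    """
    tree = ast.parse(source)
    assigned, methods = set(), set()
    for node in tree.body:
        if isinstance(node, ast.Assign):
            # Plain assignments such as `_CITATION = "..."`.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    assigned.add(target.id)
        elif isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            # Annotated assignments such as `_LOCAL: bool = False`.
            assigned.add(node.target.id)
        elif isinstance(node, ast.ClassDef):
            # Methods of the builder class, e.g. `_info`.
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    methods.add(item.name)
    return (REQUIRED_VARIABLES - assigned) | (REQUIRED_METHODS - methods)
```

Calling it as `missing_checklist_items(open(path).read())` on a script returns an empty set when every listed name is present. It does not replace running `python -m tests.test_seacrowd`, which checks the actual behavior.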

elyanah-aco commented 5 months ago

@akhdanfadh

The code itself looks good to me; it just needs a try-except around the third-party imports, as in #556. Let's also follow that PR for how to store the downloaded data.
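
For reference, a minimal sketch of what such a guard could look like. This is an assumption about the general pattern, not a copy of what #556 does: the helper names `require_optional` and `fetch_gdrive_file`, the `file_id` and `filename` parameters, and the skip-if-already-downloaded behavior are all hypothetical. Only `gdown.download(id=..., output=..., quiet=...)` and the `data/multilingual_alpaca/` directory come from gdown and from this PR respectively.

```python
import importlib
from pathlib import Path


def require_optional(module_name, pip_name=None):
    """Import a third-party module, raising a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"This dataloader needs `{module_name}`. "
            f"Install it with `pip install {pip_name or module_name}`."
        ) from err


def fetch_gdrive_file(file_id, filename, data_dir="data/multilingual_alpaca"):
    """Download one Google Drive file into data_dir unless it is already there.

    file_id and filename are placeholders supplied by the caller; the real
    values would come from the dataloader's URL metadata.
    """
    target = Path(data_dir) / filename
    if target.exists():
        # Reuse the local copy instead of hitting Google Drive again.
        return target
    gdown = require_optional("gdown")  # imported lazily, only when needed
    target.parent.mkdir(parents=True, exist_ok=True)
    gdown.download(id=file_id, output=str(target), quiet=False)
    return target
```

Importing lazily keeps the script importable on machines without gdown, so the error only appears when a download is actually attempted.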

I found that the dataset comes from this paper, which links to this GitHub repo, so you can change the dataloader metadata accordingly.

@holylovenia Datasheet also needs to be updated with paper details.

akhdanfadh commented 5 months ago

@elyanah-aco Done!

holylovenia commented 5 months ago

> @holylovenia Datasheet also needs to be updated with paper details.

@elyanah-aco Do you mean the paper title, or...?

akhdanfadh commented 5 months ago

@holylovenia see the dataset metadata I put in the dataloader. @elyanah-aco means updating the datasheet (#531) with this paper and this homepage.

holylovenia commented 5 months ago

> @holylovenia see the dataset metadata I put in the dataloader. @elyanah-aco means updating the datasheet (#531) with this paper and this homepage.

Done! Thanks for letting me know, @elyanah-aco and @akhdanfadh.

SEACrowd / seacrowd-datahub

Closes #531 | Add Dataloader Multilingual-ALPACA #566