Closes #535 | Add Dataloader CulturaY

akhdanfadh commented 3 months ago

Closes #535

I implemented one config per language subset, so the config names look like this: culturay_id_source, culturay_my_seacrowd_ssp, etc. When testing, pass culturay_<subset> to the --subset_id parameter.

Because the dataset is large, testing will take a while.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

khelli07 commented 2 months ago

Have you accepted the acknowledgement or form at https://huggingface.co/datasets/ontocord/CulturaY? It seems to work on my end. It's just that there are quite a few subsets, and they need some time (and space! :( ) to run.

akhdanfadh commented 2 months ago

Cannot access gated repo for URL https://huggingface.co/api/datasets/ontocord/CulturaY. Access to dataset ontocord/CulturaY is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/ontocord/CulturaY to ask for access.

Exactly as @khelli07 said, @MJonibek needs to accept the acknowledgment in the dataset repo. I actually already describe this in the description section of the dataloader.

It's just that there are quite a few subsets, and they need some time (and space! :( ) to run.

IKR, the dataset is huge. I'm not sure there is anything I can improve here.

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to address @MJonibek and @khelli07's reviews.

akhdanfadh commented 2 months ago

@MJonibek Done! Also, @khelli07, please add your review or approve if it looks good to you.

SEACrowd / seacrowd-datahub

Closes #535 | Add Dataloader CulturaY #559