SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #535 | Add Dataloader CulturaY #559

akhdanfadh closed 2 months ago

akhdanfadh commented 3 months ago

Closes #535

I implemented one config per language/subset. Thus, configs will look like this: culturay_id_source, culturay_my_seacrowd_ssp, etc. When testing, pass culturay_<subset> to the --subset_id parameter.
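For reviewers who want to see the naming scheme spelled out, here is a minimal sketch of how the config names and `--subset_id` values line up. The helper names (`config_names`, `subset_id`) and the two-language sample list are hypothetical and only for illustration; the schema suffixes are the ones mentioned above, and the actual dataloader may build its configs differently.

```python
# Illustrative sketch only, not the dataloader's actual code.
# Assumption: a partial sample of subsets ("id" and "my" appear in this PR).
SUBSETS = ["id", "my"]
# Schema suffixes as they appear in the config names above.
SCHEMAS = ["source", "seacrowd_ssp"]


def config_names(subsets=SUBSETS, schemas=SCHEMAS, prefix="culturay"):
    """Return one config name per (subset, schema) pair."""
    return [
        f"{prefix}_{subset}_{schema}"
        for subset in subsets
        for schema in schemas
    ]


def subset_id(subset, prefix="culturay"):
    """Return the value to pass to --subset_id for a given subset."""
    return f"{prefix}_{subset}"


if __name__ == "__main__":
    # Print every config name generated from the sample lists.
    for name in config_names():
        print(name)
```

With the sample lists, `subset_id("id")` gives `culturay_id`, which covers both the `culturay_id_source` and `culturay_id_seacrowd_ssp` configs.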

Because the dataset is large, testing will take a while.


khelli07 commented 2 months ago

Have you accepted the acknowledgement or form at https://huggingface.co/datasets/ontocord/CulturaY? It seems to work on my end. It's just that there are quite a few subsets, and they need some time (and disk space! :( ) to run.

akhdanfadh commented 2 months ago

Cannot access gated repo for URL https://huggingface.co/api/datasets/ontocord/CulturaY. Access to dataset ontocord/CulturaY is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/ontocord/CulturaY to ask for access.

Exactly as @khelli07 said, @MJonibek needs to accept the acknowledgment in the dataset repo. I already described this in the description section of the dataloader, actually.
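In case it helps, here is a rough sketch of loading a gated dataset once the acknowledgment has been accepted on the dataset page. The helper names (`resolve_hf_token`, `load_gated_subset`) are hypothetical. It assumes a recent `datasets` release where `load_dataset` accepts a `token` argument, that the language code works as the config name, and that a `train` split exists; none of that is confirmed in this thread. `streaming=True` is optional and only there to avoid downloading everything to disk.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return a Hugging Face token from the environment, or None if unset."""
    # HF_TOKEN is the environment variable huggingface_hub reads for auth.
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def load_gated_subset(lang, token=None):
    """Stream one language subset of the gated dataset (sketch only)."""
    # Imported lazily so resolve_hf_token works without `datasets` installed.
    from datasets import load_dataset

    # Assumptions: `lang` is a valid config name and "train" is the split.
    return load_dataset(
        "ontocord/CulturaY",
        lang,
        split="train",
        streaming=True,  # optional: iterate without a full download
        token=token or resolve_hf_token(),
    )
```

The token alone is not enough: the account it belongs to still has to accept the terms on the dataset page, otherwise the same "gated repo" error comes back.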

It's just that there are quite a few subsets, and they need some time (and disk space! :( ) to run.

IKR, the dataset is huge. I'm not sure there is anything I can improve here.

holylovenia commented 2 months ago

A friendly reminder for @akhdanfadh to address @MJonibek and @khelli07's reviews.

akhdanfadh commented 2 months ago

@MJonibek Done! Also, @khelli07, please add your review or approve if it LGTY.