SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #531 | Add Dataloader Multilingual-ALPACA #566

Closed: akhdanfadh closed this 5 months ago

akhdanfadh commented 6 months ago

Closes #531

Similar to #556, I use a third-party library, gdown (pip install gdown), to download the Google Drive data because it is more reliable than the dl_manager. As in that PR, I also store the downloaded data in data/multilingual_alpaca/. I am aware that I should make a PR for those two things; I am just waiting for further instructions.
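For readers following along, here is a minimal sketch of what downloading with gdown into data/multilingual_alpaca/ could look like. It is not the code from this PR: the helper name `download_if_missing`, the `file_id` argument, and the file name are hypothetical, and the only gdown call used is `gdown.download` with its `id`, `output`, and `quiet` keywords.

```python
from pathlib import Path

# Directory named in the PR description; everything else here is illustrative.
DATA_DIR = Path("data") / "multilingual_alpaca"


def download_if_missing(file_id: str, filename: str, data_dir: Path = DATA_DIR) -> Path:
    """Fetch a Google Drive file with gdown unless a local copy already exists."""
    target = data_dir / filename
    if target.exists():
        # Reuse the cached copy; no network access and no gdown import needed.
        return target

    # Imported lazily so this module can be loaded without gdown installed.
    import gdown

    data_dir.mkdir(parents=True, exist_ok=True)
    # file_id is a placeholder for the Google Drive file ID of the dataset.
    gdown.download(id=file_id, output=str(target), quiet=False)
    return target
```

Checking for an existing file first means repeated loads do not hit Google Drive again, which is presumably the point of storing the data under data/multilingual_alpaca/.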


elyanah-aco commented 5 months ago

@akhdanfadh

The code itself looks good to me; it just needs a try-except around the third-party imports, as in #556. Let's also follow that PR on how to store the downloaded data.
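A try-except guard around an optional import usually looks something like the sketch below. This is an assumption about the general pattern, not the exact code from #556; the helper name `require_gdown` and the error message are hypothetical.

```python
# Guard the optional dependency so that importing the dataloader module
# does not fail outright when gdown is not installed.
try:
    import gdown
except ImportError:
    gdown = None  # resolved later, only when a download is actually requested


def require_gdown():
    """Return the gdown module, or raise an actionable error if it is missing."""
    if gdown is None:
        raise ImportError(
            "This dataloader needs the third-party package 'gdown'. "
            "Install it with: pip install gdown"
        )
    return gdown
```

Deferring the failure to the point of use keeps the rest of the dataloaders importable and gives users an install hint instead of a bare traceback.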

I found that the dataset comes from this paper, which links to this GitHub repo, so you can update the dataloader metadata accordingly.

@holylovenia Datasheet also needs to be updated with paper details.

akhdanfadh commented 5 months ago

@elyanah-aco Done!

holylovenia commented 5 months ago

> @holylovenia Datasheet also needs to be updated with paper details.

@elyanah-aco Do you mean the paper title, or...?

akhdanfadh commented 5 months ago

@holylovenia See the dataset metadata I put in the dataloader. @elyanah-aco means updating the datasheet (#531) with this paper and this homepage.

holylovenia commented 5 months ago

> @holylovenia see dataset's metadata I put on the dataloader. @elyanah-aco means updating the datasheet #531 by this paper and this homepage.

Done! Thanks for letting me know, @elyanah-aco and @akhdanfadh.