Closes #339 | Update dataloader for Leipzig

TysonYu commented 7 months ago

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

holylovenia commented 6 months ago

A friendly reminder to follow up, @TysonYu @raileymontalan.

TysonYu commented 6 months ago

Hi @TysonYu, could you please fix the folder name to leipzig_corpora (i.e. seacrowd/sea_datasets/leipzig_corpora/leipzig_corpora.py? And provide per-language subsets. Other than that, the code LGTM. Thanks! Done~

holylovenia commented 5 months ago

Hi @raileymontalan and @SamuelCahyawijaya, I changed the "copora" to "corpora". Please feel free to let @TysonYu know if other changes are required.

raileymontalan commented 5 months ago

Hi @TysonYu, are you working on creating subsets per language, as per @SamuelCahyawijaya's request?

holylovenia commented 5 months ago

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia commented 4 months ago

Hi @TysonYu, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

holylovenia commented 3 months ago

Hi @TysonYu, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @SamuelCahyawijaya @raileymontalan

SEACrowd / seacrowd-datahub

Closes #339 | Update dataloader for Leipzig #483

Checkbox