Closes #524 | Add Dataloader HPLT

akhdanfadh commented 3 months ago

Closes #524

I implemented one config per language and subset combination, so config names look like hplt_id_raw_source, hplt_my_cleaned_seacrowd_ssp, and so on. When testing, pass hplt_<subset> (for example, hplt_my_cleaned) to the --subset_id parameter.
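As a rough illustration of this naming scheme, the sketch below enumerates one config name per language, subset, and schema. The `LANGUAGES`, `SUBSETS`, and `SCHEMAS` lists and the `build_config_names` helper are hypothetical names chosen for this example and are not taken from the dataloader script; the language codes are the two-letter ones used at this point in the PR.

```python
from itertools import product

# Assumed values for illustration only; the real dataloader may define these differently.
LANGUAGES = ["id", "ms", "th", "my", "tl"]
SUBSETS = ["raw", "deduplicated", "cleaned"]
SCHEMAS = ["source", "seacrowd_ssp"]


def build_config_names(languages=LANGUAGES, subsets=SUBSETS, schemas=SCHEMAS):
    """Return one config name per (language, subset, schema) combination."""
    return [
        f"hplt_{lang}_{subset}_{schema}"
        for lang, subset, schema in product(languages, subsets, schemas)
    ]


names = build_config_names()
print(len(names))   # number of generated config names
print(names[:2])    # first two names, for a quick visual check
```

With five languages, three subsets, and two schemas, this yields 30 names, which is why testing a single small subset is more practical than iterating over all of them.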

Because the dataset is huge, downloading it may take a while. For efficient testing, I suggest copying part of this code and using one of these subsets: my_cleaned (smallest SEACrowd subset), cy_cleaned (smallest single-file dataset), or sk_cleaned (smallest two-file dataset), if you're interested.
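For reviewers who want to script this, here is a minimal sketch that assembles the test-suite command for each of the small subsets above. The script path `seacrowd/sea_datasets/hplt/hplt.py` is an assumption (the PR text only shows the `<my_dataset>` template), and `build_test_command` is a hypothetical helper, not part of the SEACrowd test suite.

```python
import shlex

# Hypothetical path: assumed from the template
# seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py with "hplt" as the dataset name.
SCRIPT_PATH = "seacrowd/sea_datasets/hplt/hplt.py"


def build_test_command(subset, script_path=SCRIPT_PATH):
    """Build the argument list for running the test suite on one subset."""
    return [
        "python", "-m", "tests.test_seacrowd",
        script_path,
        "--subset_id", f"hplt_{subset}",
    ]


# Print the command for each small subset so it can be pasted into a shell.
for subset in ("my_cleaned", "cy_cleaned", "sk_cleaned"):
    print(shlex.join(build_test_command(subset)))
```

The argument list can also be passed directly to `subprocess.run` from the repository root, which avoids shell quoting issues.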

Just a heads-up, I noticed vie language is included in the dataset's supported languages but not listed in datasheet #524. I haven't checked for other possible unlisted languages and also have not implemented that here in the dataloader. Just waiting for further instructions.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

luckysusanto commented 3 months ago

Tested hplt_my_cleaned_source and got 238473 data instances, which matches the info from https://hplt-project.org/datasets/v1.2.

One nitpick on the language IDs: currently all the config names look like this:

['hplt_id_raw_source', 'hplt_id_raw_seacrowd_ssp', 'hplt_id_deduplicated_source', 'hplt_id_deduplicated_seacrowd_ssp', 'hplt_id_cleaned_source', 'hplt_id_cleaned_seacrowd_ssp', 'hplt_ms_raw_source', 'hplt_ms_raw_seacrowd_ssp', 'hplt_ms_deduplicated_source', 'hplt_ms_deduplicated_seacrowd_ssp', 'hplt_ms_cleaned_source', 'hplt_ms_cleaned_seacrowd_ssp', 'hplt_th_raw_source', 'hplt_th_raw_seacrowd_ssp', 'hplt_th_deduplicated_source', 'hplt_th_deduplicated_seacrowd_ssp', 'hplt_th_cleaned_source', 'hplt_th_cleaned_seacrowd_ssp', 'hplt_my_raw_source', 'hplt_my_raw_seacrowd_ssp', 'hplt_my_deduplicated_source', 'hplt_my_deduplicated_seacrowd_ssp', 'hplt_my_cleaned_source', 'hplt_my_cleaned_seacrowd_ssp', 'hplt_tl_raw_source', 'hplt_tl_raw_seacrowd_ssp', 'hplt_tl_deduplicated_source', 'hplt_tl_deduplicated_seacrowd_ssp', 'hplt_tl_cleaned_source', 'hplt_tl_cleaned_seacrowd_ssp']

Can the language IDs be changed to the three-character ISO 639-3 codes (id --> ind, etc.)?

Otherwise, the code runs well from what I tested!
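The requested rename could be sketched as a simple lookup from two-letter to three-letter codes. Only id --> ind and my --> mya are confirmed in this thread; the other entries are the standard ISO 639-3 codes as far as I know, and which code the dataloader actually uses for Malay is an assumption. `ISO_639_1_TO_3` and `to_iso3_config_name` are hypothetical names for illustration.

```python
# Two-letter to three-letter language codes.
# "ind" and "mya" appear in this thread; the remaining entries are assumptions.
ISO_639_1_TO_3 = {
    "id": "ind",  # Indonesian
    "ms": "msa",  # Malay (macrolanguage code; assumed)
    "th": "tha",  # Thai (assumed)
    "my": "mya",  # Burmese
    "tl": "tgl",  # Tagalog (assumed)
}


def to_iso3_config_name(name):
    """Rewrite 'hplt_<lang2>_<rest>' as 'hplt_<lang3>_<rest>'."""
    prefix, lang, rest = name.split("_", 2)
    return f"{prefix}_{ISO_639_1_TO_3[lang]}_{rest}"


print(to_iso3_config_name("hplt_id_raw_source"))
print(to_iso3_config_name("hplt_my_cleaned_seacrowd_ssp"))
```

Splitting on the first two underscores only keeps multi-part suffixes such as `cleaned_seacrowd_ssp` intact.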

akhdanfadh commented 2 months ago

@luckysusanto @jen-santoso Done addressing the ISO code for subset names, tested already on hplt_mya_cleaned. Please double-check.

> Just a heads-up, I noticed vie language is included in the dataset's supported languages but not listed in datasheet #524. I haven't checked for other possible unlisted languages and also have not implemented that here in the dataloader. Just waiting for further instructions.

In case you missed it, how about this one?

jensan-1 commented 2 months ago

> @luckysusanto @jen-santoso Done addressing the ISO code for subset names, tested already on hplt_mya_cleaned. Please double-check.
>
> Just a heads-up, I noticed vie language is included in the dataset's supported languages but not listed in datasheet #524. I haven't checked for other possible unlisted languages and also have not implemented that here in the dataloader. Just waiting for further instructions.
>
> In case you missed it, how about this one?

I think vie should be added as Vietnamese is included in SEA languages, yet it was not listed in the issue card. summon @SamuelCahyawijaya @holylovenia

holylovenia commented 2 months ago

> I think vie should be added as Vietnamese is included in SEA languages, yet it was not listed in the issue card. summon @SamuelCahyawijaya @holylovenia

Added! Thanks @jen-santoso.

SEACrowd / seacrowd-datahub

Closes #524 | Add Dataloader HPLT #564
