SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #524 | Add Dataloader HPLT #564

Closed akhdanfadh closed 2 months ago

akhdanfadh commented 3 months ago

Closes #524

I implemented one config per language and subset combination, so config names look like hplt_id_raw_source, hplt_my_cleaned_seacrowd_ssp, etc. When testing, pass hplt_<subset> (for example, hplt_my_cleaned) to the --subset_id parameter.
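To make the naming scheme concrete, here is a minimal sketch of how one config name per language, subset, and schema could be enumerated. It is not the dataloader's actual code: the constant names and the helper `build_config_names` are hypothetical, and the language, subset, and schema values are assumptions read off the config names reported in this thread.

```python
from itertools import product

# Assumed values, taken from the config names quoted in this thread.
LANGUAGES = ["id", "ms", "th", "my", "tl"]
SUBSETS = ["raw", "deduplicated", "cleaned"]
SCHEMAS = ["source", "seacrowd_ssp"]


def build_config_names(languages=LANGUAGES, subsets=SUBSETS, schemas=SCHEMAS):
    """Return one config name per (language, subset, schema) combination."""
    return [
        f"hplt_{lang}_{subset}_{schema}"
        for lang, subset, schema in product(languages, subsets, schemas)
    ]


config_names = build_config_names()
```

With five languages, three subsets, and two schemas this yields 30 names, so adding a language only means extending the language list.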

Because the dataset is huge, downloading it may take some time. For efficient testing, I suggest copying part of this code and using one of these subsets: my_cleaned (the smallest SEACrowd subset), cy_cleaned (the smallest single-file dataset), or sk_cleaned (the smallest two-file dataset), if you're interested.

Just a heads-up: I noticed that vie is included in the dataset's supported languages but is not listed in datasheet #524. I haven't checked for other unlisted languages, and I have not added vie to the dataloader. I'm waiting for further instructions.

luckysusanto commented 3 months ago

Tested hplt_my_cleaned_source and got 238473 data instances, which matches the info at https://hplt-project.org/datasets/v1.2.

One nitpick on the language IDs: currently all the config names look like this: ['hplt_id_raw_source', 'hplt_id_raw_seacrowd_ssp', 'hplt_id_deduplicated_source', 'hplt_id_deduplicated_seacrowd_ssp', 'hplt_id_cleaned_source', 'hplt_id_cleaned_seacrowd_ssp', 'hplt_ms_raw_source', 'hplt_ms_raw_seacrowd_ssp', 'hplt_ms_deduplicated_source', 'hplt_ms_deduplicated_seacrowd_ssp', 'hplt_ms_cleaned_source', 'hplt_ms_cleaned_seacrowd_ssp', 'hplt_th_raw_source', 'hplt_th_raw_seacrowd_ssp', 'hplt_th_deduplicated_source', 'hplt_th_deduplicated_seacrowd_ssp', 'hplt_th_cleaned_source', 'hplt_th_cleaned_seacrowd_ssp', 'hplt_my_raw_source', 'hplt_my_raw_seacrowd_ssp', 'hplt_my_deduplicated_source', 'hplt_my_deduplicated_seacrowd_ssp', 'hplt_my_cleaned_source', 'hplt_my_cleaned_seacrowd_ssp', 'hplt_tl_raw_source', 'hplt_tl_raw_seacrowd_ssp', 'hplt_tl_deduplicated_source', 'hplt_tl_deduplicated_seacrowd_ssp', 'hplt_tl_cleaned_source', 'hplt_tl_cleaned_seacrowd_ssp']

Can the language IDs be changed to the three-letter ISO codes (id --> ind, etc.)?
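As a rough illustration of the requested rename, the sketch below maps the two-letter codes in the config names to three-letter codes. It assumes that "3 character ISO code" means ISO 639-3, and that ms maps to the macrolanguage code msa; the thread does not say which Malay code the dataloader should use. The mapping name and the helper `to_iso3_config_name` are hypothetical.

```python
# Two-letter codes from the config names -> three-letter ISO 639-3 codes.
# Assumption: "ms" maps to the macrolanguage code "msa".
ISO_639_1_TO_3 = {
    "id": "ind",
    "ms": "msa",
    "th": "tha",
    "my": "mya",
    "tl": "tgl",
}


def to_iso3_config_name(name: str) -> str:
    """Rewrite 'hplt_<xx>_<rest>' as 'hplt_<xxx>_<rest>'."""
    # Split only twice so suffixes such as 'cleaned_seacrowd_ssp' stay intact.
    prefix, lang, rest = name.split("_", 2)
    if prefix != "hplt" or lang not in ISO_639_1_TO_3:
        raise ValueError(f"unexpected config name: {name!r}")
    return f"{prefix}_{ISO_639_1_TO_3[lang]}_{rest}"


renamed = to_iso3_config_name("hplt_id_raw_source")
```

Raising on an unknown code, rather than passing the name through unchanged, makes a missing mapping entry visible immediately.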

Otherwise, the code runs well from what I tested!

akhdanfadh commented 2 months ago

@luckysusanto @jen-santoso I've switched the subset names to ISO codes and tested on hplt_mya_cleaned. Please double-check.

In case you missed it, what about my earlier note? vie is included in the dataset's supported languages but is not listed in datasheet #524, and I have not added it to the dataloader. I'm waiting for further instructions.

jensan-1 commented 2 months ago

I think vie should be added, since Vietnamese is a SEA language even though it was not listed in the issue card. Summoning @SamuelCahyawijaya @holylovenia.

holylovenia commented 2 months ago

Added! Thanks @jen-santoso.