Closed akhdanfadh closed 2 months ago
Tested hplt_my_cleaned_source and got 238473 data instances, which matches the info at https://hplt-project.org/datasets/v1.2.
One nitpick on the language IDs, currently all the schemas are like these: ['hplt_id_raw_source', 'hplt_id_raw_seacrowd_ssp', 'hplt_id_deduplicated_source', 'hplt_id_deduplicated_seacrowd_ssp', 'hplt_id_cleaned_source', 'hplt_id_cleaned_seacrowd_ssp', 'hplt_ms_raw_source', 'hplt_ms_raw_seacrowd_ssp', 'hplt_ms_deduplicated_source', 'hplt_ms_deduplicated_seacrowd_ssp', 'hplt_ms_cleaned_source', 'hplt_ms_cleaned_seacrowd_ssp', 'hplt_th_raw_source', 'hplt_th_raw_seacrowd_ssp', 'hplt_th_deduplicated_source', 'hplt_th_deduplicated_seacrowd_ssp', 'hplt_th_cleaned_source', 'hplt_th_cleaned_seacrowd_ssp', 'hplt_my_raw_source', 'hplt_my_raw_seacrowd_ssp', 'hplt_my_deduplicated_source', 'hplt_my_deduplicated_seacrowd_ssp', 'hplt_my_cleaned_source', 'hplt_my_cleaned_seacrowd_ssp', 'hplt_tl_raw_source', 'hplt_tl_raw_seacrowd_ssp', 'hplt_tl_deduplicated_source', 'hplt_tl_deduplicated_seacrowd_ssp', 'hplt_tl_cleaned_source', 'hplt_tl_cleaned_seacrowd_ssp']
Can the language ID be changed to the 3-character ISO code (id --> ind, etc.)?
Otherwise, the code runs well from what I tested!
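To make the requested rename concrete, here is a minimal sketch of mapping the two-letter language IDs in the subset names to three-letter codes. The `ISO_639_1_TO_3` table and the `to_iso639_3_name` helper are hypothetical names used for illustration, not taken from the dataloader, and the choice of `msa` for Malay is an assumption.

```python
# Hypothetical mapping from the two-letter IDs used in the subset names
# to three-letter ISO 639-3 codes. "msa" for Malay is an assumption; the
# actual dataloader may pick a different code.
ISO_639_1_TO_3 = {
    "id": "ind",  # Indonesian
    "ms": "msa",  # Malay (macrolanguage code, assumed)
    "th": "tha",  # Thai
    "my": "mya",  # Burmese
    "tl": "tgl",  # Tagalog
    "vi": "vie",  # Vietnamese
}


def to_iso639_3_name(config_name: str) -> str:
    """Rewrite 'hplt_<2-letter>_<rest>' as 'hplt_<3-letter>_<rest>'."""
    prefix, lang, rest = config_name.split("_", 2)
    return f"{prefix}_{ISO_639_1_TO_3[lang]}_{rest}"


print(to_iso639_3_name("hplt_my_cleaned_source"))
```

An unknown language ID raises `KeyError` here, which may be preferable to silently keeping a two-letter code.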
@luckysusanto @jen-santoso Done addressing the ISO code for subset names, tested already on `hplt_mya_cleaned`. Please double-check.
Just a heads-up, I noticed `vie` language is included in the dataset's supported languages but not listed in datasheet #524. I haven't checked for other possible unlisted languages and also have not implemented that here in the dataloader. Just waiting for further instructions.
In case you missed it, how about this one?
I think `vie` should be added as Vietnamese is included in SEA languages, yet it was not listed in the issue card. summon @SamuelCahyawijaya @holylovenia
Added! Thanks @jen-santoso.
Closes #524
I implemented one config per language+subset. Thus, configs will look like this: `hplt_id_raw_source`, `hplt_my_cleaned_seacrowd_ssp`, etc. When testing, pass `hplt_<subset>` to the `--subset_id` parameter.

Due to the huge size, it may take some time to download the data. For efficient testing, I suggest testing by copying some part of this code and using either of these subsets: `my_cleaned` (smallest seacrowd subset), `cy_cleaned` (smallest one-file-dataset), or `sk_cleaned` (smallest two-files-dataset) if you're interested.

Just a heads-up, I noticed `vie` language is included in the dataset's supported languages but not listed in datasheet #524. I haven't checked for other possible unlisted languages and also have not implemented that here in the dataloader. Just waiting for further instructions.

Checkbox
- `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- `datasets.load_dataset` function.
- `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.