Closes #164 | Create dataset loader for OpenSLR

MJonibek commented 9 months ago

Closes #164

Note: For subsets SLR35 and SLR36, I checked only 2 out of 16 files, because the files are very large (1.1-1.4 GB each).

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
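
For reference, the BUILDER_CONFIGS item in the checklist could look roughly like the sketch below. It uses a stand-in dataclass in place of the real SEACrowdConfig so that it runs on its own. The field names, the version strings, and the descriptions are assumptions for illustration and are not taken from this PR's code; only the `seacrowd_sptext` schema name appears in this thread.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for SEACrowdConfig; field names are assumed."""

    name: str
    version: str
    description: str
    schema: str
    subset_id: str


_DATASETNAME = "openslr"
_SOURCE_VERSION = "1.0.0"    # placeholder value, not from the PR
_SEACROWD_VERSION = "1.0.0"  # placeholder value, not from the PR

# One config for the source schema and one for a seacrowd schema,
# as the checklist requires.
BUILDER_CONFIGS = [
    ConfigSketch(
        name=f"{_DATASETNAME}_source",
        version=_SOURCE_VERSION,
        description=f"{_DATASETNAME} source schema",
        schema="source",
        subset_id=_DATASETNAME,
    ),
    ConfigSketch(
        name=f"{_DATASETNAME}_seacrowd_sptext",
        version=_SEACROWD_VERSION,
        description=f"{_DATASETNAME} SEACrowd schema",
        schema="seacrowd_sptext",
        subset_id=_DATASETNAME,
    ),
]

# Collect the schemas so the "at least one of each" rule can be checked.
schemas = {config.schema for config in BUILDER_CONFIGS}
```

In the actual dataloader the list would hold SEACrowdConfig instances and live on the builder class as a class attribute.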

danjohnvelasco commented 8 months ago

Please run make check_file=seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py to run the code formatter. I ran the code formatter on my side and it made some changes to your code.

MJonibek commented 8 months ago

@danjohnvelasco I have made the requested changes.

danjohnvelasco commented 8 months ago

> @danjohnvelasco I have made the requested changes.

Thanks @MJonibek!

Just one more pending requested change from me (I guess you missed it, so I'll quote it here):

> Please add a docstring containing a short description of the dataset.

Once that's done, everything looks good to me.

Also, there's a suggestion from @holylovenia:

> For the subset ids, should we include the language code for convenience, e.g., openslr_SLR35_jav_seacrowd_sptext?

I agree with including the language code in the subset ids. I understand why it wasn't there in the first place, though: the current instructions in template.py do not cover this case.

With that in mind, should we add an instruction to template.py saying that datasets with language subsets must include the language code in their subset ids? @holylovenia

holylovenia commented 8 months ago

> Also, there's a suggestion from @holylovenia:
>
> > For the subset ids, should we include the language code for convenience, e.g., openslr_SLR35_jav_seacrowd_sptext?
>
> I agree with including the language code in the subset ids. I understand why it wasn't there in the first place, though: the current instructions in template.py do not cover this case.
>
> With that in mind, should we add an instruction to template.py saying that datasets with language subsets must include the language code in their subset ids? @holylovenia

Most of the time, the subset id is {task}, {task}_{lang}, or {lang}, so I think it's fine to keep template.py as-is for now. This is the first time I've seen a subset that diverges from that convention, @danjohnvelasco (not that it's a bad thing).

Let's add the {lang} to this dataloader's subset ids, e.g., openslr_SLR35_jav_seacrowd_sptext, @MJonibek.
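
A minimal sketch of the naming scheme, in case it helps. The helper `config_name` and the mapping `_SUBSET_LANGS` are hypothetical names, and only the SLR35 to jav pairing comes from the example above; the SLR36 entry is an assumption added for illustration.

```python
# Subset-to-language mapping. Only SLR35 -> jav is confirmed in this thread;
# the SLR36 entry is an assumption for illustration.
_SUBSET_LANGS = {"SLR35": "jav", "SLR36": "sun"}

# Schema suffixes: the source schema plus the seacrowd schema named above.
_SCHEMAS = ["source", "seacrowd_sptext"]


def config_name(dataset: str, subset: str, lang: str, schema: str) -> str:
    """Build a config name of the form {dataset}_{subset}_{lang}_{schema}."""
    return f"{dataset}_{subset}_{lang}_{schema}"


# One name per (subset, schema) pair, with the language code in the middle.
CONFIG_NAMES = [
    config_name("openslr", subset, lang, schema)
    for subset, lang in _SUBSET_LANGS.items()
    for schema in _SCHEMAS
]
```

With this layout, the example id from the suggestion, openslr_SLR35_jav_seacrowd_sptext, is one of the generated names.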

MJonibek commented 8 months ago

@holylovenia @danjohnvelasco I added the language code to the subset ids. I checked it, and it should work fine.

MJonibek commented 8 months ago

@danjohnvelasco I am sorry, I somehow missed your comment about the docstring. I have now added one. Please let me know if I missed anything else :)

SEACrowd / seacrowd-datahub

Closes #164 | Create dataset loader for OpenSLR #304