SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
65 stars 57 forks source link

Closes #164 | Create dataset loader for OpenSLR #304

Closed MJonibek closed 8 months ago

MJonibek commented 9 months ago

Closes #164

Note: For subsets SLR35 and SLR36 checked on 2 out of 16 files, because they are very big (each file 1.1-1.4 GB).

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Checkbox

danjohnvelasco commented 8 months ago

Please run make check_file=seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py to run the code formatter. I ran the code formatter on my side and it made some changes to your code.

MJonibek commented 8 months ago

@danjohnvelasco Done requested changes

danjohnvelasco commented 8 months ago

@danjohnvelasco Done requested changes

Thanks @MJonibek!

Just one more pending requested changes from me (I guess you missed it, so I'll just quote it down here):

Please add docstring containing short description of the dataset.

Once that's done, everything looks good to me.

Also, there's a suggestion from @holylovenia :

For the subset ids, should we include the language code for convenience, e.g., openslr_SLR35_jav_seacrowd_sptext?

I agree to include language code in the subset ids. Though, I get why it's not there in the first place, current instructions at template.py does not cover this case.

With that in mind, should we add an instruction at template.py that for datasets with language subsets, they must include language code in the subset ids? @holylovenia

holylovenia commented 8 months ago

Also, there's a suggestion from @holylovenia :

For the subset ids, should we include the language code for convenience, e.g., openslr_SLR35_jav_seacrowd_sptext?

I agree to include language code in the subset ids. Though, I get why it's not there in the first place, current instructions at template.py does not cover this case.

With that in mind, should we add an instruction at template.py that for datasets with language subsets, they must include language code in the subset ids? @holylovenia

Most of the time the subset id itself is either {task} or {task}_{lang} or {lang}, thus I think it's fine to keep the template.py as-is for now. This is my first time seeing a subset that diverges from that convention, @danjohnvelasco—not that it's a bad thing.

Let's add the {lang} to this dataloader's subset ids, e.g., openslr_SLR35_jav_seacrowd_sptext, @MJonibek.

MJonibek commented 8 months ago

@holylovenia @danjohnvelasco I added language code into subset id. Checked, should work fine.

MJonibek commented 8 months ago

@danjohnvelasco I am sorry, I somehow missed your comment about docstring. I added a docstring. Please let me know if I missed something :)