IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Closes #276 | Create dataset loader for INDspeech_NEWSTRA_EthnicSR #292

Closed: IvanHalimP closed this 1 year ago

IvanHalimP commented 1 year ago

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Closes #276

I have several questions:

  1. The repo contains two datasets. Which one should I use? This initial PR uses only 'dataset1', as it is the closest match to the description in the card.
  2. There are four languages (Batak, Bali, Sunda, and Jawa). The language codes I use are based on the abbreviations in the dataset. Is that acceptable, or should I follow a particular standard?
  3. Regarding the train/test split, the card mentions 9000/4000 samples. I think this refers to all the languages combined, excluding 'dataset2'.
  4. 'dataset2' has another 1600/50 samples per language. However, if I include them in the loader, the totals won't match the description in the card.
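To make the mismatch in points 3 and 4 concrete, here is a rough sketch of the arithmetic. It assumes the card's 9000/4000 figure is the total across all four languages (my reading, not confirmed) and that 'dataset2' adds 1600/50 samples for each language. The names `CARD_TOTAL`, `DATASET2_PER_LANGUAGE`, and `totals_with_dataset2` are made up for illustration and are not part of the loader.

```python
# Sample counts quoted in this comment; how they combine is an assumption.
LANGUAGES = ["Batak", "Bali", "Sunda", "Jawa"]

# Card figure, read here as all languages combined (unconfirmed).
CARD_TOTAL = {"train": 9000, "test": 4000}

# Extra samples that 'dataset2' provides for each language.
DATASET2_PER_LANGUAGE = {"train": 1600, "test": 50}


def totals_with_dataset2(card_total, per_language, n_languages):
    """Return the split sizes if 'dataset2' were added for every language."""
    return {
        split: card_total[split] + per_language[split] * n_languages
        for split in card_total
    }


combined = totals_with_dataset2(CARD_TOTAL, DATASET2_PER_LANGUAGE, len(LANGUAGES))
print(combined)  # {'train': 15400, 'test': 4200}
```

Under these assumptions, including 'dataset2' would give 15400/4200 instead of the 9000/4000 stated in the card.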

Those are all my concerns so far. Looking forward to any guidance. Thanks in advance.

IvanHalimP commented 1 year ago

/test dataset=indspeech_newstra_ethnicsr subset_id=indspeech_newstra_ethnicsr_Jaw

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3118214419

IvanHalimP commented 1 year ago

/test dataset=indspeech_newstra_ethnicsr subset_id=indspeech_newstra_ethnicsr_jaw

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3163659870

IvanHalimP commented 1 year ago

/test dataset=indspeech_newstra_ethnicsr subset_id=indspeech_newstra_ethnicsr_jaw

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3163683151

IvanHalimP commented 1 year ago

Oh wait, I just saw @SamuelCahyawijaya's reply on Slack. Will update tomorrow.

IvanHalimP commented 1 year ago

/test dataset=indspeech_newstra_ethnicsr subset_id=indspeech_newstra_ethnicsr_overlap_btk

IvanHalimP commented 1 year ago

/test dataset=indspeech_newstra_ethnicsr subset_id=indspeech_newstra_ethnicsr_nooverlap_sun
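For reference, the subset IDs used in the `/test` commands in this thread all follow the pattern dataset name, optional variant (`overlap` or `nooverlap`), then language abbreviation, joined with underscores. A minimal sketch of composing them is below. The helper names `build_subset_id` and `build_test_command` are hypothetical and do not exist in the repo, and lowercasing the abbreviation is an assumption based on the later commands using `jaw`, `btk`, and `sun` in lowercase.

```python
def build_subset_id(dataset: str, lang: str, variant: str = "") -> str:
    """Join dataset name, optional variant, and language abbreviation with underscores."""
    parts = [dataset]
    if variant:
        parts.append(variant)
    # Assumption: config names use lowercase language abbreviations.
    parts.append(lang.lower())
    return "_".join(parts)


def build_test_command(dataset: str, lang: str, variant: str = "") -> str:
    """Format the comment that triggers the test workflow, as written in this thread."""
    subset_id = build_subset_id(dataset, lang, variant)
    return f"/test dataset={dataset} subset_id={subset_id}"


if __name__ == "__main__":
    name = "indspeech_newstra_ethnicsr"
    print(build_test_command(name, "btk", "overlap"))
    print(build_test_command(name, "sun", "nooverlap"))
```

Only the abbreviations that appear in this thread are used here; the abbreviation for Bali is not shown anywhere above, so it is left out.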

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167213062

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3167212587