SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for Lio and the Central Flores languages #312

Closed SamuelCahyawijaya closed 5 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: lio_and_central_flores/lio_and_central_flores.py DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lio_and_central_flores

Dataset lio_and_central_flores
Description This dataset is a collection of language resources of Li'o, Ende, Nage, and So'a, collected in Ende, Flores, Eastern Nusa Tenggara. It comes from the MA thesis research by Alexander Elias. Title: Lio and the Central Flores languages
Subsets Lio Collection
Languages end, nxe, ssq, ljl, eng
Tasks Automatic Speech Recognition, Machine Translation
License Unknown (unknown)
Homepage https://archive.mpi.nl/tla/islandora/search/alexander%20elias?type=dismax&islandora_solr_search_navigation=0&f%5B0%5D=cmd.Contributor%3A%22Alexander%5C%20Elias%22
HF URL -
Paper URL https://studenttheses.universiteitleiden.nl/handle/1887/69452
joanitolopo commented 9 months ago

self-assign

joanitolopo commented 8 months ago

Hi! I have a question regarding this dataset. Should we separate the data loaders by task (Speech Recognition and Machine Translation) for the sake of simplicity? If not, could you please share a reference that implements two or more tasks in a single data loader? Thanks!

holylovenia commented 8 months ago

Hi! I have a question regarding this dataset. Should we separate the data loaders by task (Speech Recognition and Machine Translation) for the sake of simplicity? If not, could you please share a reference that implements two or more tasks in a single data loader? Thanks!

Hi @joanitolopo, thank you for taking on this dataloader. Could we have multiple subsets instead of multiple dataloaders?

seacrowd subsets

  1. lio_and_central_flores_asr_{lang}_seacrowd_sptext for all of the SEA languages
  2. lio_and_central_flores_mt_{lang}_seacrowd_t2t for all of the SEA languages

source subsets

  1. lio_and_central_flores_asr_{lang}_source for all of the SEA languages
  2. lio_and_central_flores_mt_{lang}_source for all of the SEA languages
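The subset scheme above can be sketched as a small enumeration. This is a minimal illustration, not the actual dataloader code: it assumes the four SEA language codes from the datasheet (`end`, `nxe`, `ssq`, `ljl`; `eng` being the English translation side rather than its own subset) and the usual seacrowd schema suffixes (`sptext` for ASR, `t2t` for MT).

```python
# Hypothetical sketch of the subset naming scheme discussed above.
# Assumes: 4 SEA language codes from the datasheet, and the usual
# seacrowd schema suffixes (sptext for ASR, t2t for MT).
LANGUAGES = ["end", "nxe", "ssq", "ljl"]
TASKS = {"asr": "sptext", "mt": "t2t"}  # task -> seacrowd schema


def build_config_names():
    """Enumerate one source and one seacrowd config per (language, task)."""
    names = []
    for lang in LANGUAGES:
        for task, schema in TASKS.items():
            names.append(f"lio_and_central_flores_{task}_{lang}_source")
            names.append(f"lio_and_central_flores_{task}_{lang}_seacrowd_{schema}")
    return names


print(len(build_config_names()))  # 16: 4 languages x 2 tasks x 2 schemas
```

This also shows where the "16 configs" figure mentioned later in the thread comes from: 4 languages x 2 tasks x 2 schemas (source and seacrowd).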
joanitolopo commented 8 months ago

Hi @holylovenia.

Could we have multiple subsets instead of multiple dataloaders?

I assume we will have 16 configs in total, since there are four languages and two tasks, each in both source and seacrowd schemas.

For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Am I right? Thank you!

github-actions[bot] commented 7 months ago

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

holylovenia commented 7 months ago

I assume we will have 16 configs in total, since there are four languages and two tasks, each in both source and seacrowd schemas.

For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Am I right? Thank you!

Yes. For the MT, could you please use lio_and_central_flores_mt_eng_{lang}_seacrowd_t2t instead of lio_and_central_flores_mt_{lang}_seacrowd_t2t? Just for clarity's sake.

Sorry for the late reply.
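The renamed MT configs can be illustrated with a one-liner. A minimal sketch, again assuming the four SEA language codes from the datasheet; `eng` marks the English side of each translation pair.

```python
# Hypothetical sketch of the clarified MT config naming
# (eng_{lang} pair order), assuming the datasheet's SEA language codes.
SEA_LANGS = ["end", "nxe", "ssq", "ljl"]
mt_configs = [
    f"lio_and_central_flores_mt_eng_{lang}_seacrowd_t2t" for lang in SEA_LANGS
]
print(mt_configs[0])  # lio_and_central_flores_mt_eng_end_seacrowd_t2t
```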

holylovenia commented 7 months ago

Adding top-priority and bonus+ labels because we would need this for the experiments.

holylovenia commented 6 months ago

Hi @joanitolopo, may I know if you need any help with the dataloader?

SamuelCahyawijaya commented 6 months ago

Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems nearly impossible to create a useful ASR dataset from this data: the audio cannot be aligned because there are no clear timestamps, and it is noisy and sometimes repetitive.

Nonetheless, I think we can keep the machine translation task, since source-to-English sentence pairs are provided in the transcription file.

holylovenia commented 6 months ago

Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems nearly impossible to create a useful ASR dataset from this data: the audio cannot be aligned because there are no clear timestamps, and it is noisy and sometimes repetitive.

Nonetheless, I think we can keep the machine translation task, since source-to-English sentence pairs are provided in the transcription file.

Noted, thanks @joanitolopo @SamuelCahyawijaya! But I'll keep the datasheet as-is with ASR and MT tasks since the dataset provides the resources needed for these tasks—albeit with additional postprocessing steps.