SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
55 stars · 54 forks

Create dataset loader for CMU Wilderness Multilingual Speech Dataset #343

Open · SamuelCahyawijaya opened this issue 5 months ago

SamuelCahyawijaya commented 5 months ago

Dataloader name: cmu_wilderness_multilingual_speech_dataset/cmu_wilderness_multilingual_speech_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cmu_wilderness_multilingual_speech_dataset

Dataset cmu_wilderness_multilingual_speech_dataset
Description The CMU Wilderness Multilingual Speech Dataset is a speech dataset of aligned sentences and audio for around 700 different languages. It is based on readings of the New Testament from Bible.is. It provides data to allow the building of Kaldi ASR models and Festvox TTS voices in the target languages.
Subsets -
Languages mhx, ifk, tlb, nod, ilo, frd, cgc, tha, cfm, bgr, blt, atq, dtp, cmr, amk, ptu, jav, lsi, nij, mhy, acn, prf, alj, lnd, kzf, pww, sda, mbb, ify, mbt, iba, pse, kje, gbi, mog, alp, twb, law, dni, ahk, rej, bcl, nlc, plw, zyp, lew, mad, txa, bpr, min, kne, agn, mqj, itv, gor, bts, twu, mwv, sml, npy, khm, sas, krj, ury, obo, kqe, mrw, ifb, mvp, cmo, por, xsb, ljp, bru, ban, ind, cnk, sgb, mak, nia, sun, hnn, ceb, btd, lao, pam, kac, ifa, blz, bps, ctd, mnb, pmf, hil, sxn, bep, ppk, mej, ace, ifu, tgl, lex, vie, btx, lhu, pag, xmm, bhz, tby
Tasks Automatic Speech Recognition, Text-To-Speech Synthesis
License Unknown (unknown)
Homepage http://festvox.org/cmu_wilderness/
HF URL -
Paper URL https://ieeexplore.ieee.org/document/8683536
akhdanfadh commented 4 months ago

It seems the speech data needs to be scraped from a particular website, and the official codebases for the paper do not account for the changed website structure. These links may be helpful:

- Opened issues on the repo: https://github.com/festvox/datasets-CMU_Wilderness/issues/11 and https://github.com/festvox/datasets-CMU_Wilderness/issues/1
- Reference scraper here, probably fixed it but have not yet tested

holylovenia commented 3 months ago

> It seems the speech data needs to be scraped from a particular website, and the official codebases for the paper do not account for the changed website structure. These links may be helpful:
> - Opened issues on the repo: https://github.com/festvox/datasets-CMU_Wilderness/issues/11 and https://github.com/festvox/datasets-CMU_Wilderness/issues/1
> - Reference scraper here, probably fixed it but have not yet tested

Thanks for inspecting this, @akhdanfadh! May I ask if you'll be able to check if the reference scraper fixed the problem or not (at least for the SEA languages)?

Also, it seems implementing this dataloader warrants a bonus since it's more complex than the others.

akhdanfadh commented 3 months ago

Got it @holylovenia, will do by Friday night.

holylovenia commented 3 months ago

> Got it @holylovenia, will do by Friday night.

Thanks a lot, @akhdanfadh!!

akhdanfadh commented 3 months ago

After further investigation, this problem is more about the dataset itself being out of date, not just the outdated website scraper. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, there are 3 LANGIDs for the Indonesian data (INZNTV, INZSHL, INZTSI), but on the current Bible website the Indonesian codes are INDASV and INDTSI. Given this, I think it will be difficult to implement the dataloader, because someone will inevitably have to match the existing dataset against the latest data on the website for all ASEAN languages.

@holylovenia @SamuelCahyawijaya @sabilmakbar

holylovenia commented 3 months ago

> After further investigation, this problem is more about the dataset itself being out of date, not just the outdated website scraper. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, there are 3 LANGIDs for the Indonesian data (INZNTV, INZSHL, INZTSI), but on the current Bible website the Indonesian codes are INDASV and INDTSI. Given this, I think it will be difficult to implement the dataloader, because someone will inevitably have to match the existing dataset against the latest data on the website for all ASEAN languages.
>
> @holylovenia @SamuelCahyawijaya @sabilmakbar

Tough. I was looking through the dataset website too, and it seems like they have outgrown the CMU Wilderness dataset's coverage.

Have you taken a look at their API, @akhdanfadh? It seems like we should be able to access all of their data through the API.

holylovenia commented 3 months ago

May I know if there is any update on this, @akhdanfadh?

akhdanfadh commented 3 months ago

I haven't looked into this yet; will do this week.

holylovenia commented 3 months ago

> I haven't looked into this yet; will do this week.

Sure! @yongzx will also help inspect this issue.

akhdanfadh commented 3 months ago

I just requested the API key. This was their response:

> We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.

I will update here on anything related from their docs.

EDIT: Done @holylovenia 😵‍💫


**API key is needed to access the data as a whole**

> An API key is required for development and production use. However, sometimes you just want to get a feel for what the data looks like, and plan how you will interact with the data. For that use only, a generic API key is provided as part of the collection. This key is rate-limited to 1000 requests per month.

We can actually explore them from Example Workflows. But if someone wants to access it through a dataloader, it is necessary to implement an API key input in the code later on.
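
As a rough illustration of what that API-key input could look like, here is a minimal sketch using plain HuggingFace `datasets` base classes; the class names and the `api_key` parameter are hypothetical, and the actual SEACrowd dataloader template may differ.

```python
import datasets


class CMUWildernessConfig(datasets.BuilderConfig):
    """Hypothetical builder config carrying a user-supplied Bible.is API key."""

    def __init__(self, api_key: str = None, **kwargs):
        super().__init__(**kwargs)
        self.api_key = api_key  # never hard-coded; supplied by the user at load time


class CMUWilderness(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = CMUWildernessConfig
    BUILDER_CONFIGS = [CMUWildernessConfig(name="default")]

    def _info(self):
        return datasets.DatasetInfo(description="CMU Wilderness (sketch only)")

    def _split_generators(self, dl_manager):
        if not self.config.api_key:
            raise ValueError("Pass api_key=... to load_dataset() to access the Bible.is API.")
        # The key would then be attached to the /download/* requests when building splits.
        return []

    def _generate_examples(self, **kwargs):
        yield from ()
```

Loading would then look like `datasets.load_dataset("path/to/cmu_wilderness_multilingual_speech_dataset.py", api_key="YOUR_KEY")`, since extra keyword arguments are forwarded to the builder config.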

**Not all contents are available to download**

From the API core concept:

> The license allows for certain content to be downloaded for offline personal use within the application if the content is specifically marked as permitted for download within the API. The API indicates applicable content via the /download/list endpoint. This endpoint requires an API Key; any fileset from the resulting content list can be downloaded via the /download/:filesetid endpoint. **Note that the content must remain within the application**; the license allows the content to only be consumed by the application associated with the API Key.

Note the bold sentence. I am not sure what "application" means here.
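
Once a key is granted, checking what is downloadable should look roughly like the sketch below. Only the /download/list path comes from the docs quoted above; the base URL, the key/v query parameters, and the response shape are my assumptions and may need adjusting.

```python
# Sketch only: ask the API which filesets are marked as downloadable.
import requests

API_BASE = "https://4.dbt.io/api"  # assumed base URL, not verified
API_KEY = "YOUR_API_KEY"           # the generic exploration key is limited to 1000 requests/month

resp = requests.get(
    f"{API_BASE}/download/list",      # endpoint named in the API docs
    params={"key": API_KEY, "v": 4},  # query-parameter names assumed
    timeout=30,
)
resp.raise_for_status()
# Response shape assumed: a "data" list of fileset entries.
filesets = resp.json().get("data", [])
print(f"{len(filesets)} filesets marked as downloadable")
```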

**Testing on INZNTV FilesetId**

  1. Downloaded the index files for reconstructing alignments: INZNTV.tar.gz, provided on the dataset website.
  2. Extracted and opened INZNTV to get the full FilesetId: INZNTVN2DA. This is what is used to access the API.
  3. Tried to download the data using their API and got "403 Forbidden".

I also tried what is presumably the most-accessed data, English, with FilesetId EN1NIVN2DA, and still got "403 Forbidden". I guess I will have to wait for the requested API key.
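
For reference, the download attempt is shaped roughly like the sketch below, assuming the same (unverified) base URL and query parameters as above.

```python
# Sketch: try to fetch a specific fileset via /download/:filesetid.
import requests

API_BASE = "https://4.dbt.io/api"  # assumed base URL, not verified
API_KEY = "YOUR_API_KEY"           # without an approved key this currently fails
FILESET_ID = "INZNTVN2DA"          # full FilesetId taken from the extracted INZNTV index

resp = requests.get(
    f"{API_BASE}/download/{FILESET_ID}",  # endpoint named in the API docs
    params={"key": API_KEY, "v": 4},      # query-parameter names assumed
    timeout=30,
)
print(resp.status_code)  # currently 403 Forbidden for both INZNTVN2DA and EN1NIVN2DA
```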

**Release date confusion**

The release date for the INZNTV data specified here (see the entry with the INDNTV id) is mid-2021, but note that the CMU dataset was released in March 2019. I haven't yet found any data versioning in the Bible website API, so I am not sure whether the data will match.

holylovenia commented 3 months ago

> I just requested the API key. This was their response:
>
> > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.
>
> I will update here on anything related from their docs.
>
> EDIT: Done @holylovenia 😵‍💫

Thanks a lot, @akhdanfadh! It seems that we will have to update some info for the corresponding datasheet too. 😵‍💫 Tagging @yongzx here too in case we need another pair of eyes for discussion and/or this dataloader implementation.

> Note that the content must remain within the application.

I think it means that users are not permitted to upload the data anywhere else. All usage should go through the API with an API key.

> **Release date confusion**

From how things have unfolded, CMU Wilderness and the current Bible website seem to have different sets of data and distinct metadata. Let's follow the current Bible website, since it's the one that provides the dataset now. We can even change the datasheet name and the dataloader name if needed.

cc: @SamuelCahyawijaya @sabilmakbar for your information.

akhdanfadh commented 2 months ago

> I just requested the API key. This was their response:
>
> > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.

I've got no update from them on the API key yet. Still waiting for further instructions. @holylovenia @yongzx

holylovenia commented 2 months ago

> > I just requested the API key. This was their response:
> >
> > > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.
>
> I've got no update from them on the API key yet. Still waiting for further instructions. @holylovenia @yongzx

Got it, there's nothing we can do without API access for now. 👍 It also seems unlikely that we can use this dataset for the experiments.

If there's no response by the end of SEACrowd, I might add a note to the corresponding datasheet or deprecate it.

Thanks @akhdanfadh! Please keep us updated if there's some news.