Open SamuelCahyawijaya opened 5 months ago
It seems the speech data needs to be scraped from a particular website, and the official codebases for the paper do not account for the changed website structure. These links may be helpful: open issues on the repo, https://github.com/festvox/datasets-CMU_Wilderness/issues/11 and https://github.com/festvox/datasets-CMU_Wilderness/issues/1. Reference scraper here; it probably fixes the problem but has not been tested yet.
Thanks for inspecting this, @akhdanfadh! May I ask if you'll be able to check if the reference scraper fixed the problem or not (at least for the SEA languages)?
Also, it seems implementing this dataloader warrants a bonus since it's more complex than the others.
Got it @holylovenia, will do by Friday night.
Thanks a lot, @akhdanfadh!!
After further observation, this problem goes beyond the outdated website scraper: the dataset itself is not up to date. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, the Indonesian dataset has three LANGIDs (INZNTV, INZSHL, INZTSI), but the codes on the current Bible website for Indonesian are INDASV and INDTSI. Given this, I think the dataloader will be difficult to implement, because someone would inevitably have to match the existing dataset against the latest data on the website for all ASEAN languages.
@holylovenia @SamuelCahyawijaya @sabilmakbar
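The mismatch described above can be made concrete with a small sketch. It assumes, purely for illustration, that the last three characters of an ID act as a version code that survives the renaming; that split is a guess from the examples quoted here and is not documented by the dataset or the website. The helper name `match_by_suffix` is hypothetical.

```python
# Codes quoted in the comment above.
cmu_ids = ["INZNTV", "INZSHL", "INZTSI"]  # LANGIDs on the dataset website
site_ids = ["INDASV", "INDTSI"]           # codes on the current Bible website


def match_by_suffix(old_ids, new_ids, suffix_len=3):
    """Pair IDs whose trailing characters agree (ASSUMED to be a version code)."""
    new_by_suffix = {new_id[-suffix_len:]: new_id for new_id in new_ids}
    matched = {}
    unmatched = []
    for old_id in old_ids:
        new_id = new_by_suffix.get(old_id[-suffix_len:])
        if new_id is None:
            unmatched.append(old_id)  # needs manual matching
        else:
            matched[old_id] = new_id
    return matched, unmatched


matched, unmatched = match_by_suffix(cmu_ids, site_ids)
print(matched)
print(unmatched)
```

Even under this assumption, only one of the three Indonesian IDs pairs up automatically; the other two would still need to be matched by hand.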
Tough. I was looking through the dataset website too and it seems like they have outgrown CMU Wilderness dataset's coverage.
Have you taken a look at their API, @akhdanfadh? It seems like we should be able to access all of their data through the API.
May I know if there is any update on this, @akhdanfadh?
I haven't looked into this yet; will do this week.
Sure! @yongzx will also help inspect this issue.
I am requesting the API key just now. This was what they said.
We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.
I will update here on anything related from their docs.
EDIT: Done @holylovenia 😵‍💫
An API key is required for development and production use. However, sometimes you just want to get a feel for what the data looks like, and plan how you will interact with the data. For that use only, a generic API key is provided as part of the collection. This key is rate-limited to 1000 requests per month.
We can actually explore them from Example Workflows. But if someone wants to access it through a dataloader, it is necessary to implement an API key input in the code later on.
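A minimal sketch of what such an API key input could look like in a dataloader, assuming the key is supplied through an environment variable. The variable name `BIBLE_API_KEY` and the helper `get_api_key` are hypothetical and not part of any existing codebase.

```python
import os


def get_api_key(env_var="BIBLE_API_KEY", environ=None):
    """Return the API key from the environment, or raise a clear error.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise ValueError(f"Set the {env_var} environment variable to your API key.")
    return key
```

Reading the key from the environment keeps it out of the repository, so each user runs the dataloader with their own key.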
From the API core concept:
The license allows for certain content to be downloaded for offline personal use within the application if the content is specifically marked as permitted for download within the API. The API indicates applicable content via the /download/list endpoint. This endpoint requires an API Key; any fileset from the resulting content list can be downloaded via the /download/:filesetid endpoint. Note that the content must remain within the application; the license allows the content to only be consumed by the application associated with the API Key.
Note the sentence "the content must remain within the application". Not sure what "application" means here.
FilesetId: INZNTVN2DA. This will be used to access the API. I tried testing it with what I assume is the most accessed data, English with FilesetId EN1NIVN2DA, and still got "403 Forbidden". I guess I am waiting for the requested API key.
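The kind of request tried above can be sketched as follows. Only the `/download/:filesetid` path comes from the quoted documentation; the host `api.example.org` is a placeholder, and passing the key as a `key` query parameter is an assumption, so both would need to be checked against the real API docs.

```python
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.org"  # placeholder, NOT the real API host


def download_url(fileset_id, api_key, base_url=BASE_URL):
    """Build a /download/:filesetid URL (the `key` parameter name is assumed)."""
    query = urllib.parse.urlencode({"key": api_key})
    return f"{base_url}/download/{urllib.parse.quote(fileset_id)}?{query}"


def try_download(fileset_id, api_key):
    """Return the response bytes, or None if the server answers 403 Forbidden."""
    try:
        with urllib.request.urlopen(download_url(fileset_id, api_key)) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 403:  # key missing, not yet approved, or content not permitted
            return None
        raise


print(download_url("INZNTVN2DA", "MY_KEY"))
```

Treating 403 as "no access yet" rather than a crash lets a dataloader report a readable message while the API key request is still pending.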
The release date for the INZNTV ID listed here (see the entry with the INDNTV ID) is mid-2021, but please note that the CMU dataset was released in March 2019. I haven't found any data versioning in the Bible website API yet, so I'm not sure whether the data will match.
Thanks a lot, @akhdanfadh! It seems that we will have to update some info for the corresponding datasheet too. 😵‍💫 Tagging @yongzx here too in case we need another pair of eyes for discussion and/or this dataloader implementation.
Note that the content must remain within the application.
I think it means that users are not permitted to upload the data anywhere else. All usage should go through the API with an API key.
Release date confusion
From how things have unfolded, CMU Wilderness and the current Bible website seem to have different sets of datasets and distinct metadata. Let's follow the current Bible website since it's the one that provides the dataset now. We can even change the datasheet name and the dataloader name if needed.
cc: @SamuelCahyawijaya @sabilmakbar for your information.
I've got no update on the API key request yet. Still waiting for further instructions. @holylovenia @yongzx
Got it, there's nothing we can do without the API access for now. 👍 It seems unlikely we can use this dataset for the experiment as well.
If there's no response by the end of SEACrowd, I might add a note on the corresponding datasheet or deprecate it.
Thanks @akhdanfadh! Please keep us updated if there's some news.
Dataloader name: cmu_wilderness_multilingual_speech_dataset/cmu_wilderness_multilingual_speech_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cmu_wilderness_multilingual_speech_dataset