SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Create dataset loader for struct_amb_ind #267

Closed SamuelCahyawijaya closed 6 months ago

SamuelCahyawijaya commented 9 months ago

Dataloader name: struct_amb_ind/struct_amb_ind.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?struct_amb_ind

Dataset: struct_amb_ind
Description: The first Indonesian speech dataset of structurally ambiguous utterances, each paired with its transcription and two disambiguation texts.
Subsets: -
Languages: ind
Tasks: Automatic Speech Recognition
License: Unknown (unknown)
Homepage: https://github.com/ha3ci-lab/struct_amb_ind
HF URL: -
Paper URL: https://aclanthology.org/2023.emnlp-main.1045.pdf
jensan-1 commented 8 months ago

self-assign

jensan-1 commented 8 months ago

I need to contact the dataset provider. The dataset requires Git LFS to download (all the zip files in the speech folders are Git LFS pointers), but this error occurred: (image)
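For anyone hitting the same situation, here is a minimal sketch of how to check whether the zip files in a clone are real archives or only Git LFS pointers. It relies on the fact that a pointer file is a small text file beginning with the Git LFS spec line; the helper names `is_lfs_pointer` and `find_lfs_pointers` and the `*.zip` pattern are illustrative assumptions, not part of the dataset repo or of SEACrowd.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this spec line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer rather than real data."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_lfs_pointers(root, pattern="*.zip"):
    """List files under `root` matching `pattern` that are still LFS pointers."""
    return sorted(p for p in Path(root).rglob(pattern) if is_lfs_pointer(p))
```

If `find_lfs_pointers` returns a non-empty list for the cloned repo, the actual archives have not been fetched and still need to come from Git LFS or from another source.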

jensan-1 commented 8 months ago

Hello all, I have tried contacting the dataset provider for struct_amb_ind, but there has been no response. I think I will unassign myself from this task if the author does not respond by early next week.

jensan-1 commented 8 months ago

The author has not responded about the Git LFS bandwidth problem. I am unassigning myself from this task and may retake it once there is an update on the problem.

holylovenia commented 8 months ago

Hi @jen-santoso, sorry for the late reply. @ruhiyahfw, the dataset owner, is looking into the problem causing this right now. Let's wait for an update from her for the time being. 🙏 Thanks for waiting!

holylovenia commented 6 months ago

Hi @ruhiyahfw, is there any update on this?

holylovenia commented 6 months ago

Due to some technical issues, the dataset owner can't push the data to the repo. However, she gave me access to the data via other means. Maybe we can treat it as a _LOCAL = True dataloader going forward for now, @jen-santoso? I'll send you the data URL via Discord.

jensan-1 commented 6 months ago

Thank you @holylovenia! I will retake the ticket!

jensan-1 commented 6 months ago

self-assign

SamuelCahyawijaya commented 6 months ago

So, how can users get the data for this dataset? I think we can add this information to the dataloader as well (for example: "please contact xxx to get access to the dataset"). What do you think? @jen-santoso @holylovenia
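One way this could look, as a rough sketch only: the dataloader keeps the `_LOCAL = True` flag mentioned above and fails with the access instructions when no local copy is given. The `resolve_data_dir` helper, the `data_dir` parameter, and the message wording are assumptions for illustration, and "xxx" is left as the placeholder from this comment; this is not the actual SEACrowd loader code.

```python
import os

_DATASETNAME = "struct_amb_ind"
_LOCAL = True  # data cannot be downloaded automatically; users supply a local copy

# Hypothetical wording; "xxx" is the placeholder contact from this thread.
_ACCESS_INSTRUCTIONS = (
    f"The {_DATASETNAME} data is not publicly downloadable. "
    "Please contact xxx to get access to the dataset, "
    "then pass the local folder as data_dir."
)


def resolve_data_dir(data_dir):
    """Return the validated local data directory, or raise with access instructions."""
    if data_dir is None:
        # No local copy was given: tell the user how to obtain the data.
        raise ValueError(_ACCESS_INSTRUCTIONS)
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"data_dir does not exist: {data_dir}")
    return os.path.abspath(data_dir)
```

With this shape, a user who runs the loader without the data sees the contact instructions right in the error message instead of a failed download.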