SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
65 stars 57 forks

Closes #274 | Create OIL data loader #389

Closed Enliven26 closed 7 months ago

Enliven26 commented 8 months ago

Closes #274


MJonibek commented 8 months ago

@Enliven26 I checked the dataloader; it works fine. However, I am not sure whether it fulfills the purpose of the SPEECH_RECOGNITION task, because in this dataset every audio has a "label" (which is just an ID) instead of a transcript. Maybe we need to find where the transcripts are stored (they are not on Hugging Face) or change the task.

@jen-santoso What do you think?
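The mismatch described above can be sketched as follows. This is a hypothetical illustration; the field names (`audio`, `text`, `label`) and values are illustrative, not taken from the actual dataset:

```python
# What a SPEECH_RECOGNITION (ASR) example is expected to carry:
asr_example = {
    "audio": "clip_001.wav",
    "text": "selamat pagi",   # transcript of the spoken audio
}

# What this dataset actually provides: a label that is just an ID,
# with no transcript of the audio content.
oil_example = {
    "audio": "clip_001.wav",
    "label": "spk_042",       # an identifier, not a transcription
}

# The ASR task needs a transcript field; this dataset has none.
has_transcript = "text" in oil_example
print(has_transcript)  # → False
```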

Enliven26 commented 7 months ago

Hi @MJonibek, may I know if the transcripts have been found? Thanks

jensan-1 commented 7 months ago

> @Enliven26 I checked the dataloader. Works fine. However, I am not sure does it fulfill the purpose of SPEECH_RECOGNITION task. Because in this dataset every audio instead of the transcript has "label" (that is just id). Maybe we need to find where transcripts are stored (they are not on hugging face) or change the task.
>
> @jen-santoso What do you think?

Sorry for the late notice. I think we should try asking the authors about the transcripts (whether they are available or not), as it is strange to have an ASR dataset without any transcriptions for the audio... @SamuelCahyawijaya @holylovenia @sabilmakbar what do you think?

In the meantime, I will review the dataloader using the available information.

Enliven26 commented 7 months ago

Hi, I changed the citation and added the init file. Are there any other necessary changes? Thanks

jensan-1 commented 7 months ago

@Enliven26 Thanks for the updates! We are still working on finding the transcripts. The current changes LGTM, but we will follow up with you later depending on the circumstances with the transcripts.

holylovenia commented 7 months ago

> > @Enliven26 I checked the dataloader. Works fine. However, I am not sure does it fulfill the purpose of SPEECH_RECOGNITION task. Because in this dataset every audio instead of the transcript has "label" (that is just id). Maybe we need to find where transcripts are stored (they are not on hugging face) or change the task. @jen-santoso What do you think?
>
> Sorry for the late notice. I think we should try to ask the authors about the transcripts (whether they are available or not), as it is weird to have ASR task dataset without any transcriptions for the audio.... @SamuelCahyawijaya @holylovenia @sabilmakbar what do you think?
>
> In the meantime, I will review the dataloader using the available information.

I agree with you both, @jen-santoso and @MJonibek. If there's no transcription, it'd be better to change the task.

Enliven26 commented 7 months ago

May I know to which task it is best to change?

holylovenia commented 7 months ago

> May I know to which task it is best to change?

Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?

I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

Enliven26 commented 7 months ago

> > May I know to which task it is best to change?
>
> Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?
>
> I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

May I know where I can find the email @holylovenia ? Thanks!

holylovenia commented 7 months ago

> > May I know to which task it is best to change?
> >
> > Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26? I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?
>
> May I know where I can find the email @holylovenia ? Thanks!

Please try with this email: Zara.Maxwell-Smith@anu.edu.au I got it from this paper.

jensan-1 commented 7 months ago

> > May I know to which task it is best to change?
>
> Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?
>
> I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

@holylovenia @MJonibek @Enliven26 In the meantime, we should just implement the source schema only for this PR... Once we get the transcription, we can re-open (or create a new) issue. Please let us know your opinions.
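The "source schema only" proposal above can be sketched roughly as follows. The names (`SEACrowdConfig`, `BUILDER_CONFIGS`, the `seacrowd_sptext` ASR schema) follow the common SEACrowd datahub dataloader template, but treat the details here as an assumption-laden illustration rather than the actual OIL loader code:

```python
from dataclasses import dataclass


@dataclass
class SEACrowdConfig:
    """Simplified stand-in for the datahub's dataloader config class."""
    name: str
    schema: str  # "source" or a seacrowd schema such as "seacrowd_sptext"


# With no transcripts available, only the source schema is exposed;
# the seacrowd_sptext (ASR) config is dropped until transcripts appear.
BUILDER_CONFIGS = [
    SEACrowdConfig(name="oil_source", schema="source"),
    # SEACrowdConfig(name="oil_seacrowd_sptext", schema="seacrowd_sptext"),
]

schemas = [cfg.schema for cfg in BUILDER_CONFIGS]
print(schemas)  # → ['source']
```

If transcripts are later obtained, re-adding the commented-out ASR config would restore the SPEECH_RECOGNITION task without touching the source schema.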

holylovenia commented 7 months ago

> > May I know to which task it is best to change?
> >
> > Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26? I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?
>
> @holylovenia @MJonibek @Enliven26 In the meantime, we should just implement the source schema only for this PR... Once we get the transcription, we can re-open (or create a new) issue. Please let us know your opinions.

Agreed. I added a source-only flag to the issue.

Enliven26 commented 7 months ago

Please review my changes removing the seacrowd schema from the dataloader, thanks! @jen-santoso @MJonibek

jensan-1 commented 7 months ago

Thank you for the update @Enliven26! LGTM, merging now...