SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
65 stars 57 forks

Closes #274 | Create OIL data loader #389

Closed Enliven26 closed 7 months ago

Enliven26 commented 8 months ago

Closes #274


MJonibek commented 8 months ago

@Enliven26 I checked the dataloader; it works fine. However, I am not sure whether it fulfills the purpose of the SPEECH_RECOGNITION task, because in this dataset every audio has a "label" (which is just an ID) instead of a transcript. Maybe we need to find where the transcripts are stored (they are not on Hugging Face) or change the task.

@jen-santoso What do you think?
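The mismatch described above can be sketched as follows. This is a hypothetical illustration; the field names (`audio`, `text`, `label`) and values are illustrative, not taken from the actual dataset:

```python
# What a SPEECH_RECOGNITION (ASR) example is expected to carry:
asr_example = {
    "audio": "clip_001.wav",
    "text": "selamat pagi",   # transcript of the spoken audio
}

# What this dataset actually provides: a label that is just an ID,
# with no transcript of the audio content.
oil_example = {
    "audio": "clip_001.wav",
    "label": "spk_042",       # an identifier, not a transcription
}

# The ASR task needs a transcript field; this dataset has none.
has_transcript = "text" in oil_example
print(has_transcript)  # → False
```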

Enliven26 commented 7 months ago

Hi @MJonibek, may I know if the transcripts have been found? Thanks

jensan-1 commented 7 months ago

> @Enliven26 I checked the dataloader. Works fine. However, I am not sure does it fulfill the purpose of SPEECH_RECOGNITION task. Because in this dataset every audio instead of the transcript has "label" (that is just id). Maybe we need to find where transcripts are stored (they are not on hugging face) or change the task.
>
> @jen-santoso What do you think?

Sorry for the late notice. I think we should try asking the authors about the transcripts (whether they are available or not), as it is strange to have an ASR dataset without any transcriptions for the audio... @SamuelCahyawijaya @holylovenia @sabilmakbar what do you think?

In the meantime, I will review the dataloader using the available information.

Enliven26 commented 7 months ago

Hi, I changed the citation and added the init file. Are there any other necessary changes? Thanks

jensan-1 commented 7 months ago

@Enliven26 Thanks for the updates! We are still working on finding the transcripts. The current changes LGTM, but we will follow up with you later depending on the circumstances with the transcripts.

holylovenia commented 7 months ago

> > @Enliven26 I checked the dataloader. Works fine. However, I am not sure does it fulfill the purpose of SPEECH_RECOGNITION task. Because in this dataset every audio instead of the transcript has "label" (that is just id). Maybe we need to find where transcripts are stored (they are not on hugging face) or change the task. @jen-santoso What do you think?
>
> Sorry for the late notice. I think we should try to ask the authors about the transcripts (whether they are available or not), as it is weird to have ASR task dataset without any transcriptions for the audio.... @SamuelCahyawijaya @holylovenia @sabilmakbar what do you think?
>
> In the meantime, I will review the dataloader using the available information.

I agree with you both, @jen-santoso and @MJonibek. If there's no transcription, it'd be better to change the task.

Enliven26 commented 7 months ago

May I know to which task it is best to change?

holylovenia commented 7 months ago

> May I know to which task it is best to change?

Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?

I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

Enliven26 commented 7 months ago

> > May I know to which task it is best to change?
>
> Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?
>
> I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

May I know where I can find the email @holylovenia ? Thanks!

holylovenia commented 7 months ago

> > May I know to which task it is best to change?
> >
> > Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26? I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?
>
> May I know where I can find the email @holylovenia ? Thanks!

Please try with this email: Zara.Maxwell-Smith@anu.edu.au I got it from this paper.

jensan-1 commented 7 months ago

> > May I know to which task it is best to change?
>
> Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?
>
> I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?

@holylovenia @MJonibek @Enliven26 In the meantime, we should just implement the source schema only for this PR... Once we get the transcription, we can re-open (or create a new) issue. Please let us know your opinions.
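The "source schema only" proposal above can be sketched roughly as follows. The names (`SEACrowdConfig`, `BUILDER_CONFIGS`, the `seacrowd_sptext` ASR schema) follow the common SEACrowd datahub dataloader template, but treat the details here as an assumption-laden illustration rather than the actual OIL loader code:

```python
from dataclasses import dataclass


@dataclass
class SEACrowdConfig:
    """Simplified stand-in for the datahub's dataloader config class."""
    name: str
    schema: str  # "source" or a seacrowd schema such as "seacrowd_sptext"


# With no transcripts available, only the source schema is exposed;
# the seacrowd_sptext (ASR) config is dropped until transcripts appear.
BUILDER_CONFIGS = [
    SEACrowdConfig(name="oil_source", schema="source"),
    # SEACrowdConfig(name="oil_seacrowd_sptext", schema="seacrowd_sptext"),
]

schemas = [cfg.schema for cfg in BUILDER_CONFIGS]
print(schemas)  # → ['source']
```

If transcripts are later obtained, re-adding the commented-out ASR config would restore the SPEECH_RECOGNITION task without touching the source schema.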

holylovenia commented 7 months ago

> > May I know to which task it is best to change?
> >
> > Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26? I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?
>
> @holylovenia @MJonibek @Enliven26 In the meantime, we should just implement the source schema only for this PR... Once we get the transcription, we can re-open (or create a new) issue. Please let us know your opinions.

Agreed. I added a source-only flag to the issue.

Enliven26 commented 7 months ago

Please review my changes removing the seacrowd schema from the dataloader, thanks! @jen-santoso @MJonibek

jensan-1 commented 7 months ago

Thank you for the update @Enliven26! LGTM, merging now...