Closed Enliven26 closed 7 months ago
@Enliven26 I checked the dataloader. It works fine. However, I am not sure it fulfills the purpose of the SPEECH_RECOGNITION task, because in this dataset every audio has a "label" (which is just an id) instead of a transcript. Maybe we need to find where the transcripts are stored (they are not on Hugging Face) or change the task.
@jen-santoso What do you think?
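To make the concern concrete, here is a minimal sketch (with a hypothetical example row; the real field names in the dataset may differ) of why an id-like "label" cannot stand in for a transcript in an ASR task:

```python
# Hypothetical illustration: the dataset's "label" field holds an utterance id,
# not a transcript, so it cannot serve a SPEECH_RECOGNITION task.
example = {"audio": "clip_0001.wav", "label": "clip_0001"}  # hypothetical row

def has_transcript(example, text_keys=("transcript", "text", "sentence")):
    """Return True if any plausible transcript field is present and non-empty."""
    return any(example.get(k) for k in text_keys)

print(has_transcript(example))  # False: only an id-like "label" is available
```

This is only an illustration of the check, not code from the dataloader itself.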
Hi @MJonibek, may I know if the transcripts have been found? Thanks
Sorry for the late notice. I think we should ask the authors whether the transcripts are available, as it is strange to have an ASR dataset without any transcriptions for the audio. @SamuelCahyawijaya @holylovenia @sabilmakbar what do you think?
In the meantime, I will review the dataloader using the available information.
Hi, I changed the citation and added the init file. Are there any other necessary changes? Thanks
@Enliven26 Thanks for the updates! We are still working on finding the transcripts. The current changes LGTM, but we will need to inform you later depending on the circumstances with the transcripts.
I agree with you both, @jen-santoso and @MJonibek. If there's no transcription, it'd be better to change the task.
May I know to which task it is best to change?
Good question. I tried asking the dataset owner about the transcriptions here. Could you please try to email her 3 days from now if she still hasn't responded, @Enliven26?
I don't think there's any seacrowd task we can use unless we get the transcription. Maybe we should just implement the source schema if we still can't get the transcriptions in the end. What do you think, @MJonibek @jen-santoso?
May I know where I can find the email, @holylovenia? Thanks!
Please try with this email: Zara.Maxwell-Smith@anu.edu.au. I got it from this paper.
@holylovenia @MJonibek @Enliven26
In the meantime, we should just implement the source schema only for this PR... Once we get the transcription, we can re-open (or create a new) issue. Please let us know your opinions.
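As a rough sketch of what a source-schema-only example generator could yield, assuming each row pairs an audio path with its id-like label (the function and field names here are hypothetical, not taken from the actual dataloader):

```python
# Hypothetical sketch: yield source-schema examples that keep only the fields
# actually present in the dataset (audio path + id-like label, no transcript).
def generate_source_examples(rows):
    """rows: iterable of (audio_path, label) pairs from the source data."""
    for idx, (audio_path, label) in enumerate(rows):
        # No seacrowd ASR schema here, since transcripts are unavailable.
        yield idx, {"id": str(idx), "audio": audio_path, "label": label}
```

The seacrowd schema config would simply be left out of `BUILDER_CONFIGS` until transcripts are found.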
Agreed. I added a source-only flag to the issue.
Please review my changes removing the seacrowd schema from the dataloader, thanks! @jen-santoso @MJonibek
Thank you for the update @Enliven26! LGTM, merging now...
Closes #274
Checkbox
- Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- Provide the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- The `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- The dataloader works with the `datasets.load_dataset` function.
- The dataloader passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.