zsLin177 closed this issue 22 hours ago
The location of the SlideSpeech dataset is set in the config file "mala_asr_config.py". Change "/nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/." to your own path.
The dataset requires four files: "my_wav.scp", "utt2num_samples", "text", "hot_related/ocr_1gram_top50_mmr070_hotwords_list"
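Before running the decoding script, it may help to check that a split directory contains these four files. The following is only a minimal sketch: the helper name `missing_files` is hypothetical and not part of SLAM-LLM, and it assumes the four paths are relative to the `${split}_oracle_v1` directory.

```python
from pathlib import Path

# The four files named above, assumed to be relative to the split directory.
REQUIRED_FILES = [
    "my_wav.scp",
    "utt2num_samples",
    "text",
    "hot_related/ocr_1gram_top50_mmr070_hotwords_list",
]


def missing_files(split_dir):
    """Return the required files that are absent from split_dir (hypothetical helper)."""
    root = Path(split_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all four files are in place; otherwise the list names what still has to be generated.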
"my_wav.scp" is a file of audio path lists. We transform wav file to ark file, so this file looks like ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22 ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
To generate this file, download the audio wavs from https://www.openslr.org/144/ and get the time segments from https://slidespeech.github.io/, which provides the segments, transcription text, and OCR results at https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You then need to segment each wav by the timestamps given in the segments file.
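The segmentation step could look roughly like the sketch below. It is not the script we used: it assumes the segments file follows the Kaldi layout `<utt_id> <recording_id> <start_sec> <end_sec>` and that the recordings are uncompressed PCM wavs readable by Python's standard `wave` module. The names `parse_segments_line` and `cut_segment` are hypothetical.

```python
import wave


def parse_segments_line(line):
    """Parse '<utt_id> <recording_id> <start_sec> <end_sec>' (assumed Kaldi layout)."""
    utt_id, rec_id, start, end = line.split()
    return utt_id, rec_id, float(start), float(end)


def cut_segment(src_wav, dst_wav, start_sec, end_sec):
    """Copy the frames between start_sec and end_sec into dst_wav.

    Returns the number of frames written.
    """
    with wave.open(src_wav, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        start_frame = int(round(start_sec * rate))
        n_frames = int(round(end_sec * rate)) - start_frame
        src.setpos(start_frame)
        frames = src.readframes(n_frames)
    with wave.open(dst_wav, "wb") as dst:
        # Reuse the source format; the frame count in the header is
        # corrected automatically when the data is written.
        dst.setparams(params)
        dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

Each line of the segments file then gives one call to `cut_segment`, with the recording ID mapped to the corresponding downloaded wav.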
related_files.tar.gz also provides "text" and a file named "keywords". The "keywords" file corresponds to "hot_related/ocr_1gram_top50_mmr070_hotwords_list", which contains the hotword list.
"utt2num_samples" contains the length of the wavs, which looks like ID1 103680 ID2 181600 ...
Sorry for the late reply, been busy lately, hope your reproduction goes well!
@yanghaha0908 Thank you for your answer. It is strongly recommended that this answer be written into the mala README file.
I have added it to the README.md file of MaLa-ASR; see https://github.com/X-LANCE/SLAM-LLM/pull/168.
System Info
torch 2.1
🐛 Describe the bug
Hi, I'm currently working on reproducing the results of MaLa-ASR and have downloaded the slidespeech dataset from https://www.openslr.org/144/. While running the provided decoding script, I noticed that it requires the file located at /nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/. Could you please clarify what the format of this file is? Do I need to preprocess the downloaded data in any specific way, such as splitting the audio based on timestamps?
Error logs
no file named test_oracle_v1
Expected behavior
Could you please provide the steps for data processing and explain the format of the data? Thanks, looking forward to your reply.