X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model
MIT License

Data Form of the MaLa-ASR #130

Closed zsLin177 closed 22 hours ago

zsLin177 commented 2 months ago

System Info

torch 2.1

Information

🐛 Describe the bug

bash decode_MaLa-ASR_withkeywords_L95.sh

Hi, I'm currently working on reproducing the results of MaLa-ASR and have downloaded the slidespeech dataset from https://www.openslr.org/144/. While running the provided decoding script, I noticed that it requires the file located at /nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/. Could you please clarify what the format of this file is? Do I need to preprocess the downloaded data in any specific way, such as splitting the audio based on timestamps?

Error logs

no file named test_oracle_v1

Expected behavior

Could you please provide the steps for data processing and explain the format of the data? Thanks, looking forward to your reply.

yanghaha0908 commented 1 month ago

The location of the SlideSpeech dataset is set in the config file "mala_asr_config.py". You can change "/nfs/yangguanrou.ygr/slidespeech/${split}_oracle_v1/" to your own path.

The dataset requires four files: "my_wav.scp", "utt2num_samples", "text", and "hot_related/ocr_1gram_top50_mmr070_hotwords_list".
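A quick way to check that a split directory has the four files listed above could look like the sketch below. The function name `missing_dataset_files` is hypothetical and is not part of SLAM-LLM; only the four relative file names come from this thread.

```python
from pathlib import Path

# Relative file names taken from the answer above.
REQUIRED_FILES = [
    "my_wav.scp",
    "utt2num_samples",
    "text",
    "hot_related/ocr_1gram_top50_mmr070_hotwords_list",
]


def missing_dataset_files(split_dir):
    """Return the required files that are absent under split_dir.

    Hypothetical helper for illustration; not part of SLAM-LLM.
    """
    root = Path(split_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling it on your own `${split}_oracle_v1` directory returns an empty list when all four files are in place.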

"my_wav.scp" is a list of audio paths. We converted the wav files to ark files, so this file looks like:

ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22
ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
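To make the entry layout concrete, here is a small sketch that splits one such entry into its parts. It assumes the Kaldi-style convention that the number after the final colon is a byte offset into the ark file; the function name `parse_scp_line` is hypothetical.

```python
def parse_scp_line(line):
    """Split an scp entry into (utt_id, path, offset).

    Assumes the 'ID path:offset' layout shown in this thread.
    offset is None when the path carries no numeric suffix.
    """
    utt_id, location = line.strip().split(maxsplit=1)
    ark_path, sep, offset = location.rpartition(":")
    if not sep or not offset.isdigit():
        # Plain path without an offset suffix.
        return utt_id, location, None
    return utt_id, ark_path, int(offset)
```

For the first example entry this gives the ID `ID1`, the path ending in `data_wav.ark`, and the integer `22`.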

To generate this file, download the audio wavs from https://www.openslr.org/144/ and get the time segments from https://slidespeech.github.io/, which provides the segments, transcription text, and OCR results at https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You then need to segment each wav using the timestamps in the segments file.
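The segmentation step could be sketched as follows. This is not the script used for MaLa-ASR. It assumes the segments file follows the Kaldi layout `utt_id recording_id start_sec end_sec` and that each recording is an uncompressed PCM wav; both function names are hypothetical.

```python
import wave


def parse_segments(lines):
    """Parse lines assumed to be 'utt_id recording_id start_sec end_sec'."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append((utt_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(src_wav, dst_wav, start_sec, end_sec):
    """Copy [start_sec, end_sec) of src_wav into dst_wav.

    Returns the number of frames written. Times past the end of
    the recording are clamped to its length.
    """
    with wave.open(src_wav, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        start_frame = min(max(int(round(start_sec * rate)), 0), total)
        end_frame = min(max(int(round(end_sec * rate)), start_frame), total)
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)
    with wave.open(dst_wav, "wb") as dst:
        dst.setparams(params)  # frame count in the header is fixed on close
        dst.writeframes(frames)
    return end_frame - start_frame
```

Each parsed tuple supplies the arguments for one `cut_segment` call, with the output named after the utterance ID. Converting the resulting wavs to ark files is a separate step not covered here.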

The related_files.tar.gz archive also provides "text" and a file named "keywords". The "keywords" file corresponds to "hot_related/ocr_1gram_top50_mmr070_hotwords_list", which contains the hotword list.

"utt2num_samples" contains the length of each wav in samples, which looks like:

ID1 103680
ID2 181600
...
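If the segmented audio is available as individual PCM wav files, lines in this format could be produced with a sketch like the one below. It assumes "length" means the number of sample frames per channel; the function names and the ID-to-path mapping are hypothetical.

```python
import wave


def count_samples(wav_path):
    """Return the number of sample frames in a PCM wav file."""
    with wave.open(wav_path, "rb") as w:
        return w.getnframes()


def format_utt2num_samples(utt_to_wav):
    """Build 'ID num_samples' lines from a dict of utt_id -> wav path."""
    return [
        f"{utt_id} {count_samples(path)}"
        for utt_id, path in sorted(utt_to_wav.items())
    ]
```

Writing the returned lines to a file named "utt2num_samples", one per line, gives the layout shown above.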

Sorry for the late reply, been busy lately, hope your reproduction goes well!

nuaalixu commented 1 month ago

@yanghaha0908 Thank you for your answer. It is strongly recommended that this answer be added to the MaLa-ASR README file.

yanghaha0908 commented 22 hours ago

I have added it to the README.md file of Mala-ASR, refer to https://github.com/X-LANCE/SLAM-LLM/pull/168.