Open MLMonkATGY opened 2 weeks ago
Hi,
The two versions in the data preparation repository correspond to the two publications; both are independent and equally valid. The logic is pretty much the same, but the difference is:

- v1 has the recipe for generating telephone-based (narrowband) data
- v2 has the recipes for generating wide-band data (we explored three datasets as audio sources: LibriSpeech, VoxCeleb2, VoxPopuli)

Which one to run depends on what kind of model you want to train (narrowband or wideband), but you do not need to run both.
You can see in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/prepareKaldidata_LibriSpeech.sh how the data preparation works. If you set the paths in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/path.sh to the directory where you have downloaded LibriSpeech and to your Kaldi directory, it should run and prepare the data directories needed to train the model. I am attaching a few files so that you can see what the files inside the data directory should look like.
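For reference, a Kaldi-style data directory is just a few sorted plain-text index files (`wav.scp`, `utt2spk`, `spk2utt`, `segments`). Here is a minimal sketch that writes a toy data directory so you can see the expected layout; the utterance IDs, speaker IDs, and paths are made up for illustration and are not taken from the attached files.

```python
import os

# Toy utterances: (utt_id, spk_id, wav_path, start, end) -- made-up values for illustration
utts = [
    ("spk1-utt1", "spk1", "/data/LibriSpeech/dev-clean/spk1/utt1.flac", 0.0, 3.2),
    ("spk2-utt1", "spk2", "/data/LibriSpeech/dev-clean/spk2/utt1.flac", 0.0, 2.7),
]

def write_kaldi_data_dir(data_dir, utts):
    """Write the minimal set of Kaldi index files, sorted by utterance ID."""
    os.makedirs(data_dir, exist_ok=True)
    utts = sorted(utts)
    with open(os.path.join(data_dir, "wav.scp"), "w") as f:
        for utt, _, wav, _, _ in utts:
            f.write(f"{utt} {wav}\n")                # utt-id -> audio file (or pipe)
    with open(os.path.join(data_dir, "utt2spk"), "w") as f:
        for utt, spk, _, _, _ in utts:
            f.write(f"{utt} {spk}\n")                # utt-id -> speaker-id
    with open(os.path.join(data_dir, "segments"), "w") as f:
        for utt, _, _, start, end in utts:
            # utt-id recording-id start end (here one utterance per recording)
            f.write(f"{utt} {utt} {start} {end}\n")
    # spk2utt is the inverse mapping of utt2spk
    spk2utt = {}
    for utt, spk, _, _, _ in utts:
        spk2utt.setdefault(spk, []).append(utt)
    with open(os.path.join(data_dir, "spk2utt"), "w") as f:
        for spk in sorted(spk2utt):
            f.write(f"{spk} {' '.join(spk2utt[spk])}\n")

write_kaldi_data_dir("data/dev_clean_toy", utts)
```

The actual files produced by the recipe (e.g. in the attached dev_clean.zip) follow this same space-separated, ID-sorted convention.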
`prepare_data_dir.sh` generates the training data for the model based on simulated conversations. Inside DiaPer, `process_data.py` is an optional script in case you want to precompute the features and store them. However, that is not necessary, as the model can be trained if you pass the data directory with the simulated conversations.
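To illustrate what the optional precomputation step buys you: each utterance's waveform is turned into a feature matrix once and saved to disk, so training does not recompute features every epoch. This is only a toy sketch; the feature (per-frame log energy) and the file names are stand-in assumptions, not the actual `process_data.py` logic.

```python
import numpy as np

def frame_log_energy(wave, frame_len=400, hop=160):
    """Toy feature extractor: per-frame log energy (stand-in for real log-Mel features)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    feats = np.empty(n_frames, dtype=np.float32)
    for i in range(n_frames):
        frame = wave[i * hop : i * hop + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)
    return feats

# Precompute-and-store loop (utterance IDs and audio are made up)
rng = np.random.default_rng(0)
for utt_id in ["spk1-utt1", "spk2-utt1"]:
    wave = rng.standard_normal(16000)  # 1 s of fake 16 kHz audio
    np.save(f"{utt_id}.npy", frame_log_energy(wave))
```

If you skip this step, the equivalent computation simply happens on the fly when the training loader reads the data directory.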
dev_clean.zip
I hope this helps.
Thank you for the great work! I need some guidance for preprocessing the training data using https://github.com/BUTSpeechFIT/EEND_dataprep?tab=readme-ov-file as there are 2 different data prep methods in the repo. Are both of the methods in the data prep repo used? Is there any documentation for the structure of the audio data required, such as in a huggingface repo or openslr? I noticed there is `prepare_data_dir.sh` in the example folder and `process_data.py` in the diaper folder. In what order should the `data_prep` process, `prepare_data_dir.sh`, and `process_data.py` be applied for preparing the training dataset? Thanks in advance.