BUTSpeechFIT / DiaPer


Guide for preprocessing data for training #3

Open MLMonkATGY opened 2 weeks ago

MLMonkATGY commented 2 weeks ago

Thank you for the great work! I need some guidance on preprocessing the training data using https://github.com/BUTSpeechFIT/EEND_dataprep?tab=readme-ov-file, as there are two different data preparation methods in that repo.

  1. Are both of the methods in the data prep repo used? Is there any documentation for the required structure of the audio data, such as on a Hugging Face repo or OpenSLR?

  2. I noticed there is a prepare_data_dir.sh in the example folder and a process_data.py in the diaper folder. In what order should the data prep process, prepare_data_dir.sh, and process_data.py be applied when preparing the training dataset?

Thanks in advance.

fnlandini commented 2 weeks ago

Hi,

The two versions in the data preparation repository correspond to the two publications, but both are independent and equally valid. The logic is pretty much the same; the difference is:

- v1 has the recipe for generating telephone-based (narrowband) data
- v2 has the recipes for generating wide-band data (we explored three datasets as audio source: LibriSpeech, VoxCeleb2, VoxPopuli)

Which one to run will depend on what kind of model you want to train (narrowband or wideband), but you do not need to run both.
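To make the choice concrete, here is a minimal sketch of picking one version, assuming you want a wideband model based on LibriSpeech (the checkout location is hypothetical):

```bash
# Sketch: choose v1 or v2 of the data prep repo (do not run both)
git clone https://github.com/BUTSpeechFIT/EEND_dataprep.git
cd EEND_dataprep

# For a narrowband (telephone) model:
# cd v1

# For a wideband model, pick one audio source, e.g. LibriSpeech:
cd v2/LibriSpeech
```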

You can see in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/prepareKaldidata_LibriSpeech.sh how the data preparation works. If you set the paths in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/path.sh to the directory where you have downloaded LibriSpeech and to your Kaldi directory, it should run and prepare the data directories needed to train the model. I am attaching a few files so that you can see what the files inside the data directory should look like.
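For reference, a Kaldi-style data directory typically contains the standard Kaldi files sketched below; this is a generic layout, and the attached dev_clean.zip shows the exact files this recipe produces:

```bash
# Generic sketch of a Kaldi-style data directory (standard Kaldi file names;
# the exact set of files depends on the recipe):
ls data/dev_clean
# wav.scp   -> <recording-id> <path or pipe producing a wav>
# segments  -> <utterance-id> <recording-id> <start-time> <end-time>
# utt2spk   -> <utterance-id> <speaker-id>
# spk2utt   -> <speaker-id> <utterance-id> [<utterance-id> ...]
# rttm      -> diarization reference, if the recipe produces one
```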

prepare_data_dir.sh generates the training data for the model based on simulated conversations. Inside DiaPer, process_data.py is an optional script in case you want to precompute the features and store them. However, that is not necessary, as the model can be trained if you pass the data directory with the simulated conversations.

dev_clean.zip
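Putting the pieces together, the overall order would be as sketched below (arguments omitted; the exact invocations are documented in each repository):

```bash
# Sketch of the overall ordering:
# 1) EEND_dataprep recipe (v1 or v2) -> Kaldi-style data directories
bash prepareKaldidata_LibriSpeech.sh

# 2) DiaPer's prepare_data_dir.sh -> training data with simulated conversations
bash prepare_data_dir.sh

# 3) Optional: precompute and store features; if skipped, training works
#    directly from the data directory with the simulated conversations
python process_data.py
```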

I hope this helps.