BUTSpeechFIT / DiaPer


Guide for preprocessing data for training #3

Open MLMonkATGY opened 2 weeks ago

MLMonkATGY commented 2 weeks ago

Thank you for the great work! I need some guidance on preprocessing the training data using https://github.com/BUTSpeechFIT/EEND_dataprep?tab=readme-ov-file, as there are two different data preparation methods in that repo.

  1. Are both of the methods in the data prep repo used? Is there any documentation for the required structure of the audio data, such as on a Hugging Face repo or OpenSLR?

  2. I noticed there is a prepare_data_dir.sh in the example folder and a process_data.py in the diaper folder. In what order should the data prep process, prepare_data_dir.sh, and process_data.py be applied when preparing the training dataset?

Thanks in advance.

fnlandini commented 2 weeks ago

Hi,

The two versions in the data preparation repository correspond to the two publications, but both are independent and equally valid. The logic is pretty much the same; the difference is:

- v1 has the recipe for generating telephone-based (narrowband) data
- v2 has the recipes for generating wide-band data (we explored three datasets as audio source: LibriSpeech, VoxCeleb2, VoxPopuli)

Which one to run will depend on what kind of model you want to train (narrowband or wideband), but you do not need to run both.
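To make the choice concrete, here is a minimal sketch of picking one version, assuming you want a wideband model based on LibriSpeech (the checkout location is hypothetical):

```bash
# Sketch: choose v1 or v2 of the data prep repo (do not run both)
git clone https://github.com/BUTSpeechFIT/EEND_dataprep.git
cd EEND_dataprep

# For a narrowband (telephone) model:
# cd v1

# For a wideband model, pick one audio source, e.g. LibriSpeech:
cd v2/LibriSpeech
```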

You can see in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/prepareKaldidata_LibriSpeech.sh how the data preparation works. If you set the paths in https://github.com/BUTSpeechFIT/EEND_dataprep/blob/main/v2/LibriSpeech/path.sh to the directory where you have downloaded LibriSpeech and to your Kaldi directory, it should run and prepare the data directories needed to train the model. I am attaching a few files so that you can see what the files inside the data directory should look like.
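For reference, a Kaldi-style data directory typically contains the standard Kaldi files sketched below; this is a generic layout, and the attached dev_clean.zip shows the exact files this recipe produces:

```bash
# Generic sketch of a Kaldi-style data directory (standard Kaldi file names;
# the exact set of files depends on the recipe):
ls data/dev_clean
# wav.scp   -> <recording-id> <path or pipe producing a wav>
# segments  -> <utterance-id> <recording-id> <start-time> <end-time>
# utt2spk   -> <utterance-id> <speaker-id>
# spk2utt   -> <speaker-id> <utterance-id> [<utterance-id> ...]
# rttm      -> diarization reference, if the recipe produces one
```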

prepare_data_dir.sh generates the training data for the model based on simulated conversations. Inside DiaPer, process_data.py is an optional script in case you want to precompute the features and store them. However, that is not necessary, as the model can be trained if you pass the data directory with the simulated conversations.

dev_clean.zip
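Putting the pieces together, the overall order would be as sketched below (arguments omitted; the exact invocations are documented in each repository):

```bash
# Sketch of the overall ordering:
# 1) EEND_dataprep recipe (v1 or v2) -> Kaldi-style data directories
bash prepareKaldidata_LibriSpeech.sh

# 2) DiaPer's prepare_data_dir.sh -> training data with simulated conversations
bash prepare_data_dir.sh

# 3) Optional: precompute and store features; if skipped, training works
#    directly from the data directory with the simulated conversations
python process_data.py
```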

I hope this helps.