NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Telephone call recordings dataset #452

Closed: flassTer closed this issue 5 years ago

flassTer commented 5 years ago

Hello NVIDIA, thank you for this repository, it is very helpful! I am trying to construct a dataset based on call recordings.

The audio is in the form of .wav files. I was wondering whether you had any suggestions on the data annotation process. I have been using an online ASR API to help with the annotation, but some of the transcriptions are not 100% correct. These annotations are accompanied by the corresponding audio captions.

Should I keep the data pre-processing as you did for LibriSpeech, or do you think there are some parameters I should change? I am dealing with a lot of background noise and conversational speech. Would you recommend transfer learning on any of your pre-trained models?

Thanks a lot!

vsl9 commented 5 years ago

I think making it similar to LibriSpeech is a good starting point. Split all speech data into short utterances (for example, not longer than 16 seconds). Make sure that all ground truth transcriptions do not contain any characters outside of the default alphabet (a, ..., z, space, '). That is, normalize the text: convert all numbers and special characters (like `$`) to regular words. Use a 16kHz sample rate, single channel, 16 bits per sample, uncompressed (PCM) audio format. If the original sample rate was 8kHz (and you are not going to fine-tune a model pre-trained on LibriSpeech), then it may be worth keeping it. If your dataset is small, then fine-tuning might be beneficial.
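For illustration, a rough preprocessing sketch along these lines (the file paths and the `librosa`/`soundfile`/`num2words` dependencies are placeholders for whatever tools you prefer, not part of OpenSeq2Seq):

```python
# Sketch: convert call recordings to LibriSpeech-style training data.
import re
import librosa
import soundfile as sf
from num2words import num2words

def convert_audio(in_wav, out_wav, target_sr=16000):
    """Resample to target_sr, downmix to mono, and save as 16-bit PCM WAV."""
    audio, _ = librosa.load(in_wav, sr=target_sr, mono=True)
    sf.write(out_wav, audio, target_sr, subtype="PCM_16")

def normalize_transcript(text):
    """Lowercase, spell out digits, and drop characters outside a-z, space, '."""
    text = text.lower()
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    text = re.sub(r"[^a-z' ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Note: "$" is simply dropped here; a real pipeline would map it to "dollars".
print(normalize_transcript("Your balance is $42."))  # -> "your balance is forty two"
```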

flassTer commented 5 years ago

Thank you @vsl9. Your advice is very helpful. Would it be possible to do transfer learning on a pre-trained model you provided, but with different base parameters? For example, changing mixed precision training to float32.

blisc commented 5 years ago

If you pull the latest master, you should be able to load the pre-trained mixed precision model, and continue training it in float32. Please let us know if you run into any issues with it.
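For anyone following along, the change lives in the model's Python config file. A minimal sketch, assuming the parameter names used in the repo's example configs (the checkpoint path is a placeholder):

```python
# Relevant fragment of an OpenSeq2Seq config file; the remaining model,
# optimizer, and data layer parameters stay as in the original example config.
import tensorflow as tf

base_params = {
    # Train in full float32 instead of mixed precision.
    "dtype": tf.float32,
    # Directory holding the pre-trained (mixed-precision) checkpoint to
    # continue training from, rather than starting from scratch.
    "load_model": "checkpoints/pretrained_mixed_model",
    # ...
}
```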

flassTer commented 5 years ago

@blisc I will try doing that. So far, the issue I am running into is that when I train on LibriSpeech and then attempt transfer learning with another dataset, the message "Not enough steps for benchmarking" appears.

tayciryahmed commented 5 years ago

@gioannideImp The number of steps is calculated using the size of the current training data set, the batch size and the number of epochs. In this setup (fixed fine-tuning batch size and training size), you should probably increase the number of epochs.
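To make that concrete, a rough back-of-the-envelope calculation (the dataset size, batch size, and epoch count below are made up for the example; the actual threshold behind "Not enough steps for benchmarking" is set in the OpenSeq2Seq training loop):

```python
# Why a small fine-tuning dataset can yield too few training steps.
dataset_size = 2000   # utterances in the fine-tuning set
batch_size = 32
num_epochs = 5

steps_per_epoch = dataset_size // batch_size   # 62
total_steps = steps_per_epoch * num_epochs     # 310

print(total_steps)
# If total_steps falls below the benchmarking threshold, raising num_epochs
# (or lowering batch_size) increases it without changing the dataset.
```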