NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Speech features not using dataset statistics #422

Open jangalt opened 5 years ago

jangalt commented 5 years ago

When calculating the speech features for the speech2text models, OpenSeq2Seq computes a mean and stddev individually for each training sample. Much like batch normalization, it would be better to compute these statistics over the training set and reuse them at inference time, which should improve validation error. For example, consider a speech input with two speakers: during training you will probably slice it into separate utterances, and the mean/stddev will track each utterance individually. If you then run inference on the full recording, however, the mean/stddev are computed over both speakers together and will not match the statistics seen during training.

https://github.com/NVIDIA/OpenSeq2Seq/blob/bcdb76bb1221ebbef8eb2cd3cecaa491f801f54d/open_seq2seq/data/speech2text/speech_utils.py#L398-L400
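
For reference, a minimal sketch of the dataset-level alternative being proposed, assuming feature arrays of shape `[time, num_features]` (roughly what the feature extraction in `speech_utils.py` produces); the helper names here are illustrative and not part of OpenSeq2Seq:

```python
import numpy as np

def compute_dataset_stats(feature_arrays):
    """Accumulate a single global mean/std over all training features.

    feature_arrays: iterable of [time, num_features] NumPy arrays.
    """
    total, total_sq, count = 0.0, 0.0, 0
    for feats in feature_arrays:
        total += feats.sum()
        total_sq += np.square(feats).sum()
        count += feats.size
    mean = total / count
    # guard against tiny negative values from floating-point error
    std = np.sqrt(max(total_sq / count - mean ** 2, 0.0))
    return mean, std

def normalize_with_stats(features, mean, std, eps=1e-8):
    """Normalize one utterance with precomputed dataset statistics
    instead of its own per-utterance mean/std."""
    return (features - mean) / (std + eps)
```

The stats would be computed once over the training set and then applied unchanged at inference, so a long two-speaker recording is normalized the same way as the sliced training utterances.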

vsl9 commented 5 years ago

Thank you for the suggestion. Sure, exactly. We are aware of the limitations of the current signal preprocessing pipeline. While it is fine for LibriSpeech and short audio clips in general, inference on longer utterances (and streaming ASR) requires different normalization schemes. We are working on that.
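
One normalization scheme of the kind alluded to here (purely a sketch, not OpenSeq2Seq's implementation) is an exponential moving-average normalizer that updates its statistics frame by frame, so it can also be applied to streamed input:

```python
import numpy as np

class RunningNormalizer:
    """Exponential moving-average feature normalizer for streaming ASR.

    Instead of one per-utterance mean/std computed after the fact, the
    estimates are updated with every incoming frame.
    """
    def __init__(self, num_features, momentum=0.99, eps=1e-8):
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, frame):
        # frame: [num_features] feature vector for one time step
        self.mean = self.momentum * self.mean + (1 - self.momentum) * frame
        self.var = (self.momentum * self.var +
                    (1 - self.momentum) * np.square(frame - self.mean))
        return (frame - self.mean) / np.sqrt(self.var + self.eps)
```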

flassTer commented 5 years ago

@vsl9 So for long audio clips (for example, call recordings where the speech is conversational and sometimes slang is used), would you suggest breaking the audio file into smaller clips and transcribing each one individually? Or is there another approach you would recommend for long call-recording audio files? Thanks.
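
(Purely as an illustration of the "break it into smaller clips" idea, and not an official recommendation: a sketch that slices a long recording into overlapping fixed-length clips with librosa, assuming 16 kHz audio. Each clip would then be transcribed independently, and the overlapping transcripts stitched together afterwards.)

```python
import librosa

def split_into_chunks(wav_path, chunk_sec=15.0, overlap_sec=1.0, sr=16000):
    """Slice a long recording into overlapping fixed-length clips.

    Returns a list of 1-D NumPy arrays; the last clip may be shorter.
    """
    signal, sr = librosa.load(wav_path, sr=sr)
    chunk_len = int(chunk_sec * sr)
    hop = int((chunk_sec - overlap_sec) * sr)
    return [signal[start:start + chunk_len]
            for start in range(0, len(signal), hop)]
```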

Edit: Here I am asking for suggestions on improving WER for these kinds of audio files, where the speaker may repeat the same word or say words that do not make sense in context, so a language model may even be harmful. I am considering using only the acoustic model and then applying word prediction from the phonemes. Do you think this is a good idea?