NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

What is the melspectrogram generating pipeline used for the pretrained model? #243

Closed Robinysh closed 4 years ago

Robinysh commented 5 years ago

I tried to use the following script to generate my melspectrograms: https://github.com/NVIDIA/tacotron2/blob/131c1465b48be60cb5d3b8ab79cfc663e5c47b6a/data_utils.py#L37-L54 However, their scale doesn't seem right; the values are far too small. Although my audio is clearly audible, most of my melspectrogram is clipped away by the default clipping value in https://github.com/NVIDIA/tacotron2/blob/131c1465b48be60cb5d3b8ab79cfc663e5c47b6a/audio_processing.py#L78-L84 So I am wondering: is this really the code used to generate the melspectrograms for the pretrained model? I plan to fine-tune the pretrained model on my own dataset, so it would be best for my data to go through a similar preprocessing pipeline and have similar statistics. Thanks.

rafaelvalle commented 5 years ago

Check the range of your audio file using this:

from scipy.io.wavfile import read

# audio_path: path to one of your .wav files
sampling_rate, data = read(audio_path)
# Smallest and largest sample values in the file
print(data.min(), data.max())

My guess is that your min and max are within [-1, 1].
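
One likely reading of this hint, sketched below: the linked preprocessing code appears to divide the raw samples by a maximum wav value (assumed here to be 32768.0, the 16-bit PCM scale), so audio that is already stored as floats in [-1, 1] ends up about 32768 times too small before the mel spectrogram is computed. The helper `to_int16_scale` and the constant `MAX_WAV_VALUE` are hypothetical names used only for illustration; they are not part of the repository.

```python
import numpy as np

# Assumption: mirrors the repo's default normalization constant for 16-bit PCM.
MAX_WAV_VALUE = 32768.0


def to_int16_scale(data: np.ndarray) -> np.ndarray:
    """Return samples as float32 on the 16-bit PCM scale (hypothetical helper)."""
    if np.issubdtype(data.dtype, np.floating):
        # Float WAVs are conventionally in [-1, 1]; scale them up.
        return (data * MAX_WAV_VALUE).astype(np.float32)
    if data.dtype == np.int16:
        # Already on the expected scale; only the dtype changes.
        return data.astype(np.float32)
    raise ValueError(f"unhandled sample dtype: {data.dtype}")


# A float waveform in [-1, 1], as scipy.io.wavfile.read returns for float WAVs.
float_audio = np.array([0.0, 0.5, -0.5], dtype=np.float32)

# Dividing it by MAX_WAV_VALUE directly leaves amplitudes around 1e-5.
too_small = float_audio / MAX_WAV_VALUE

# Rescaling first restores the expected [-1, 1] range after normalization.
normalized = to_int16_scale(float_audio) / MAX_WAV_VALUE

print(np.abs(too_small).max(), np.abs(normalized).max())
```

If the printed range of your own file is within [-1, 1], either rescale the samples as above or convert the files to 16-bit PCM before running the preprocessing.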

rafaelvalle commented 4 years ago

Closing due to inactivity.