NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

FastPitch: How to properly calculate pitch mean, std, fmin and fmax given the pitch estimated in shape of [1xmel_frames]? #1036

Open yerzhan7orazayev opened 2 years ago

yerzhan7orazayev commented 2 years ago

Dear @alancucki ,

How do I properly calculate the pitch mean, std, fmin and fmax, given an estimated pitch of shape [1 x mel_frames]?

Yerzhan.

alancucki commented 2 years ago

Hi @yerzhan7orazayev ,

Sorry for the late reply. For the pitch mean and std, just calculate those statistics over all pitch values in all audio files in the dataset. As for fmin and fmax, keep the defaults for 22 kHz audio.
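A minimal sketch of what "over all pitch values in all audio files" could look like, assuming each utterance's pitch is already available as a flat sequence of F0 values in Hz. The function name `dataset_pitch_stats` and the `skip_unvoiced` flag are hypothetical and not part of the FastPitch code. Skipping zero-valued (unvoiced) frames by default is an assumption of this sketch, not something confirmed in this thread.

```python
import math
from typing import Iterable, Sequence, Tuple


def dataset_pitch_stats(
    pitch_tracks: Iterable[Sequence[float]], skip_unvoiced: bool = True
) -> Tuple[float, float]:
    """Return (mean, std) pooled over every pitch value of every utterance.

    Assumption: unvoiced frames are stored as 0.0 and are skipped by default.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for track in pitch_tracks:
        for f0 in track:
            if skip_unvoiced and f0 == 0.0:
                continue
            count += 1
            delta = f0 - mean
            mean += delta / count
            m2 += delta * (f0 - mean)
    if count == 0:
        raise ValueError("no pitch values found")
    return mean, math.sqrt(m2 / count)  # population std (divides by N)


# Two toy utterances of different lengths; zeros mark unvoiced frames.
tracks = [[0.0, 100.0, 110.0, 0.0], [200.0, 0.0, 190.0]]
mean, std = dataset_pitch_stats(tracks)
mean_with_zeros, std_with_zeros = dataset_pitch_stats(tracks, skip_unvoiced=False)
```

The values are pooled across files rather than averaged per file, so longer utterances contribute more frames. Including the zeros pulls the mean down and inflates the std, which is why the flag is exposed.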

jinny1208 commented 2 years ago

@alancucki Do you mind elaborating on the procedure for calculating the pitch mean and std over the entire dataset? What if the dataset has a mixture of female and male speakers? Does using a single pitch mean and std still work?

Also, is there a particular standard for specifying fmin and fmax for different sampling rates? For example, I have a dataset sampled at 16 kHz. I used the default 22 kHz fmin and fmax for it and didn't hear much of a difference (I could be wrong), so I was wondering how fmin and fmax were chosen.

Thanks in advance

JohnHerry commented 2 years ago

fmin and fmax can be derived from the sampling rate of the audio samples. When you train on 16 kHz samples, your (fmin, fmax) should be (0, 8000), since 8000 Hz is the Nyquist frequency (half the sampling rate). To reduce the effect of noise in the samples, you can tune those two values, e.g. (40, 7200) for a male speaker or (60, 7800) for a female speaker.
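The arithmetic behind those numbers can be sketched as follows. `mel_band_limits` and its parameters are hypothetical names used only for illustration; the only fact relied on is that the upper bound cannot exceed half the sampling rate.

```python
from typing import Tuple


def mel_band_limits(
    sampling_rate: int, low_cut: float = 0.0, high_margin: float = 0.0
) -> Tuple[float, float]:
    """Return (fmin, fmax) in Hz for a given sampling rate.

    low_cut raises fmin above 0 Hz; high_margin lowers fmax below Nyquist.
    Both are illustrative tuning knobs, not FastPitch options.
    """
    nyquist = sampling_rate / 2.0  # highest frequency the audio can represent
    fmin = float(low_cut)
    fmax = nyquist - high_margin
    if not 0.0 <= fmin < fmax <= nyquist:
        raise ValueError("need 0 <= fmin < fmax <= sampling_rate / 2")
    return fmin, fmax


full_band = mel_band_limits(16000)            # (0.0, 8000.0)
trimmed = mel_band_limits(16000, 40.0, 800.0)  # (40.0, 7200.0)
```

Trimming the band on either side is a judgment call that depends on the recordings; the speaker-specific values quoted above are the commenter's suggestions, not derived quantities.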

As for pitch-mean and pitch-std, I am also waiting for an answer to your question.

I also have two other questions. First, should the F0 sequence used to compute pitch-mean and pitch-std include zero values? The zeros come from unvoiced segments. Second, I see these arguments for computing pitch in the code:

import librosa

# wav: placeholder for the 1-D audio array that the real call receives
librosa.pyin(
    wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), frame_length=1024
)

The values of fmin, fmax and frame_length are not identical to those in the mel-spectrogram config. Is that still OK if I changed the mel-spectrogram arguments before training?
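For reference, the note names passed to `librosa.pyin` correspond to frequencies in Hz. The sketch below reproduces that conversion with the standard library only, assuming equal temperament with A4 = 440 Hz and MIDI numbering in which C4 = 60. `c_note_to_hz` is a hypothetical helper, not a librosa function.

```python
def c_note_to_hz(octave: int) -> float:
    """Frequency in Hz of the note C in the given octave (C4 is middle C)."""
    midi = 12 * (octave + 1)  # C-1 is MIDI 0, so C2 = 36 and C7 = 96
    return 440.0 * 2.0 ** ((midi - 69) / 12)


pyin_fmin = c_note_to_hz(2)  # roughly 65.4 Hz
pyin_fmax = c_note_to_hz(7)  # roughly 2093 Hz
```

In `librosa.pyin`, fmin and fmax bound the range in which the fundamental frequency is searched, so they describe a different quantity from the fmin and fmax of a mel filterbank. Whether the two sets of settings need to agree in FastPitch is the open question here.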