fishaudio / fish-diffusion

An easy to understand TTS / SVS / SVC framework
https://diff.fish.audio
MIT License
635 stars, 81 forks

Type of pitches during the training of nsf_hifigan #115

Closed (wblgers closed this issue 1 year ago)

wblgers commented 1 year ago

Hi,

Thanks for sharing your work. I want to figure out what type of input pitch is used during the training of nsf_hifigan. Is it a continuous pitch, or the raw pitch extracted from Parselmouth?

Thanks!

leng-yue commented 1 year ago

It's extracted from Parselmouth and then upsampled to the audio length.
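
For readers following along, here is a minimal sketch of what upsampling frame-level pitch to audio length can look like. It is not the fish-diffusion code: `upsample_f0` and `hop_length` are hypothetical names, and the sketch assumes a fixed hop size and simple nearest-neighbour repetition, which may differ from what the repository actually does.

```python
import numpy as np

def upsample_f0(f0_frames, hop_length, audio_len):
    """Repeat each frame-level F0 value hop_length times, then fit to audio_len.

    Hypothetical helper for illustration only (assumption, not the repo's API).
    """
    f0_frames = np.asarray(f0_frames, dtype=np.float32)
    # One F0 value per frame becomes hop_length identical values per frame.
    f0_audio = np.repeat(f0_frames, hop_length)
    if len(f0_audio) < audio_len:
        # Extend with the last value (or 0 if there were no frames at all).
        pad_value = f0_audio[-1] if len(f0_audio) else 0.0
        f0_audio = np.pad(
            f0_audio, (0, audio_len - len(f0_audio)), constant_values=pad_value
        )
    # Trim any overshoot so the result matches the waveform length exactly.
    return f0_audio[:audio_len]

# Three frames, hop of 4 samples, waveform of 10 samples.
f0 = upsample_f0([220.0, 0.0, 440.0], hop_length=4, audio_len=10)
```

Note that the zero (unvoiced) frame stays zero after upsampling; whether zeros are kept or filled in is a separate choice.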

wblgers commented 1 year ago

Oh, let me clarify my question. On line 75 of tools/diffusion/inference.py:

    if pitches is None:
        pitches = self.pitch_extractor(audio, sr, pad_to=mel_len).float()

As I understand it, pad_to=mel_len means that the zero pitches are removed and linearly interpolated. Is the same preprocessing applied during the training of nsf_hifigan?

leng-yue commented 1 year ago

It depends on whether you enable keep_zeros in the pitch extractor or not...
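
To make the two behaviours concrete, here is a rough sketch rather than the repository's extractor. Only the names `keep_zeros` and `pad_to` come from this thread; `postprocess_f0` is a hypothetical helper. It assumes that unvoiced frames are reported as 0 Hz and that `pad_to` simply zero-pads or trims to the target length, which may not match the real implementation.

```python
import numpy as np

def postprocess_f0(f0, pad_to=None, keep_zeros=True):
    """Illustrative F0 post-processing (assumption, not fish-diffusion code)."""
    f0 = np.asarray(f0, dtype=np.float64)
    if pad_to is not None:
        # Assumed behaviour: zero-pad short contours, trim long ones.
        if len(f0) < pad_to:
            f0 = np.pad(f0, (0, pad_to - len(f0)))
        f0 = f0[:pad_to]
    if keep_zeros:
        # Raw contour: unvoiced frames stay at 0 Hz.
        return f0
    voiced = f0 > 0
    if not voiced.any():
        # Nothing to interpolate from, so return the all-zero contour.
        return f0
    idx = np.arange(len(f0))
    # Fill unvoiced frames by linear interpolation between voiced neighbours;
    # np.interp holds the first/last voiced value at the edges.
    return np.interp(idx, idx[voiced], f0[voiced])

raw = [0.0, 100.0, 0.0, 0.0, 130.0, 0.0]
kept = postprocess_f0(raw, pad_to=8, keep_zeros=True)
filled = postprocess_f0(raw, pad_to=8, keep_zeros=False)
```

Here `kept` still contains the zeros, while `filled` is a continuous contour. A vocoder trained on one kind of input and fed the other sees a different pitch signal in unvoiced regions.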

wblgers commented 1 year ago

Oh, I made a mistake. Yes, it depends on keep_zeros when constructing the pitch_extractor.

  1. Since a pretrained nsf_hifigan vocoder is provided, do you remember whether keep_zeros was enabled when it was trained?
  2. Have you compared vocoder performance between training with keep_zeros enabled and disabled?
  3. Will a mismatch in keep_zeros between the acoustic model and the vocoder lead to performance degradation?

leng-yue commented 1 year ago

keep_zeros is generally better and thus we enabled it.

wblgers commented 1 year ago

Thanks for your explanation!