cjerry1243 / TransferLearning-CLVC

Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion

trained ppg2mel model cannot synthesize voice #4

Open JohnHerry opened 3 years ago

JohnHerry commented 3 years ago

Thanks for your good work on VC. I have tried to train the ppg2mel model and iterated over 50K steps. I did not train the WaveGlow vocoder; instead, I used the traditional Griffin-Lim vocoder [in the convert_speech_vcc.py source file] with code like:

wave = librosa.feature.inverse.mel_to_audio(mel.cpu().numpy().astype(np.float32)[0])
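Expanded into a self-contained sketch, that fallback looks roughly like this (the sr, n_fft, and hop_length values below are illustrative assumptions, not values read from the repo; they must match whatever prepare_h5.py used when extracting the training mels):

import numpy as np
import librosa
import soundfile as sf

# mel: torch tensor of shape [1, n_mels, T], possibly fp16 if trained with apex
mel_np = mel.cpu().numpy().astype(np.float32)[0]  # librosa wants float32/float64

# NOTE (assumption): mel_to_audio expects a power-scale mel spectrogram;
# if the converter outputs log-mels, you may need np.exp(mel_np) first.
wave = librosa.feature.inverse.mel_to_audio(
    mel_np, sr=24000, n_fft=1024, hop_length=300)
sf.write("converted_gl.wav", wave, 24000)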

Here I convert the predicted mel from float16 to float32, because the librosa function does not seem to accept the np.float16 data type. The convert command is like this:

python convert_speech_vcc.py -vcc "vcc2020_evaluation" -ch "CLVC/ckpt/checkpoint_98000.pt" -m "ppg/trace512_uni_77_epoch-142_feature.pth" -wg unknown -o output

But the converted wave is a mess; I can hear nothing intelligible.

Is this because the output mel is special, so that it can only be decoded to a waveform with the WaveGlow vocoder in this project? Or is it something else?

I noticed that in data preprocessing (the prepare_h5.py source file), the training mel data come from 24 kHz samples, while the PPG features are extracted from 16 kHz samples downsampled from the 24 kHz originals. Is this the reason for my problem?
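(For what it's worth, the two streams can still be frame-aligned despite the different sample rates if the hop sizes are chosen accordingly; a quick check, with hop values that are illustrative assumptions rather than values read from the repo:)

# frames per second = sample_rate / hop_length
mel_fps = 24000 / 300   # e.g. hop 300 at 24 kHz -> 80 frames/s
ppg_fps = 16000 / 200   # e.g. hop 200 at 16 kHz -> 80 frames/s
print(mel_fps == ppg_fps)  # True: same frame rate, so the sequences can line up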

Thanks!

JohnHerry commented 3 years ago

[three screenshots of the training progress]

As shown, the training process is not going well. I think there may be some error in the preprocessing before training.

yemaozi88 commented 3 years ago

Hi John,

I also had trouble training the conversion model. In my case, the reason was that the mask was not correctly created in parse_output() in model.py. The following code fixed the issue.

# mask = ~get_mask_from_lengths(output_lengths)  # original line
mask = get_mask_from_lengths(output_lengths)
a = ~mask.bool()  # invert logically on a bool tensor, not bitwise on uint8
mask = a.byte()   # cast back to uint8, as the rest of the code expects
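For context, a minimal demonstration of the pitfall this fix works around (assuming the mask comes back as uint8, as in older Tacotron2-style code; on uint8 tensors, recent PyTorch versions treat ~ as bitwise NOT rather than logical NOT):

import torch

m = torch.tensor([1, 0, 1], dtype=torch.uint8)
print(~m)                  # tensor([254, 255, 254], dtype=torch.uint8) -- garbage mask
print((~m.bool()).byte())  # tensor([0, 1, 0], dtype=torch.uint8) -- intended inversion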

If that is not the case, please check TensorBoard:

  • Were your training loss and validation loss decreasing?
  • Did 'mel_predicted' under the Image tab show a correct mel spectrogram?

If both are yes, the problem is very likely the vocoder.

Good luck!

Aki

zhangxinaaaa commented 3 years ago

Hi, I met the same problem, and I'm retrying with your method. By the way, what's your current version of PyTorch? Mine is 1.6; maybe a version mismatch caused the problem?

yemaozi88 commented 3 years ago

I am using torch==1.4. The original code suggested installing torch==1.1, but then apex, which is needed for fp16, cannot be installed. I will check torch==1.1 to be sure.
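For reference, the environment pin I used looks roughly like this (a sketch under my assumptions; apex's CUDA-extension build needs a matching CUDA toolkit, so this falls back to the Python-only install):

pip install torch==1.4.0
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir ./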

zhangxinaaaa commented 3 years ago

Hi, thanks for your advice; it seems that the converter was trained successfully this time.

[three screenshots of the training progress]

However, the samples I inferred are still unintelligible. I guess the vocoder I trained before may be wrong. Can you provide the vocoder you trained? Thanks for your time!

cjerry1243 commented 3 years ago

I would recommend training the vocoder first, before you train the converter. Normally it takes a few days for the vocoder to output decent voices. You can check that before you move on to training the converter.

zhangxinaaaa commented 3 years ago

Hi, when I run convert_speech_vcc.py, the trained converter can generate a near-perfect mel spectrogram from the training data, but for evaluation data the generated mel spectrogram seems wrong.

Mel spectrogram from training data: [screenshot]

Mel spectrogram from evaluation data: [screenshot]

Here are the converter checkpoint_49000.pt I trained and the hyperparameters hparams_spk.py I used. I also modified model.py as yemaozi88 said. Can you help me check what is wrong? Thanks!!

cjerry1243 commented 3 years ago

It looks like your model has only learned to predict the mel frame from the previous step. You can check the intermediate tensors in the model to verify this. A possible way to track this down is to look at the input and output sequences in your processed data: do they have the same length? Or is the audio too noisy, so that the PPG is not correctly recognized?
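A minimal sanity check along those lines, assuming the preprocessed features sit in an HDF5 file with per-utterance groups holding 'ppg' and 'mel' datasets (the file name and key layout here are assumptions; adapt them to how prepare_h5.py actually stores the data):

import h5py

with h5py.File("features.h5", "r") as f:
    for utt in list(f.keys())[:10]:  # spot-check a few utterances
        ppg = f[utt]["ppg"][:]
        mel = f[utt]["mel"][:]
        # input (PPG) and output (mel) frame counts should match, or differ
        # by a fixed ratio; drifting lengths make the alignment hard to learn
        print(utt, ppg.shape, mel.shape)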