JohnHerry opened this issue 3 years ago
As shown, the training process is not going well. I think there may be some error in the preprocessing before training.
Hi John,
I also had trouble training the conversion model. In my case, the mask was not created correctly in parse_output() in model.py. The following change fixed the issue:
#mask = ~get_mask_from_lengths(output_lengths)
mask = get_mask_from_lengths(output_lengths)
a = ~mask.bool()
mask = a.byte()
If that is not the case, please check TensorBoard:
- Were your training loss and validation loss decreasing?
- Does 'mel_predicted' under the Images tab show a correct mel spectrogram?
If both are yes, the problem is very likely the vocoder.
Good luck!
Aki
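For anyone else hitting this, here is a minimal sketch of what the fix does. The `get_mask_from_lengths` below is a hypothetical re-implementation (Tacotron2-style, returning True for valid frames), not the project's exact code; the point is that the corrected lines produce a mask where 1 marks *padded* frames:

```python
import torch

def get_mask_from_lengths(lengths):
    # Hypothetical helper: True for valid (non-padded) positions.
    max_len = int(lengths.max())
    ids = torch.arange(max_len, device=lengths.device)
    return ids.unsqueeze(0) < lengths.unsqueeze(1)  # (batch, max_len)

output_lengths = torch.tensor([3, 5])
mask = get_mask_from_lengths(output_lengths)
# The fix: invert the valid-frame mask so 1 marks padding, then cast to byte.
padding_mask = (~mask.bool()).byte()
```

With lengths [3, 5], `padding_mask` is [[0, 0, 0, 1, 1], [0, 0, 0, 0, 0]], i.e. only the two padded frames of the shorter utterance are flagged.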
Hi, I met the same problem, and I'm retrying with your method. By the way, which version of PyTorch are you using? Mine is 1.6; maybe a version mismatch caused the problem?
I am using torch==1.4. The original code suggests installing torch==1.1, but then apex, which is needed for fp16, cannot be installed. I will check torch==1.1 to be sure.
Hi, thanks for your advice; it seems the converter trained successfully this time.
However, the samples I inferenced are still unintelligible. I guess the vocoder I trained before is wrong. Could you provide the vocoder you trained? Thanks for your time!
I would recommend training the vocoder first, before you train the converter. Normally it takes a few days for the vocoder to output decent voices. Check that before you move on to training the converter.
Hi, when I run convert_speech_vcc.py, the trained converter generates a perfect mel spectrogram from the training data, but for evaluation data the generated mel spectrogram seems wrong.
Mel spectrum from training data
Mel spectrum from evaluation data
Here is the converter checkpoint_49000.pt I trained and the parameters hparams_spk.py I used. I also modified model.py as yemaozi88 suggested. Can you help me check what is wrong? Thanks!!
It looks like your model only learns to predict the mel of the current frame from the previous frame. You can check the intermediate tensors in the model to verify this. A possible place to look is the input and output sequences in your processed data: do they have the same length? Or is the audio too noisy, so that the PPG is not recognized correctly?
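A quick way to run that sanity check is to compare the PPG and mel frame counts per utterance in the processed data. This is just a sketch; the array shapes and tolerance are assumptions, not the project's actual feature layout:

```python
import numpy as np

def check_alignment(ppg, mel, tol=2):
    """ppg: (T_in, ppg_dim); mel: (n_mels, T_out). Frame counts should match."""
    t_in, t_out = ppg.shape[0], mel.shape[1]
    if abs(t_in - t_out) > tol:
        raise ValueError(f"length mismatch: PPG has {t_in} frames, mel has {t_out}")
    return t_in, t_out

# Dummy arrays standing in for one utterance's features:
ppg = np.zeros((160, 144))   # 160 frames of 144-dim PPGs
mel = np.zeros((80, 160))    # 80 mel bands x 160 frames
check_alignment(ppg, mel)    # passes: both streams have 160 frames
```

If many utterances fail a check like this, the model can only learn the trivial copy-previous-frame solution.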
Thanks for your good work on VC. I have trained the ppg2mel model for over 50K steps. I did not train the WaveGlow vocoder; instead, I used the traditional Griffin-Lim vocoder in convert_speech_vcc.py, with code like:
wave = librosa.feature.inverse.mel_to_audio(mel.cpu().numpy().astype(np.float32)[0])
Here I convert the predicted mel from float16 to float32, because the librosa function does not seem to accept the np.float16 data type. The convert command was:
python convert_speech_vcc.py -vcc "vcc2020_evaluation" -ch "CLVC/ckpt/checkpoint_98000.pt" -m "ppg/trace512_uni_77_epoch-142_feature.pth" -wg unknown -o output
But the converted wave is a mess; I can hear nothing.
Is this because the output mel is in a special format, so that it can only be decoded to a waveform with the WaveGlow vocoder in this project? Or is it something else?
I also noticed something in data preprocessing (the prepare_h5.py source file): the training mel data are computed from 24 kHz samples, while the PPG features are extracted from 16 kHz samples downsampled from the 24 kHz originals. Could this be the reason for my problem? Thanks!
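On the 24 kHz / 16 kHz question: differing sample rates are not necessarily a problem, as long as the two feature streams share the same frame rate, i.e. hop_length / sample_rate is equal on both sides. A quick back-of-the-envelope check (the hop sizes below are illustrative assumptions, not the project's actual settings):

```python
# Mels at 24 kHz and PPGs at 16 kHz stay frame-aligned if both use
# the same frame period, e.g. 12.5 ms:
sr_mel, hop_mel = 24000, 300   # 300 / 24000 = 12.5 ms per frame
sr_ppg, hop_ppg = 16000, 200   # 200 / 16000 = 12.5 ms per frame

duration = 2.0  # seconds of audio
n_mel_frames = int(duration * sr_mel) // hop_mel
n_ppg_frames = int(duration * sr_ppg) // hop_ppg
print(n_mel_frames, n_ppg_frames)  # both 160, so the sequences line up
```

If the two hop periods differ, the PPG and mel sequences for the same utterance will have different lengths, which matches the misalignment symptoms discussed above.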