Closed fann1993814 closed 5 years ago
@fann1993814 That's odd.
Are you using synthesize.py for synthesis, or are you using the C++ library (e.g. using WaveRNNVocoder module)?
synthesize.py uses the same code as train.py. One possible difference is that synthesize calls model.train(False), which may change the behavior of some layers. Could you comment out that line and try again?
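For context, a minimal sketch of why model.train(False) can matter: it is an alias of eval() and switches layers such as Dropout and BatchNorm from their training behavior to their inference behavior, which changes the numbers they produce. (This toy example uses a plain Dropout layer, not the repo's actual model.)

```python
import torch
import torch.nn as nn

# A standalone layer whose behavior depends on train/eval mode.
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train(True)        # training mode: elements are randomly zeroed,
out_train = layer(x)     # survivors are scaled by 1/(1-p) = 2.0

layer.train(False)       # eval mode: dropout becomes the identity
out_eval = layer(x)

assert torch.equal(out_eval, x)                      # identity in eval mode
assert set(out_train.flatten().tolist()) <= {0.0, 2.0}  # dropped or scaled
```

So a model containing such layers can sound different depending on whether train(False) was called before generate().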
Hi @geneing,
I use both synthesis methods (C++ inference and torch inference), and I have already commented out model.train(False), but the audio synthesized by loading the checkpoint still sounds worse.
(My dataset's sample rate is 22050 Hz, and I don't use preemphasis.)
My synthesis code is below:
import time
import numpy as np
import librosa
import WaveRNNVocoder

# C++ inference path
vocoder = WaveRNNVocoder.Vocoder()
# model.bin is converted from checkpoint_step000800000.pth
vocoder.loadWeights('model_outputs/model.bin')

fname = 'test_0_mel.npy'
mel = np.load(fname).astype('float32')
mel0 = mel.copy()

start = time.time()
wav = vocoder.melToWav(mel0)
print(time.time() - start)
librosa.output.write_wav('test_orig.wav', wav, 22050)
# PyTorch inference path, using the same checkpoint
device = 'cuda'
latest_checkpoint = 'checkpoints/checkpoint_step000800000.pth'
model = build_model().to(device)
checkpoint = torch.load(latest_checkpoint, map_location=device)
model.load_state_dict(checkpoint["state_dict"])

mel1 = mel.copy()
start = time.time()
output = model.generate(mel1, batched=False)
print(time.time() - start)
librosa.output.write_wav('test_orig_torch.wav', output, 22050)
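Since the complaint is about volume and noise, a quick way to make the comparison objective is to measure RMS levels of the two outputs instead of listening. The helpers below are hypothetical (not part of the repo); you could load 'test_orig.wav' and 'test_orig_torch.wav' with librosa.load(..., sr=22050) and pass the arrays in.

```python
import numpy as np

def rms(wav):
    """Root-mean-square level of a waveform array."""
    wav = np.asarray(wav, dtype=np.float32)
    return float(np.sqrt(np.mean(wav ** 2)))

def compare_levels(wav_a, wav_b):
    """Return the RMS of both signals over their common length."""
    n = min(len(wav_a), len(wav_b))
    return rms(wav_a[:n]), rms(wav_b[:n])

# Example with synthetic signals: the second is 6 dB quieter.
level_a, level_b = compare_levels(np.ones(4), 0.5 * np.ones(6))
print(level_a, level_b)  # 1.0 0.5
```

A consistently lower RMS from the checkpoint path would confirm the "smaller volume" observation numerically.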
In the model_outputs directory, I added a mel input file that matches the model.bin training parameters. The wav file is what I get back.
Since your dataset is 22050 Hz, have you adjusted the hparams file? In particular hop_size, win_size, sample_rate, and upsample_factors, which are all related.
Yes, I already adjusted the hparams file.
num_mels=80,
fmin=125,
fmax=7600,
n_fft=1024,
hop_size=256,
win_size=1024,
sample_rate=22050,
upsample_factors=(4, 4, 16),
Hi @geneing, I used preemphasis for training, and the problem is solved. I also found that the "-ffast-math" compiler option in the build script causes NaN errors in the C++ inference; I guess it optimizes some operations by relaxing floating-point precision. If removing the option doesn't affect inference speed, I'd recommend dropping it.
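For readers unfamiliar with the preemphasis step mentioned above, a common definition used in Tacotron/WaveRNN pipelines is the first-order filter pair below. The 0.97 coefficient is an assumption for illustration; check the preemphasis value in this repo's hparams.

```python
import numpy as np
from scipy.signal import lfilter

def preemphasis(wav, k=0.97):
    # y[n] = x[n] - k * x[n-1]: boosts high frequencies before training
    return lfilter([1.0, -k], [1.0], wav)

def inv_preemphasis(wav, k=0.97):
    # Inverse filter applied after synthesis to undo the boost
    return lfilter([1.0], [1.0, -k], wav)

# Round trip recovers the original signal
x = np.random.RandomState(0).randn(1000)
assert np.allclose(inv_preemphasis(preemphasis(x)), x)
```

If preemphasis is applied to training targets, inv_preemphasis must be applied to the synthesized audio, otherwise the output will sound dull or mismatched in level.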
Hi @geneing, thanks for your repo. I found a weird phenomenon. The model generates audio samples during training; I then used the same checkpoint to synthesize audio and compared the two outputs (same checkpoint and same test mel file).
The audio synthesized by loading the checkpoint sounds worse than the audio saved during training: the volume is lower and the background noise is higher.