Closed aayushkubb closed 3 years ago
If you trained deep voice 3 with the default configuration, the mel_outputs returned by the model is a mel spectrogram downsampled by the decoder (downsample factor = 4).
In deep voice 3, the converter upsamples the mel spectrogram and converts it into linear_outputs, which is used as the input of the Griffin-Lim vocoder. You can check the shapes of mel_outputs and linear_outputs: linear_outputs has more time steps than mel_outputs. (Both have a BCT layout, B: batch size, C: channels, T: time steps.)
So my advice is to use linear_outputs and transform it into the mel spectrogram required by waveflow.
Note that deep voice 3's linear_outputs has its range in [0, 1), which is the result of dB scaling and some normalization. You can invert these procedures to get the original spectrogram S, and convert it into a mel spectrogram with librosa. Then dB-scale and normalize it again into the range [0, 1).
Then it meets the needs of waveflow (a mel spectrogram in range [0, 1) that is not downsampled).
Maybe you can check the preprocessing of deep voice 3 (Transform in examples/deepvoice3/data.py) or of waveflow.
# STFT
D = librosa.stft(y=y,
                 n_fft=self.n_fft,
                 win_length=self.win_length,
                 hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
amplitude_min = np.exp(self.min_level_db / 20 * np.log(10))  # 1e-5
S_norm = 20 * np.log10(np.maximum(amplitude_min,
                                  S)) - self.ref_level_db
S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
S_norm = self.max_norm * S_norm
if self.clip_norm:
    S_norm = np.clip(S_norm, 0, self.max_norm)
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(S=S,
                                       n_mels=self.n_mels,
                                       fmin=self.fmin,
                                       fmax=self.fmax,
                                       power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
                                 S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
S_mel_norm = self.max_norm * S_mel_norm
if self.clip_norm:
    S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)
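The inversion suggested above can be sketched as a round trip through the dB normalization. This is only a minimal sketch, not the Parakeet code: the function names normalize_db and denormalize_db are hypothetical, and the values min_level_db = -100, ref_level_db = 20 and max_norm = 1 are assumptions picked for illustration.

```python
import numpy as np

# Assumed values for illustration only; read the real ones from your config.
MIN_LEVEL_DB = -100.0
REF_LEVEL_DB = 20.0
MAX_NORM = 1.0


def normalize_db(S, min_level_db=MIN_LEVEL_DB, ref_level_db=REF_LEVEL_DB,
                 max_norm=MAX_NORM):
    """Map a linear-amplitude spectrogram to [0, max_norm] via dB scaling."""
    amplitude_min = 10.0 ** (min_level_db / 20.0)
    S_db = 20.0 * np.log10(np.maximum(amplitude_min, S)) - ref_level_db
    S_norm = max_norm * (S_db - min_level_db) / (-min_level_db)
    return np.clip(S_norm, 0.0, max_norm)


def denormalize_db(S_norm, min_level_db=MIN_LEVEL_DB, ref_level_db=REF_LEVEL_DB,
                   max_norm=MAX_NORM):
    """Invert normalize_db: from [0, max_norm] back to linear amplitude."""
    S_db = np.clip(S_norm, 0.0, max_norm) / max_norm * (-min_level_db) + min_level_db
    return 10.0 ** ((S_db + ref_level_db) / 20.0)


S = np.array([[1e-3, 0.1], [1.0, 5.0]])
S_back = denormalize_db(normalize_db(S))
print(np.allclose(S_back, S))
```

The round trip is exact only for amplitudes that were not clipped during normalization; anything clipped to 0 or max_norm comes back as the corresponding boundary amplitude.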
We are also now working on combining the TTS models with our neural vocoders.
Hey,
Thanks for the response. I tried what you suggested:
wav_mel, linear_outputs, attn = eval_model(dv3, text, replace_pronounciation_prob,
                                           min_level_db, ref_level_db, power,
                                           n_iter, win_length, hop_length,
                                           preemphasis, mel_only=mel_only)
# reshaping it
linear_outputs_np = linear_outputs.numpy()[0].T  # (C, T)
# Denormalize and scaling
denoramlized = np.clip(linear_outputs_np, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denoramlized + ref_level_db) / 20 * np.log(10))
# get mel spec
S_mel = librosa.feature.melspectrogram(S=lin_scaled,
                                       n_mels=config['transform']['n_mels'],
                                       fmin=config['transform']['fmin'],
                                       fmax=config['transform']['fmax'],
                                       power=1.)
# get config values
max_norm = config['transform']['max_norm']
amplitude_min = np.exp(min_level_db / 20 * np.log(10))  # 1e-5
# db scale again
S_mel = 20 * np.log10(np.maximum(amplitude_min,
                                 S_mel)) - ref_level_db
# Normalize again
S_mel_norm = (S_mel - min_level_db) / (-min_level_db)
S_mel_norm = max_norm * S_mel_norm
# clip again to 0,1
if config['transform']['clip_norm']:
    S_mel_norm = np.clip(S_mel_norm, 0, 1)
# reshape
a, b = S_mel_norm.shape
S_mel_norm = S_mel_norm.reshape(1, a, b)
# Convert to fluid type
S_mel_norm = dg.to_variable(S_mel_norm)
# pass the mel to waveflow
wav, start_time, syn_time = waveflow_model.infer(S_mel_norm)
wav = wav[0]
wav_time = wav.shape[0] / waveflow_config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(wav_time,
                                                        syn_time))
# Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
wav = wav.numpy().astype("float32") * 32768.0
wav = wav.astype('int16')
sample_rate = waveflow_config.sample_rate
plot_alignment(
    attn,
    os.path.join(synthesis_dir,
                 "test_{}_step_{}.png".format(idx, iteration)))
sf.write(
    os.path.join(synthesis_dir,
                 "test_{}_step{}.wav".format(idx, iteration)),
    wav, sample_rate)
But I am still getting noise in the output. Can you point out where exactly I am going wrong?
Does the attn plot look like a diagonal line (as expected)?
Also, make sure that after calling load_parameters on a deep voice 3 model you remove weight norm from every WeightNormWrapped layer; this step is necessary.
Hey,
Yes, I am removing the weight norm from every WeightNormWrapped layer, and the attn plot looks decently diagonal.
I made a couple of changes, as I was not using the correct mel extraction:
wav_mel, linear_outputs, attn = eval_model(dv3, text, replace_pronounciation_prob,
                                           min_level_db, ref_level_db, power,
                                           n_iter, win_length, hop_length,
                                           preemphasis, mel_only=mel_only)
linear_outputs_np = linear_outputs.numpy()[0].T # (C, T)
denoramlized = np.clip(linear_outputs_np, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denoramlized + ref_level_db) / 20 * np.log(10))
After extracting the linear outputs from deepvoice3, I convert them to numpy, then denormalize and scale the linear features.
Secondly (this is the part I changed compared to my previous code), I extract the mel features here:
# get mel spec
mel_filter_bank = librosa.filters.mel(sr=sample_rate,
                                      n_fft=waveflow_config.fft_size,
                                      n_mels=waveflow_config.mel_bands,
                                      fmin=waveflow_config.mel_fmin,
                                      fmax=waveflow_config.mel_fmax)
Now I take the dot product of mel_filter_bank with the magnitude spectrogram (lin_scaled), similar to what waveflow does (https://github.com/PaddlePaddle/Parakeet/blob/develop/examples/waveflow/data.py#L51),
after raising the magnitude to the given power (changing the power made no audible difference):
mel = np.dot(mel_filter_bank, np.abs(lin_scaled)**power)
Once I have the mels, I normalize them, reshape them, and convert them to a fluid variable to make them usable:
# Normalize mel.
clip_val = 1e-5
ref_constant = 1
mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)
# reshape
a, b = mel.shape
S_mel_norm = mel.reshape(1, a, b)
# Convert to fluid type
S_mel_norm = dg.to_variable(S_mel_norm)
#pass the mel to waveflow
wav, start_time,syn_time = waveflow_model.infer(S_mel_norm)
wav = wav[0]
wav_time = wav.shape[0] / waveflow_config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(wav_time,
syn_time))
# Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
wav = wav.numpy().astype("float32") * 32768.0
wav = wav.astype('int16')
sample_rate = waveflow_config.sample_rate
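A side note on the int16 conversion in the snippet above: int16 can only hold values from -32768 to 32767, so a sample of exactly +1.0 scaled by 32768 is not representable. A minimal sketch of a clipped conversion follows; the helper name float_to_int16 is hypothetical and this is not claimed to be what Parakeet does.

```python
import numpy as np


def float_to_int16(wav):
    """Convert float audio in [-1, 1] to int16 without wrapping.

    Samples outside [-1, 1] are clipped first, then scaled by 32767 so that
    both +1.0 and -1.0 stay inside the int16 range.
    """
    wav = np.clip(np.asarray(wav, dtype=np.float32), -1.0, 1.0)
    return np.round(wav * 32767.0).astype(np.int16)


pcm = float_to_int16([-1.0, 0.0, 1.0, 1.5])
print(pcm.dtype, pcm.tolist())
```

This only guards against out-of-range samples; it does not change the loudness of a quiet waveform.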
Now I am able to get a clear output; however, the volume is pretty low.
Have a look here: Audio sample 1: https://sndup.net/4673 Its attn plot:
Audio sample 2: https://sndup.net/8b79 Its attn plot:
I am using the default configs only. However, by changing the dB scale values, like min_level_db: -100 ref_level_db: 20,
I can get louder wav output, but this adds noise to some audio, and at times the audio is fully noisy. Example: https://sndup.net/55nw
Can you please help me find which parameter may be going wrong? Also, deepvoice3 and waveflow use different fmin and fmax.
Do they need to be tweaked? Or any other parameter? Also, can you briefly explain the use of the important variables at inference time?
Hey, can anyone help me out with this? I believe I am very close to the solution but am probably not using the models properly.
Any help or suggestion is appreciated.
Thanks
Using librosa.feature.melspectrogram(S) seems to make no difference compared with creating the mel_basis and multiplying it with the spectrogram.
Oh, I made a mistake about the range of the mel spectrogram required by waveflow. It just log-scales the mel spectrogram, so the acceptable range is not [0, 1).
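To illustrate the range, here is a minimal sketch of the clip-then-log scaling used in the snippet earlier in this thread (clip_val = 1e-5 is taken from that snippet; the helper name log_mel and the sample values are hypothetical).

```python
import numpy as np


def log_mel(mel, clip_val=1e-5):
    """Natural-log scaling with a floor, as in the snippet quoted earlier."""
    return np.log(np.clip(mel, a_min=clip_val, a_max=None))


# Hypothetical mel magnitudes, including values below the floor and above 1.
mel = np.array([0.0, 1e-5, 1.0, 10.0])
out = log_mel(mel)
print(out.min(), out.max())
```

The floor maps to log(1e-5), roughly -11.5, and any magnitude above 1 maps to a positive value, so the output is clearly not confined to [0, 1).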
In my opinion, you have converted the spectrogram generated by deep voice 3 into a mel spectrogram acceptable to the waveflow model in the proper way.
As to fmin and fmax, I think it's okay to use the config of the vocoder, since these two parameters only affect the range of frequencies considered when generating the mel spectrogram. But we are using the spectrogram now, so it may not be important.
As for the other problem, the low volume of the synthesized waveform: I think I should look into the statistics of the generated mel spectrogram (transformed from the spectrogram) and those of the mel spectrogram extracted from audio files to see the difference.
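One way such a comparison could look is sketched below. It is an assumption-laden sketch: the helper names summarize and estimated_gain are hypothetical, both inputs are assumed to be natural-log mel spectrograms of comparable content, and the data here is synthetic.

```python
import numpy as np


def summarize(log_mel):
    """Basic statistics of a log-mel spectrogram of shape (n_mels, T)."""
    return {
        "mean": float(np.mean(log_mel)),
        "std": float(np.std(log_mel)),
        "min": float(np.min(log_mel)),
        "max": float(np.max(log_mel)),
    }


def estimated_gain(log_mel_generated, log_mel_reference):
    """Amplitude ratio implied by the mean offset between two natural-log mels."""
    return float(np.exp(np.mean(log_mel_generated) - np.mean(log_mel_reference)))


# Synthetic stand-ins: the "generated" mel is the reference at half amplitude.
rng = np.random.default_rng(0)
reference = rng.normal(loc=-5.0, scale=1.0, size=(80, 100))
generated = reference + np.log(0.5)

print(summarize(reference))
print(summarize(generated))
print(estimated_gain(generated, reference))
```

A constant offset in the log domain corresponds to a constant amplitude scale, so a mean offset gives a rough indication of a level mismatch; it says nothing about differences in spectral shape.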
Another problem is that deep voice 3 trains with spectrograms extracted with stft(n_fft=1024, win_length=1024, hop_length=256), but waveflow trains with spectrograms extracted with stft(n_fft=2048, win_length=1024, hop_length=256). This mismatch may cause some problems.
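One concrete consequence of the n_fft mismatch is the number of frequency bins: a one-sided STFT has n_fft // 2 + 1 bins. The sketch below only illustrates the shape arithmetic; the arrays are zero-filled placeholders, and n_mels = 80 and 10 frames are assumptions.

```python
import numpy as np


def num_freq_bins(n_fft):
    """Number of frequency bins in a one-sided STFT."""
    return n_fft // 2 + 1


dv3_bins = num_freq_bins(1024)       # 513
waveflow_bins = num_freq_bins(2048)  # 1025

# Placeholder arrays: a mel filter bank built for n_fft=2048 and a
# spectrogram with 10 frames computed with n_fft=1024.
mel_filter_bank = np.zeros((80, waveflow_bins))
spectrogram = np.zeros((dv3_bins, 10))

try:
    np.dot(mel_filter_bank, spectrogram)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(dv3_bins, waveflow_bins, shapes_compatible)
```

So a filter bank built with the vocoder's n_fft cannot be applied directly to a spectrogram computed with the other n_fft; the two have to agree before the mel projection.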
Hey, after all these experiments, and even using the latest pre-trained models, the output is still not as refined as in the demo.
We changed n_fft and related settings too, and also added the preemphasis check that we were missing before: https://github.com/PaddlePaddle/Parakeet/blob/develop/examples/deepvoice3/utils.py#L341
The output is better, but nowhere close to the one shown in the demo. Can you help with how to proceed?
Any update here?
In general, synthesizing from a predicted mel spectrogram would not be as good as synthesizing from a ground-truth mel spectrogram.
We have recently re-implemented deep voice 3 to make it a faithful implementation of what the paper describes (the current implementation is not). We are now training a deep voice 3 model with a Waveflow model as the vocoder. It will be released in the next update. Currently it is going well.
I am trying to use the WaveFlow vocoder with deepvoice3. To implement this, I have made minor tweaks to the codebase:
First, I modified examples/deepvoice3/utils.py to output only the mel bands rather than the synthesized wav
Now I call the modified eval_model from deepvoice3 to return the mel output
Reshape the mel to match waveflow's mel input
Once I have these mels, I pass them to waveflow for synthesis
But I only get either noise or a blank wav output.
I also tried processing the mels similarly to what waveflow does,
but the results are still the same. Can you help me identify what exactly I am doing wrong?
My assumption is that I am not supplying the mels to waveflow properly.
Thanks