PaddlePaddle / Parakeet

PAddle PARAllel text-to-speech toolKIT (supporting Tacotron2, Transformer TTS, FastSpeech2/FastPitch, SpeedySpeech, WaveFlow and Parallel WaveGAN)

Deep Voice 3 + WaveFlow Noisy Output #9

Closed aayushkubb closed 3 years ago

aayushkubb commented 4 years ago

I am trying to use the WaveFlow vocoder with Deep Voice 3. To do this, I made a few minor tweaks in the codebase:

First, I modified examples/deepvoice3/utils.py to output only the mel bands rather than the synthesized wav:

@fluid.framework.dygraph_only
def eval_model(model, text, replace_pronounciation_prob, min_level_db,
               ref_level_db, power, n_iter, win_length, hop_length,
               preemphasis, mel_only=False):
    """generate waveform from text using a deepvoice 3 model"""
    text = np.array(
        en.text_to_sequence(
            text, p=replace_pronounciation_prob),
        dtype=np.int64)
    length = len(text)
    print("text sequence's length: {}".format(length))
    text_positions = np.arange(1, 1 + length)

    text = np.expand_dims(text, 0)
    text_positions = np.expand_dims(text_positions, 0)
    model.eval()
    mel_outputs, linear_outputs, alignments, done = model.transduce(
        dg.to_variable(text), dg.to_variable(text_positions))

    if mel_only:
        return mel_outputs, alignments.numpy()[0]

    linear_outputs_np = linear_outputs.numpy()[0].T  # (C, T)
    wav = spec_to_waveform(linear_outputs_np, min_level_db, ref_level_db,
                          power, n_iter, win_length, hop_length, preemphasis)
    alignments_np = alignments.numpy()[0]  # batch_size = 1
    print("linear_outputs's shape: ", linear_outputs_np.shape)
    print("alignmnets' shape:", alignments.shape)
    return wav, alignments_np

Now I call the modified eval_model from Deep Voice 3 to return the mel output:

mel_wav, attn = eval_model(dv3, text, replace_pronounciation_prob,
                                       min_level_db, ref_level_db, power,
                                       n_iter, win_length, hop_length,
                                       preemphasis,mel_only=mel_only)

mel = mel_wav

Then I reshape the mel to match WaveFlow's expected mel input:

a, b, c = mel.shape
mel_new = F.reshape(mel, (a, c, b))
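
For clarity, a small numpy illustration of the layout change involved (mel_outputs comes out time-major while WaveFlow expects a (B, C, T) layout); note that swapping axes is a transpose, which is a different operation from a reshape. This is a purely illustrative sketch with dummy data:

import numpy as np

# Dummy mel with batch=1, T=3 frames, C=2 channels, just to show the layouts.
mel_btc = np.arange(6, dtype="float32").reshape(1, 3, 2)   # (B, T, C)

mel_bct_transpose = np.transpose(mel_btc, (0, 2, 1))       # (B, C, T): axes swapped
mel_bct_reshape = mel_btc.reshape(1, 2, 3)                 # same shape, but frames scrambled

print(mel_bct_transpose[0, 0])  # channel 0 over time: [0. 2. 4.]
print(mel_bct_reshape[0, 0])    # [0. 1. 2.] -- not the time series of channel 0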

Once I have these mels, I pass them to WaveFlow for synthesis:

waveflow_model = WaveFlow(waveflow_config, args.waveflow_checkpoint_dir)
waveflow_iteration = waveflow_model.build()

@dg.no_grad
def infer(self, mel):
    # self.waveflow.eval()
    config = self.config
    print(mel.shape, 'mel.shape')
    start_time = time.time()
    audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
    syn_time = time.time() - start_time
    return audio,start_time,syn_time

# create wav
wav, start_time, syn_time = waveflow_model.infer(mel_new)
wav = wav[0]
wav_time = wav.shape[0] / waveflow_config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(wav_time,
                                                        syn_time))
# Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
wav = wav.numpy().astype("float32") * 32768.0
wav = wav.astype('int16')
sample_rate = waveflow_config.sample_rate  

plot_alignment(
    attn,
    os.path.join(synthesis_dir,
                    "test_{}_step_{}.png".format(idx, iteration)))
sf.write(
    os.path.join(synthesis_dir,
                    "test_{}_step{}.wav".format(idx, iteration)),
    wav, sample_rate) 

But I only get either noise or a blank wav output.

I also tried processing the mels similarly to what WaveFlow does:

def process_mel(mel,config):

    '''Normalize mel similar to waveflow'''

    clip_val = 1e-5
    ref_constant = 100
    mel = fluid.layers.clip(x=mel, min=clip_val, max=10)
    mel = fluid.layers.scale(x=mel, scale=ref_constant)
    mel = fluid.layers.log(mel)

    return mel

but the results are still the same. Can you help me identify what exactly I am doing wrong?

My assumption is that I am not supplying the mels to WaveFlow properly.

Thanks

iclementine commented 4 years ago

If you have trained Deep Voice 3 with the default configuration, the returned mel_outputs of the model is a mel spectrogram downsampled by the decoder (downsample factor = 4).

https://github.com/PaddlePaddle/Parakeet/blob/8505805dadecd12a7047574bc1970bcdb21440ab/examples/deepvoice3/train.py#L246

In Deep Voice 3, the converter upsamples the mel spectrogram and converts it into linear_outputs, which is used as the input of the Griffin-Lim vocoder. You can check the shapes of mel_outputs and linear_outputs; linear_outputs has more time steps than mel_outputs. (Both have a BCT layout, B: batch_size, C: channels, T: time steps.)

So my advice is to use linear_outputs and transform it into the mel spectrogram required by WaveFlow.

Note that Deep Voice 3's linear_outputs is in the range [0, 1), which is the result of dB scaling and some normalization. You can invert these procedures to get the original spectrogram S, and convert it into a mel spectrogram with librosa. Then dB-scale and normalize it again into the range [0, 1).

Then it meets what WaveFlow needs (a non-downsampled mel spectrogram in the range [0, 1)).
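
A minimal numpy sketch of that round trip, assuming the default min_level_db = -100, ref_level_db = 20 and max_norm = 1 from the configs (an illustration of the idea, not the exact library code):

import numpy as np
import librosa

def linear_to_waveflow_mel(linear_norm, sr, n_fft, n_mels, fmin, fmax,
                           min_level_db=-100, ref_level_db=20, max_norm=1.0):
    """linear_norm: (n_fft // 2 + 1, T) spectrogram normalized to [0, 1) by deep voice 3."""
    # 1. Undo the [0, 1) normalization and the dB scaling to recover the magnitude spectrogram S.
    s_db = np.clip(linear_norm / max_norm, 0, 1) * (-min_level_db) + min_level_db
    S = np.power(10.0, (s_db + ref_level_db) / 20.0)

    # 2. Project onto the mel basis.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
    S_mel = np.dot(mel_basis, S)

    # 3. dB-scale and normalize again into [0, 1).
    amplitude_min = np.power(10.0, min_level_db / 20.0)  # 1e-5 for -100 dB
    mel_db = 20 * np.log10(np.maximum(amplitude_min, S_mel)) - ref_level_db
    mel_norm = max_norm * (mel_db - min_level_db) / (-min_level_db)
    return np.clip(mel_norm, 0, max_norm)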

Maybe you can check the preprocessing of Deep Voice 3 (Transform in examples/deepvoice3/data.py) or of WaveFlow.

https://github.com/PaddlePaddle/Parakeet/blob/8505805dadecd12a7047574bc1970bcdb21440ab/examples/deepvoice3/data.py#L98

# STFT
D = librosa.stft(y=y,
                 n_fft=self.n_fft,
                 win_length=self.win_length,
                 hop_length=self.hop_length)
S = np.abs(D)

# to db and normalize to 0-1
amplitude_min = np.exp(self.min_level_db / 20 * np.log(10))  # 1e-5
S_norm = 20 * np.log10(np.maximum(amplitude_min,
                                  S)) - self.ref_level_db
S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
S_norm = self.max_norm * S_norm
if self.clip_norm:
    S_norm = np.clip(S_norm, 0, self.max_norm)

# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(S=S,
                                       n_mels=self.n_mels,
                                       fmin=self.fmin,
                                       fmax=self.fmax,
                                       power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
                                 S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
S_mel_norm = self.max_norm * S_mel_norm
if self.clip_norm:
    S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)

And we are now also working on combining TTS models with our neural vocoders.

aayushkubb commented 4 years ago

Hey,

Thanks for the response. I tried as you suggested:

wav_mel,linear_outputs, attn = eval_model(dv3, text, replace_pronounciation_prob,
                        min_level_db, ref_level_db, power,
                        n_iter, win_length, hop_length,
                        preemphasis,mel_only=mel_only)

#reshaping it
linear_outputs_np = linear_outputs.numpy()[0].T  # (C, T)

#Denormalize and scaling
denormalized = np.clip(linear_outputs_np, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))

#get mel spec
S_mel = librosa.feature.melspectrogram(S=lin_scaled,
                                       n_mels=config['transform']['n_mels'],
                                       fmin=config['transform']['fmin'],
                                       fmax=config['transform']['fmax'],
                                       power=1.)

# get config values
max_norm = config['transform']['max_norm']
amplitude_min = np.exp(min_level_db / 20 * np.log(10))  # 1e-5

# db scale again
S_mel = 20 * np.log10(np.maximum(amplitude_min,
                                S_mel)) - ref_level_db
#Normalize again
S_mel_norm = (S_mel - min_level_db) / (-min_level_db)
S_mel_norm = max_norm * S_mel_norm

#clip again to 0,1
if config['transform']['clip_norm']:
    S_mel_norm = np.clip(S_mel_norm, 0, 1)

#reshape
a, b = S_mel_norm.shape
S_mel_norm = S_mel_norm.reshape(1, a, b)

# Convert to fluid type
S_mel_norm = dg.to_variable(S_mel_norm)

#pass the mel to waveflow
wav, start_time,syn_time = waveflow_model.infer(S_mel_norm)

wav = wav[0]
wav_time = wav.shape[0] / waveflow_config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(wav_time,
                                                        syn_time))
# Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
wav = wav.numpy().astype("float32") * 32768.0
wav = wav.astype('int16')
sample_rate = waveflow_config.sample_rate  

plot_alignment(
    attn,
    os.path.join(synthesis_dir,
                    "test_{}_step_{}.png".format(idx, iteration)))
sf.write(
    os.path.join(synthesis_dir,
                    "test_{}_step{}.wav".format(idx, iteration)),
    wav, sample_rate) 

but I am still getting noise in the output. Can you point out where exactly I am messing up?

iclementine commented 4 years ago

Does the attn plot look like a diagonal line (as expected)?

Also, make sure that after calling load_parameters on a Deep Voice 3 model you remove the weight norm from every WeightNormWrapped layer; this is necessary.

https://github.com/PaddlePaddle/Parakeet/blob/8505805dadecd12a7047574bc1970bcdb21440ab/examples/deepvoice3/synthesis.py#L126
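
Roughly, the cleanup after loading looks like the following (the wrapper class name, import path and method here are my reading of the linked synthesis.py, so treat them as assumptions and check against the actual file):

from parakeet.modules.weight_norm import WeightNormWrapper

# After the checkpoint has been restored into dv3:
for layer in dv3.sublayers():
    if isinstance(layer, WeightNormWrapper):
        # Fold the weight-norm reparameterization back into a plain weight
        # so inference uses the same effective weights as training.
        layer.remove_weight_norm()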

aayushkubb commented 4 years ago

Hey,

Yes, I am removing the weight norm for every WeightNormWrapped layer, and the attn plot looks decently diagonal.

There were a couple of changes I made, as I was not using the correct mel extraction:


wav_mel,linear_outputs, attn = eval_model(dv3, text, replace_pronounciation_prob,
                        min_level_db, ref_level_db, power,
                        n_iter, win_length, hop_length,
                        preemphasis,mel_only=mel_only)

linear_outputs_np = linear_outputs.numpy()[0].T  # (C, T)

denormalized = np.clip(linear_outputs_np, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))

After extracting the linear outputs from Deep Voice 3, I convert them to numpy, then denormalize and scale the linear features.

Secondly (this is the part I changed compared to my previous code), I extract the mel features here:

#get mel spec
mel_filter_bank = librosa.filters.mel(sr=sample_rate,
                            n_fft=waveflow_config.fft_size,
                            n_mels=waveflow_config.mel_bands,
                            fmin=waveflow_config.mel_fmin,
                            fmax=waveflow_config.mel_fmax)

Now, I take the dot product of mel_filter_bank with the magnitudes, i.e. lin_scaled, similar to what WaveFlow does: https://github.com/PaddlePaddle/Parakeet/blob/develop/examples/waveflow/data.py#L51

and also raise it to the power (adding the power had no real audible effect):

mel = np.dot(mel_filter_bank, np.abs(lin_scaled)**power)

Once I have the mels, I normalize and reshape them and convert them to a fluid variable to make them usable:

# Normalize mel.
clip_val = 1e-5
ref_constant = 1
mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)

# reshape
a, b = mel.shape
S_mel_norm = mel.reshape(1, a, b)

# Convert to fluid type
S_mel_norm = dg.to_variable(S_mel_norm)

#pass the mel to waveflow
wav, start_time,syn_time = waveflow_model.infer(S_mel_norm)

wav = wav[0]
wav_time = wav.shape[0] / waveflow_config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(wav_time,
                                                        syn_time))
# Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
wav = wav.numpy().astype("float32") * 32768.0
wav = wav.astype('int16')
sample_rate = waveflow_config.sample_rate  

Now I am able to get clear output; however, the volume is pretty low.

Have a look here. Audio sample 1: https://sndup.net/4673; its attn plot: after_test_0_step_170000

Audio sample 2: https://sndup.net/8b79; its attn plot: after_test_1_step_170000

I am using the default configs only. However, by changing the dB-scale values, e.g. min_level_db: -100 and ref_level_db: 20, I can get louder wav output, but it introduces noise in some audios and at times fully noisy audio. Example: https://sndup.net/55nw (attn plot: after_test_0_step_170000)

Can you please help identify which parameter may be going wrong? Also, we have different fmin and fmax in Deep Voice 3 and WaveFlow.

Do they need to be tweaked, or any other parameter? Also, can you briefly explain the use of the important variables at inference time?

aayushkubb commented 4 years ago

Hey, can anyone help me out on this? I believe I am very close to the solution, but I am probably not using the models properly.

Any help or suggestion is appreciated.

Thanks

iclementine commented 4 years ago

Using librosa.feature.melspectrogram(S=S) seems to be no different from creating the mel_basis and multiplying it with the spectrogram.

https://github.com/librosa/librosa/blob/0b6c1167b2dea83c48cec7bf22c4720fdffd0b7a/librosa/feature/spectral.py#L1827

Oh, I made a mistake about the range of the mel spectrogram required by WaveFlow. It just log-scales the mel spectrogram, so the expected range is not [0, 1).

https://github.com/PaddlePaddle/Parakeet/blob/8505805dadecd12a7047574bc1970bcdb21440ab/examples/waveflow/data.py#L68
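
For reference, the normalization at that line amounts to something like the following (clip_val and the scaling constant C come from the waveflow config, so the exact values here are an assumption):

import numpy as np

def waveflow_mel_norm(mel, clip_val=1e-5, C=1.0):
    # Log-compress the mel spectrogram; no rescaling into [0, 1).
    return np.log(np.clip(mel, a_min=clip_val, a_max=None) * C)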


In my opinion, you have converted the spectrogram generated by Deep Voice 3 into a mel spectrogram acceptable to the WaveFlow model in the proper way.

As to fmin and fmax, I think it's okay to use the vocoder's config, since these two parameters only affect the frequency range considered when generating the mel spectrogram. But we are starting from a spectrogram now, so it may not be important.

For the other problem, that the volume of the synthesized waveform is low, I would look into the statistics of the generated mel spectrogram (transformed from the spectrogram) and of the mel spectrogram extracted from the audio files to see the difference.
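
For example, a quick way to compare the two (the arrays and names below are placeholders; in practice, pass the mel transformed from deep voice 3's linear_outputs and a mel produced by the waveflow preprocessing from a training wav):

import numpy as np

def mel_stats(name, mel):
    """Print simple summary statistics of a mel spectrogram array."""
    print("{}: shape {}, min {:.3f}, max {:.3f}, mean {:.3f}, std {:.3f}".format(
        name, mel.shape, mel.min(), mel.max(), mel.mean(), mel.std()))

# Placeholder arrays standing in for the two mel spectrograms being compared.
mel_from_dv3 = np.random.uniform(-10, 0, size=(80, 400)).astype("float32")
mel_from_audio = np.random.uniform(-10, 0, size=(80, 400)).astype("float32")

mel_stats("predicted", mel_from_dv3)
mel_stats("extracted", mel_from_audio)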

iclementine commented 4 years ago

Another problem is that Deep Voice 3 trains on spectrograms extracted with stft(n_fft=1024, win_length=1024, hop_length=256), while WaveFlow trains on spectrograms extracted with stft(n_fft=2048, win_length=1024, hop_length=256). This mismatch may cause some problems.
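
To make the mismatch concrete: with the same hop_length the frame counts match, but the frequency resolution (and therefore the mel filterbank built from n_fft) does not. A small check with dummy audio (the sample rate here is only for illustration):

import numpy as np
import librosa

y = np.random.randn(22050).astype("float32")  # 1 second of dummy audio

D_dv3 = librosa.stft(y, n_fft=1024, win_length=1024, hop_length=256)
D_wf = librosa.stft(y, n_fft=2048, win_length=1024, hop_length=256)

print(D_dv3.shape)  # (513, frames)  -> 1 + 1024 // 2 frequency bins
print(D_wf.shape)   # (1025, frames) -> 1 + 2048 // 2 frequency bins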

aayushkubb commented 4 years ago

Hey, after all these experiments, and even using the latest pre-trained models, the output is not as refined as the one in the demo.

We changed n_fft and the related parameters too, and also added the preemphasis check we were missing before: https://github.com/PaddlePaddle/Parakeet/blob/develop/examples/deepvoice3/utils.py#L341

The output is better, but nothing close to the one shown in the demo. Can you help with how to proceed?

aayushkubb commented 4 years ago

Any update here?

iclementine commented 4 years ago

In general, synthesizing from a predicted mel spectrogram will not be as good as synthesizing from a ground-truth mel spectrogram.

We have recently re-implemented Deep Voice 3 to make it a faithful implementation of the paper (the current implementation is not). We are also training a Deep Voice 3 model with a WaveFlow model as the vocoder. It will be released in the next update. Currently it is going well.