NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

Abrupt noise #68

Closed WendongGan closed 5 years ago

WendongGan commented 5 years ago

Does anybody else have this problem? After training for 1000k steps on LJSpeech, an "abrupt noise" appears in the synthesized audio. For example: [spectrogram screenshots showing the noise]

The audio file is: LJ001-0007.wav_synthesis_01.zip

My config.json file is: [screenshot of config.json]

I used a single GPU.

Looking forward to your help!

WendongGan commented 5 years ago

Some friends suggested that the dataset is too small and the model is overfitting.

WendongGan commented 5 years ago

My code is from commit f4c04e2, committed on Nov 10, 2018. Training takes so long that I have not tried the latest code yet. Does the latest code have this problem?

Yeongtae commented 5 years ago

Did you generate the sample audio from a mel-spectrogram or from text?

WendongGan commented 5 years ago

The "abrupt noise" appears whether the audio is generated from a mel-spectrogram or from text; both conditions produce the same noise.
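For reference, here is a rough sketch of the two synthesis paths being compared (copy synthesis from a ground-truth mel vs. full TTS through Tacotron 2), assuming trained checkpoints are available. The checkpoint paths, the saved mel file, and the text_to_sequence helper are placeholders, not files from this thread; only waveglow.infer() and the torch.load(...)['model'] loading pattern come from the repo.

```python
# Illustrative sketch, not the repo's exact scripts.
import torch

waveglow = torch.load('waveglow_checkpoint.pt')['model'].cuda().eval()

# (a) copy synthesis: vocode a mel-spectrogram computed from the recording
mel = torch.load('LJ001-0007_mel.pt').cuda()            # (1, n_mel_channels, T)
with torch.no_grad():
    audio_from_mel = waveglow.infer(mel, sigma=0.6)

# (b) text-to-speech: predict the mel with Tacotron 2, then vocode it
tacotron2 = torch.load('tacotron2_checkpoint.pt')['model'].cuda().eval()
sequence = text_to_sequence('Printing, in the only sense ...')  # placeholder helper
with torch.no_grad():
    _, mel_pred, _, _ = tacotron2.inference(sequence)
    audio_from_text = waveglow.infer(mel_pred, sigma=0.6)
```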

WendongGan commented 5 years ago

I'm trying the latest code now, and I want to know whether the recent commits solve the problem. For example: [screenshot of recent commits]

Yeongtae commented 5 years ago

@UESTCgan Is it solved? My model has similar noise. 8.zip

WendongGan commented 5 years ago

> @UESTCgan Is it solved? My model has similar noise. 8.zip

I listened to your sample. How many steps did you train for? How many hours of audio are in your training set? Do you mean this noise:
[spectrogram screenshot]

I also have that noise, but my "abrupt noise" is more serious. It is this one:
[spectrogram screenshot]

I'm trying the author's latest code. The training is only at 100k steps, which is not enough yet, so I'm not sure whether it solves the problem. (https://github.com/NVIDIA/waveglow/commit/f4c04e2d968de01b22d2fb092bbbf0cec0b6586f)

Yeongtae commented 5 years ago

My model was trained for 1100 epochs, but it has a reverb effect.

WendongGan commented 5 years ago

> My model was trained for 1100 epochs.

How many hours of audio are in your training set?

Yeongtae commented 5 years ago

With 8 V100 GPUs in a GCP VM, training takes 5 days. My experiment settings are as follows:

Num channels: 8-bit
Batch size: 80 (10 per GPU)
All other parameters are the defaults.

WendongGan commented 5 years ago

What sigma do you use? I set it to 1.0 for both training and inference.

Yeongtae commented 5 years ago

Sigma is sqrt(0.5) ≈ 0.7071 for training, which is the default in the WaveGlow paper.

Sigma is 0.66 for inference, which is the default in the demo.
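For context, a minimal sketch of where sigma enters, following the negative log-likelihood in the WaveGlow paper and the repo's infer() interface; the function below is illustrative rather than the repo's exact WaveGlowLoss class.

```python
import torch

def waveglow_loss(z, log_s_list, log_det_W_list, sigma=1.0):
    # Negative log-likelihood of the flow: the Gaussian prior term is scaled by
    # 1 / (2 * sigma^2), so the training sigma sets the assumed width of the
    # latent distribution that z is pushed toward.
    log_s_total = sum(torch.sum(log_s) for log_s in log_s_list)
    log_det_W_total = sum(log_det_W_list)
    loss = torch.sum(z * z) / (2 * sigma ** 2) - log_s_total - log_det_W_total
    return loss / (z.size(0) * z.size(1) * z.size(2))

# At inference time, sigma scales the Gaussian noise that the flow transforms
# into audio; the demo uses a value below the training sigma, e.g.
# audio = waveglow.infer(mel, sigma=0.666)
```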

WendongGan commented 5 years ago

Increasing sigma at inference reduces the background noise.
[spectrogram screenshot]
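A minimal sketch for hearing this trade-off: generate the same utterance at several inference sigmas and compare. The checkpoint path, the saved mel tensor, and the 22050 Hz output rate are placeholders; use the sampling rate from your own config.

```python
import torch
from scipy.io.wavfile import write

waveglow = torch.load('waveglow_checkpoint.pt')['model'].cuda().eval()
mel = torch.load('example_mel.pt').cuda()          # (1, n_mel_channels, T)

for sigma in (0.5, 0.6, 0.7, 0.8, 1.0):
    with torch.no_grad():
        audio = waveglow.infer(mel, sigma=sigma)
    # write float32 audio in [-1, 1]; scale to int16 if your player needs it
    write('sigma_{:.1f}.wav'.format(sigma), 22050, audio[0].cpu().numpy())
```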

Yeongtae commented 5 years ago

But a large sigma produces more of a reverb effect.

WendongGan commented 5 years ago

> But a large sigma produces more of a reverb effect.

I see, thank you!

Yeongtae commented 5 years ago

> My model was trained for 1100 epochs.
>
> How many hours of audio are in your training set?

My dataset consists of 13,000 sentences, about 10 hours of audio.

yxt132 commented 5 years ago

> Does anybody else have this problem? After training for 1000k steps on LJSpeech, an "abrupt noise" appears in the synthesized audio. For example: [spectrogram screenshots showing the noise]
>
> The audio file is: LJ001-0007.wav_synthesis_01.zip
>
> My config.json file is: [screenshot of config.json]
>
> I used a single GPU.
>
> Looking forward to your help!

I saw that you used a 16k sampling rate. Isn't the sampling rate 22050 for the LJSpeech dataset? Or does it matter? Also, what does the segment length do? Does it have to be consistent with the sampling rate?

rafaelvalle commented 5 years ago

Segment length is independent of the sampling rate, and it is fine to convert LJSpeech to 16 kHz. Note that if you are training Tacotron in parallel, it must use the same audio specifications.
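To illustrate: the segment length in the data config counts raw audio samples, so the same value is valid at any sampling rate; only the duration it covers changes. A small sketch with example values (not recommendations):

```python
def segment_seconds(segment_length, sampling_rate):
    """Duration in seconds of a training segment of `segment_length` samples."""
    return segment_length / sampling_rate

for sr in (16000, 22050):
    print(sr, segment_seconds(16000, sr))
# 16000 samples -> 1.0 s at 16 kHz, ~0.73 s at 22.05 kHz
```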

rafaelvalle commented 5 years ago

We've shared a quick hack to decrease the fixed noise that comes from the model's bias in WaveGlow: https://github.com/NVIDIA/tacotron2/issues/142#issuecomment-466506044
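The linked comment targets the constant bias WaveGlow adds to every output. As a hedged illustration of the general idea (estimate the bias audio from an all-zero mel and subtract its magnitude spectrum), not the exact code in that comment; the mel size, strength, and STFT parameters below are example values.

```python
import torch

def estimate_bias(waveglow, n_mel_channels=80, frames=88):
    """Run WaveGlow on a silent (all-zero) mel to capture its constant bias audio."""
    mel_zeros = torch.zeros(1, n_mel_channels, frames).cuda()
    with torch.no_grad():
        return waveglow.infer(mel_zeros, sigma=0.0)

def denoise(audio, bias_audio, strength=0.1, n_fft=1024, hop=256):
    """Subtract a scaled copy of the bias magnitude spectrum from the audio."""
    window = torch.hann_window(n_fft).to(audio.device)
    spec = torch.stft(audio, n_fft, hop, window=window, return_complex=True)
    bias = torch.stft(bias_audio, n_fft, hop, window=window, return_complex=True)
    bias_mag = bias.abs().mean(dim=-1, keepdim=True)        # average over time
    mag = (spec.abs() - strength * bias_mag).clamp(min=0.0)  # spectral subtraction
    cleaned = torch.polar(mag, torch.angle(spec))            # keep original phase
    return torch.istft(cleaned, n_fft, hop, window=window)
```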

rafaelvalle commented 5 years ago

Closing due to inactivity.