Closed WendongGan closed 5 years ago
Some friends think that the reason is that the dataset is not enough and overfitting appears.
My code is from commit f4c04e2, committed on Nov 10, 2018. Training takes so long that I have not used the latest code. Does the latest code have this problem?
Did you generate the sample audio from a mel-spectrogram or from text?
The "abrupt noise" appears whether the audio is generated from a mel-spectrogram or from text; both conditions produce the same noise.
I'm trying the latest code, and I want to know whether the latest commits solve the problem. For example:
@UESTCgan Is it solved? My model has similar noise. 8.zip
I listened to your sample. How many steps did you train? How many hours is your training dataset? Do you mean this noise:
I also have this noise, but my "abrupt noise" is more severe. This is the noise:
I'm trying the author's latest code (https://github.com/NVIDIA/waveglow/commit/f4c04e2d968de01b22d2fb092bbbf0cec0b6586f). It has only reached 100k steps, which is not enough, so I'm not sure whether it solves the problem.
My model was trained for 1100 epochs, but it has a reverb effect.
My model was trained for 1100 epochs.
How many hours is your training dataset?
With 8 V100 GPUs on a GCP VM, it takes 5 days. My experiment settings are as follows. Num channels: 8 bit. Batch size: 80 (10 per GPU). Other parameters are default.
What is your sigma? I set it to 1.0 for both training and inference.
Sigma is sqrt(0.5) ≈ 0.7071... for training; that is the default in the WaveGlow paper.
Sigma is 0.66 for inference; that is the default in the demo.
Increasing sigma at inference reduces the background noise, but a larger sigma produces more reverb.
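For reference, sigma is just the standard deviation of the Gaussian latent that WaveGlow inverts into audio, which is why shrinking it at inference mutes the stochastic (noise-like) component. A minimal numpy sketch of that scaling (illustrative only, not the repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# WaveGlow trains with z ~ N(0, sigma^2), sigma = sqrt(0.5) ~= 0.7071.
train_sigma = np.sqrt(0.5)

def draw_latent(n, sigma, rng):
    """Sample the Gaussian latent that the flow inverts into audio."""
    return sigma * rng.standard_normal(n)

# At inference a smaller sigma (demo default 0.66) is common:
# the latent carries less energy, so less background hiss.
z_train = draw_latent(1_000_000, train_sigma, rng)
z_infer = draw_latent(1_000_000, 0.66, rng)

print(round(float(z_train.std()), 3))  # ~0.707
print(round(float(z_infer.std()), 3))  # ~0.66
```

In the actual repo this is the `sigma` argument passed to training and to `waveglow.infer`; the trade-off above (noise vs. reverb) is about how much of that latent energy you keep.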
I see, thank you !
My model was trained for 1100 epochs.
How many hours is your training dataset?
My dataset consists of 13,000 sentences, about 10 hours.
Does anybody have this problem? When the model is trained for 1000k steps on LJSpeech, the "abrupt noise" appears. For example:
The audio file is : LJ001-0007.wav_synthesis_01.zip
My config.json file is:
I used a single GPU.
Looking forward to your help!
I saw you used a 16k sampling rate. Isn't the sampling rate 22050 Hz for the LJSpeech dataset? Or does it matter? What does the segment length do? Does it have to be consistent with the sampling rate?
The segment length is independent of the sampling rate. It is fine to convert LJSpeech to 16 kHz. Note that if you train Tacotron in parallel, it must use the same audio specifications.
We've shared a quick hack to decrease the fixed noise from model's bias in waveglow : https://github.com/NVIDIA/tacotron2/issues/142#issuecomment-466506044
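As I understand the linked hack, it is spectral subtraction: synthesize audio from an empty mel-spectrogram to estimate the model's fixed bias noise, then subtract that bias magnitude from the STFT of real outputs. A rough numpy sketch of the subtraction step (the `bias` here is synthetic stand-in data; the real hack estimates it by running waveglow on an empty mel):

```python
import numpy as np

def denoise_frame(frame, bias_frame, strength=0.1):
    """Spectral subtraction on one audio frame: subtract `strength`
    times the bias magnitude from each bin, keeping the original phase."""
    spec = np.fft.rfft(frame)
    bias_mag = np.abs(np.fft.rfft(bias_frame))
    mag = np.abs(spec) - strength * bias_mag
    mag = np.clip(mag, 0.0, None)          # magnitudes cannot go negative
    phase = np.angle(spec)
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(frame))

rng = np.random.default_rng(0)
n = 1024
bias = 0.01 * rng.standard_normal(n)       # stand-in for the model's fixed noise
clean = np.sin(2 * np.pi * 220 * np.arange(n) / 16000)
noisy = clean + bias

denoised = denoise_frame(noisy, bias, strength=1.0)
```

The real implementation does this per STFT frame with overlap-add; too large a `strength` starts eating the speech itself, so it is worth tuning by ear.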
Closing due to inactivity.