Open mortont opened 6 years ago
Hi @mortont Could you provide more information? For example, the loss curve, some generated samples and the code commit.
Sure, I'm on commit 67eacb995aef465d1e2ed810f25e0d7d3899e9b6
and this is the loss curve
This is one of the generated samples: gen_LJ001-0028.wav.zip For reference, this is the original that was used for the mel conditioning: LJ001-0028.wav.zip
What loss did you let the teacher get to in your examples? Wondering if this just needs more training time. Let me know if any other information would be helpful.
In my experience, if a small batch size such as 2 is used, 200k training steps is not enough, because the model hasn't seen enough data. What's your batch size? Also, please ensure USE_RESIZE_CONV=False in masked.py. This is my loss curve.
Ah, thank you. My batch size is 2 so I'll continue training. I'll go ahead and close this issue and report back when I have good samples for posterity.
Hi @mortont , I'm sorry that the wavenet results cannot be reproduced with the default config json. Recently I cannot reproduce the previous result either, and I finally figured out that weight normalization harms the wavenet performance. So setting use_weight_norm=false in wavenet_mol.json may solve your problem. Initially weight normalization produced some promising results for parallel wavenet, so I kept it as the default configuration. Probably it has a negative effect on the learned mel-spectrum representation, and a good mel-spectrum representation is vital for a good model, since the wave sample points totally depend on it. I'm also doing similar tests on parallel wavenet. Once again, I'm sorry.
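For anyone following along, a minimal sketch of flipping that flag programmatically; the path config_jsons/wavenet_mol.json is an assumption, adjust it to wherever wavenet_mol.json lives in your checkout:

# Hypothetical helper, not part of the repo: load wavenet_mol.json,
# disable weight normalization and write the config back.
import json

config_path = 'config_jsons/wavenet_mol.json'  # assumed location of the config
with open(config_path) as f:
    config = json.load(f)

config['use_weight_norm'] = False  # the flag discussed in this thread

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)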
Good catch @bfs18, thank you! I'll try training with use_weight_norm set to false. Unfortunately my training is a bit slow with a single GPU. Do you have a pretrained teacher model to share? I'd be happy to help on any parts of parallel wavenet once I have a good teacher.
Actually, I just saw the updated readme... I'll check that out now.
Hi,
I tested the code; setting use_weight_norm=False solves the problem.
And I implemented http://export.arxiv.org/abs/1807.07281 this weekend. I will update the code after some testing.
I am getting an error when loading the pre-trained model https://drive.google.com/open?id=13rHT6zr2sXeedmjUOpp6IVQdT30cy66_
File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/training/saver.py", line 1812, in latest_checkpoint if file_io.get_matching_files(v2_path) or file_io.get_matching_files( File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files for single_filename in filename File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__ c_api.TF_GetCode(self.status.status)) tensorflow.python.framework.errors_impl.NotFoundError: /data/logs-wavenet/eval_ckpt/ns_pwn-n_MU-WN_DDI_mfinit-n_LOGS-n_CLIP-MAG-L2-06_27; No such file or directory
Any idea?
Hi @zhang-jian, I'm sorry for that. The first line of ns_pwn-eval/checkpoint contains the absolute path of the model checkpoint. You can modify the path according to your file system, or you can keep only the basename of that path. Refer to ns_wn-eval/checkpoint for an example.
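For reference, a minimal sketch (not from the repo) of rewriting that file so the checkpoint paths become plain basenames; it assumes the standard TensorFlow checkpoint index format with quoted paths:

import os
import re

ckpt_index = 'ns_pwn-eval/checkpoint'  # the index file mentioned above

def strip_to_basename(line):
    # lines look like: model_checkpoint_path: "/abs/path/model.ckpt-XXXXX"
    return re.sub(r'"([^"]*)"',
                  lambda m: '"%s"' % os.path.basename(m.group(1)), line)

with open(ckpt_index) as f:
    lines = f.readlines()
with open(ckpt_index, 'w') as f:
    f.writelines(strip_to_basename(l) for l in lines)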
I was able to confirm that the pretrained model produces a recognizable voice, all that was needed was changing the path in the checkpoint file to a relative one. Great work!
@bfs18 ClariNet looks very interesting, mostly because of the more numerically stable implementation. I've noticed that parallel wavenet optimization is very difficult and unstable, so hopefully ClariNet helps with that.
I got some new examples running with contrastive loss and without weight normalization at step 70948. The result may improve a bit after longer training. gen_LJ001-0001 gen_LJ001-0001
I looked deeper into the problem. Some bad configurations (e.g. weight normalization + tanh transposed-conv activation) may cause the activations of the transposed convolution layer to saturate, so the mel condition becomes meaningless and the model degenerates to an unconditional one. The following figures are the histogram and spectrum of the transposed convolution stack output. This model only generates random speech even though the mel condition is used. In contrast, the following figures come from an OK model: most of the activation values are close to 0, so the learned representation may be considered sparse. I think there are 2 solutions.
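To make the diagnosis above concrete, here is a small sketch (not the repo's code) for checking whether a tanh deconv stack has saturated, given its activations pulled out of a session run; the name deconv_out is a placeholder for whatever tensor the model exposes:

import numpy as np

def saturation_report(deconv_out, thresh=0.99):
    # deconv_out: array of tanh activations from the transposed conv stack
    flat = np.asarray(deconv_out).ravel()
    frac_saturated = np.mean(np.abs(flat) > thresh)
    print('fraction of |activation| > %.2f: %.3f' % (thresh, frac_saturated))
    print('mean %.3f, std %.3f' % (flat.mean(), flat.std()))
    # a healthy conditioning stack keeps most values near 0;
    # a saturated one piles them up at the +/-1 edges.
    return np.histogram(flat, bins=50, range=(-1.0, 1.0))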
Hi, I am running eval_parallel_wavenet.py. After 60k training steps it can generate audio with content, however the sound is very quiet. Is this problem related to the power loss? Besides, the config does not include the contrastive loss; how should I set this parameter?
Hi @EdisonChen726 I uploaded the model with contrastive loss. You can find the configuration json in the package. https://drive.google.com/open?id=1AtofQdXbSutb-_ZWFeA_I17NR2i8nUC7
@bfs18 thank you for the fast reply, I will try it asap
Updated ClariNet vocoder results. The ClariNet results have similar noise compared to the pwn results, so I think the noise comes from the power loss term. Compared to the teacher result, the student result does not have clear formants between 1000 Hz and 3000 Hz. This may be the source of the noise in the waves generated by the student. (teacher spec / student spec figures) The priority frequency loss implemented in keithito/tacotron may alleviate the problem.
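For reference, the idea in keithito/tacotron is roughly to count the low-frequency error twice; a numpy sketch of that priority-frequency term (not this repo's loss, and the 3000 Hz cutoff is the value used there):

import numpy as np

def priority_freq_loss(target_spec, pred_spec, sample_rate=16000, priority_hz=3000):
    # target_spec, pred_spec: [frames, num_freq] magnitude spectrograms
    num_freq = target_spec.shape[-1]
    n_priority = int(priority_hz / (sample_rate * 0.5) * num_freq)
    l1 = np.abs(target_spec - pred_spec)
    # plain L1 plus an extra term over the bins below priority_hz
    return 0.5 * np.mean(l1) + 0.5 * np.mean(l1[:, :n_priority])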
@bfs18 hi, have you met the problem of a very quiet audio result? I need to turn the volume up very high to hear the voice. Do you have any idea why this happens? The volume of the teacher model's result is good, but the pwn's is not.
Hi @EdisonChen726
Setting use_mu_law=True would cause low volume when training parallel wavenet.
It is caused by clip_quant_scale function in L13 in wavenet/parallelgen.py. I don't know how to solve the problem.
You can test it with the following code.
import numpy as np
import librosa

def inv_mu_law_numpy(x, mu=255.0):
    # x: integer mu-law codes (roughly [-128, 127] for mu=255); returns a waveform in [-1, 1]
    x = np.array(x).astype(np.float32)
    out = (x + 0.5) * 2. / (mu + 1)
    out = np.sign(out) / mu * ((1 + mu) ** np.abs(out) - 1)
    out = np.where(np.equal(x, 0), x, out)
    return out

def cast_quantize_numpy(x, quant_chann):
    # rescale [-1, 1] samples to integer codes in [-quant_chann/2, quant_chann/2) and cast to int32
    x_quantized = x * quant_chann / 2
    return x_quantized.astype(np.int32)

audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_int = cast_quantize_numpy(audio, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)
The volume of the output wave becomes very low.
@bfs18 got it! Thank you so much. Right now I have another problem: when I add the contrastive loss with 0.3, I get the following error: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4,64,1,7680] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. Do you know how to solve it?
@EdisonChen726 just try a smaller batch size.
@bfs18 Hi, I used your model (wavenet_mol, without pwn) to synthesize test speech; the silent parts become a murmur, while the non-silent parts are normal. Do you know why? Is it because of the trimming at training time?
Hi @switchzts The use_mu_law + mol waves are much cleaner in the silent parts. However, the no_mu_law + mol waves are just as you say, so 200k steps may not be enough to train a good no_mu_law + mol model. I am not sure whether trimming is a problem.
@bfs18 Hi, I tried setting the batch size to 1, but the same error happened.
Hi @bfs18 , I have some questions about initialization.
Why is the scale_params bias initialized to -0.3? Is this an empirical value? And why not use a log scale in the student net? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L243 https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L92
In the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." In parallel_wavenet.py, mean_tot and scale_tot are initialized to 0 and 1; which initializer is modified to achieve the proper initial values for mean_tot and scale_tot (0.0 and 0.05)? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L276
Thank you!
Hi @EdisonChen726 What's your GPU memory size? I only run the code on GPUs with 12 GB of memory or more.
Hi @HallidayReadyOne
-- Why is the scale_params bias initialized to -0.3? Is this an empirical value?
Yes. I wrote some memos in the comments in test_scale on why I chose this value and why a log scale is not used. Let me know if you need further explanation.
Hi @bfs18, thanks for the kind reply. I still need some guidance. In test_scale.py, you set the input data to be normally distributed with mean 0.0 and std 1.0 because, after data-dependent initialization for weight normalization, the output of a conv/deconv is approximately normally distributed with mean 0.0 and std 1.0? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/tests/test_scale.py#L137 However, you also set use_weight_normalization = False for both wn & pwn. If use_weight_normalization = False, is this assumption still true (that the output of a conv/deconv is approximately normally distributed with mean 0.0 and std 1.0)?
Hi @HallidayReadyOne You are right, this value was picked when use_weight_norm=True. Since it is chosen by experience, it is not that strict. When setting use_weight_norm=False, the initial scale is still small enough, so I kept this value.
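For context, a minimal numpy sketch of the data-dependent initialization idea behind weight normalization (Salimans & Kingma), which is what makes the "roughly N(0, 1) after init" assumption in test_scale reasonable; this is illustrative only, not the repo's actual initializer:

import numpy as np

def ddi_linear_init(x, out_dim, rng=np.random):
    # x: [batch, in_dim] sample batch used only for initialization
    in_dim = x.shape[1]
    v = rng.normal(0.0, 0.05, size=(in_dim, out_dim))   # direction weights
    t = x.dot(v) / np.linalg.norm(v, axis=0)             # pre-activations
    mean, std = t.mean(axis=0), t.std(axis=0)
    g = 1.0 / (std + 1e-8)                               # scale
    b = -mean / (std + 1e-8)                             # bias
    y = g * t + b                                        # output is ~ mean 0, std 1
    return v, g, b, y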
Thanks @bfs18, another question about initialization: in the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." Could you please explain a little about how this is achieved?
@bfs18 my gpu memory size is also 12GB
Hi @bfs18 , you mentioned earlier that setting use_mu_law=True would cause low volume when training parallel wavenet (I also encountered the same problem).
Could it be that this is not caused by the clip_quant_scale function at L13 in wavenet/parallelgen.py?
I tested it with the following code:
audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_mu = mu_law_numpy(audio)
audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 ** 8 / 2.)
audio_int = cast_quantize_numpy(audio_mu_scaled, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)
Because, when use_mu_law is set to true, the output of the student model is scaled mu-law data, the two added lines (the mu_law_numpy call and the scaling by 2 ** 8 / 2.) are needed to test the correctness of the clip_quant_scale function?
The volume of the output does not become lower, which shows that the low-volume output may be caused by the output of the student model rather than by the post-processing in wavenet/parallelgen.py?
@HallidayReadyOne
Hi @EdisonChen726
Hi @bfs18 ,
Since "the output of the student model is scaled mu-law data", in the best case, the output of student model equal to " audio, _ = librosa.load('test_data/test.wav', sr=16000) audio_mu = mu_law_numpy(audio) audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 8 / 2.) " So, as long as the student model is well trained, its output will be close to audio_mu_scaled? Then execute the following code, " audio_int = cast_quantize_numpy(audio_mu_scaled, 2 8) audio_ = inv_mu_law_numpy(audio_int) librosa.output.write_wav('test_data/testinv.wav', audio, sr=16000) " the volume of audio_ should be kept?
Hi @HallidayReadyOne
audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_mu = mu_law_numpy(audio)
audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 ** 8 / 2.)
is the input preprocessing.
audio_int = cast_quantize_numpy(audio_mu_scaled, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)
is the output postprocessing. As far as I can tell, the encoding/decoding logic in the code is correct. The volume should be kept if everything goes well, so obviously something is wrong.
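For completeness, the two halves above combined into one runnable check of the round trip. inv_mu_law_numpy and cast_quantize_numpy are the helpers posted earlier in this thread; mu_law_numpy here follows the standard mu-law companding used in the Magenta/NSynth code and may differ slightly from this repo's exact helper:

import numpy as np
import librosa

def mu_law_numpy(x, mu=255.0):
    # standard mu-law companding, scaled to integer codes roughly in [-128, 127]
    out = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor(out * 128)

def inv_mu_law_numpy(x, mu=255.0):
    x = np.array(x).astype(np.float32)
    out = (x + 0.5) * 2. / (mu + 1)
    out = np.sign(out) / mu * ((1 + mu) ** np.abs(out) - 1)
    out = np.where(np.equal(x, 0), x, out)
    return out

def cast_quantize_numpy(x, quant_chann):
    return (x * quant_chann / 2).astype(np.int32)

audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_mu_scaled = mu_law_numpy(audio) / (2 ** 8 / 2.)   # what the student should output
audio_out = inv_mu_law_numpy(cast_quantize_numpy(audio_mu_scaled, 2 ** 8))
print('rms in: %.4f  rms out: %.4f' % (np.sqrt(np.mean(audio ** 2)),
                                       np.sqrt(np.mean(audio_out ** 2))))
# if the two RMS values match, the pre/post-processing preserves volume and
# any drop must come from the student's own output.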
@bfs18 , I totally agree with you. My view is that the inaccurate output of the student model is the main cause of the volume drop, rather than the postprocessing in wavenet/parallelgen.py. I will try to solve it and keep you informed of my progress. Please let me know if you have any new ideas about this problem. Thanks.
A shared deconvolution stack can reconstruct the formants between 1000 and 3000 Hz.
When using separate deconvolution stacks, the tanh activation functions of the first 2 deconvolution stacks tend to saturate, and then the condition becomes meaningless. A shared deconvolution stack eliminates this problem; at least every IAF stack receives a meaningful conditional representation. I also tried using the teacher's deconv stack weights for the student and not updating them while training the student (a sketch of this follows below); the formants are much clearer than with separate deconv stacks.
A shared deconvolution stack can reconstruct the formants between 1000 and 3000 Hz.
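A minimal TF 1.x sketch of the "reuse the teacher's deconv stack and keep it frozen" idea mentioned above; the scope name 'deconv_stack' and the learning rate are assumptions, not the repo's actual names:

import tensorflow as tf

def build_student_train_op(loss, teacher_ckpt, deconv_scope='deconv_stack', lr=1e-4):
    # collect the conditioning deconv variables and exclude them from training
    deconv_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=deconv_scope)
    student_vars = [v for v in tf.trainable_variables() if v not in deconv_vars]
    train_op = tf.train.AdamOptimizer(lr).minimize(loss, var_list=student_vars)
    # restore the frozen deconv stack from the teacher checkpoint at startup
    deconv_saver = tf.train.Saver(var_list=deconv_vars)
    restore_fn = lambda sess: deconv_saver.restore(sess, teacher_ckpt)
    return train_op, restore_fn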
Why is my output meaningless babble instead of a sentence? Do you have any idea about it?
@bfs18 Should I set "use_weight_norm": true to false?
@switchzts Hi, set it to False and use leaky_relu as the act_func; then the net will be easier to train. I will update the default config json to a more stable and easier-to-train one.
@bfs18 Thanks very much for your sharing! After you committed your latest code (commit 3fa872b) on Oct 22, I tried to reproduce it. I trained the teacher model and the result is better! But there are a few problems. I have trained for 608k steps with a batch size of 4; the loss is as follows:
(loss curve image)
The result is as follows:
my audio my-result-and-ori.zip
Where the mel spectrum is very dense, there is some noise. Have you ever had a problem like this? Do you have any idea? Will this affect the student network? Looking forward to your reply and help! Thank you! I also hope other experienced friends can discuss this together. @switchzts
I've been trying to reproduce the parallel wavenet results, however I'm running into some issues with training the teacher model. I have trained it on the LJ Speech dataset with the default wavenet_mol.json configuration to ~280k steps (all other hparams are unchanged as well). The loss looks good; however, the evaluated speech is just babbling, as if local conditioning wasn't used.
I didn't see anything immediately apparent as to why this is happening; do you have any ideas?