bfs18 / nsynth_wavenet

parallel wavenet based on nsynth

Reproducing parallel results #8

Open mortont opened 6 years ago

mortont commented 6 years ago

I've been trying to reproduce the parallel wavenet results, but I'm running into some issues with training the teacher model. I trained it on the LJ Speech dataset with the default wavenet_mol.json configuration to ~280k steps (all other hparams unchanged as well). The loss looks good, but the evaluated speech is just babbling, as if local conditioning wasn't used.

I didn't see anything immediately apparent as to why this is happening. Do you have any ideas?

bfs18 commented 6 years ago

Hi @mortont, could you provide more information? For example, the loss curve, some generated samples, and the code commit.

mortont commented 6 years ago

Sure, I'm on commit 67eacb995aef465d1e2ed810f25e0d7d3899e9b6 and this is the loss curve

[screenshot: teacher loss curve]

This is one of the generated samples: gen_LJ001-0028.wav.zip For reference, this is the original that was used for the mel conditioning: LJ001-0028.wav.zip

What loss did your teacher model reach in your experiments? I'm wondering if this just needs more training time. Let me know if any other information would be helpful.

bfs18 commented 6 years ago

In my experience, if a small batch size such as 2 is used, 200k training steps are not enough, because the model hasn't seen enough data. What's your batch size? Also, please ensure USE_RESIZE_CONV=False in masked.py. This is my loss curve:

[image: teacher loss curve]

mortont commented 6 years ago

Ah, thank you. My batch size is 2 so I'll continue training. I'll go ahead and close this issue and report back when I have good samples for posterity.

bfs18 commented 6 years ago

Hi @mortont, I'm sorry that the wavenet results cannot be reproduced with the default config json. Recently I could not reproduce the previous result either, and I finally figured out that weight normalization harms the wavenet performance. So setting use_weight_norm=false in wavenet_mol.json may solve your problem. Initially weight normalization produced some promising results for parallel wavenet, so I kept it as the default configuration. Probably it has negative effects on the learned mel-spectrum representation, and a good mel-spectrum representation is vital for a good model, since the wave sample points depend entirely on it. I'm also doing similar tests on parallel wavenet. Once again, I'm sorry.
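For reference, one quick way to flip the flag programmatically; this is only a convenience sketch, the config path is illustrative and it assumes use_weight_norm is a top-level key in the json:

import json

# Load the training config, disable weight normalization, and write it back.
# Adjust 'wavenet_mol.json' to wherever the config lives in your checkout.
with open('wavenet_mol.json') as f:
    hparams = json.load(f)
hparams['use_weight_norm'] = False
with open('wavenet_mol.json', 'w') as f:
    json.dump(hparams, f, indent=2)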

mortont commented 6 years ago

Good catch @bfs18, thank you! I'll try training with use_weight_norm as false. Unfortunately my training is a bit slow with a single GPU. Do you have a pretrained teacher model to share? I'd be happy to help on any parts of parallel wavenet once I have a good teacher.

mortont commented 6 years ago

Actually, I just saw the updated readme... I'll check that out now.

bfs18 commented 6 years ago

Hi, I tested the code; setting use_weight_norm=False solves the problem. I also implemented http://export.arxiv.org/abs/1807.07281 (ClariNet) this weekend. I will update the code after some testing.

zhang-jian commented 6 years ago

I am getting an error when loading the pre-trained model https://drive.google.com/open?id=13rHT6zr2sXeedmjUOpp6IVQdT30cy66_

  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/training/saver.py", line 1812, in latest_checkpoint
    if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/lib/io/file_io.py", line 337, in get_matching_files
    for single_filename in filename
  File "/scratch/spack/opt/spack/linux-centos7-x86_64/gcc-4.8.5/tensorflow-1.6.0-wm4rkg4qsenbhfdz7fzlopz6qcxckfff/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /data/logs-wavenet/eval_ckpt/ns_pwn-n_MU-WN_DDI_mfinit-n_LOGS-n_CLIP-MAG-L2-06_27; No such file or directory

Any idea?

bfs18 commented 6 years ago

Hi @zhang-jian, I'm sorry about that. The first line of ns_pwn-eval/checkpoint contains the absolute path of the model checkpoint. You can modify the path according to your file system, or you can keep only the basename of that path. Refer to ns_wn-eval/checkpoint for an example.
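If it helps, a small helper like the following (not part of the repository) rewrites the paths in the checkpoint index file, which is plain text with entries of the form model_checkpoint_path: "...":

import os

def make_checkpoint_relative(ckpt_dir):
    # Rewrite every path in the TensorFlow 'checkpoint' index file to its basename.
    ckpt_file = os.path.join(ckpt_dir, 'checkpoint')
    with open(ckpt_file) as f:
        lines = f.readlines()
    fixed = []
    for line in lines:
        if ':' not in line:
            fixed.append(line)
            continue
        key, _, path = line.partition(':')
        base = os.path.basename(path.strip().strip('"'))
        fixed.append('{}: "{}"\n'.format(key, base))
    with open(ckpt_file, 'w') as f:
        f.writelines(fixed)

# e.g. make_checkpoint_relative('ns_pwn-eval')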

mortont commented 6 years ago

I was able to confirm that the pretrained model produces a recognizable voice; all that was needed was changing the path in the checkpoint file to a relative one. Great work!

@bfs18 ClariNet looks very interesting, mostly because of the more numerically stable implementation. I've noticed that parallel wavenet optimization is very difficult and unstable, so hopefully ClariNet helps with that.

bfs18 commented 6 years ago

I got some new examples, trained with contrastive loss and without weight normalization, at step 70948. The result may improve a bit with longer training.

[audio sample: gen_LJ001-0001]

[images: loss curves]

bfs18 commented 6 years ago

I looked deeper into the problem. Some bad configurations (e.g. weight normalization + tanh transposed-conv activation) may cause the activations of the transposed convolution layers to saturate, so the mel condition becomes meaningless and the model degenerates into an unconditional one. The following figures are the histogram and spectrum of the transposed convolution stack output; this model only generates random speech even though the mel condition is used.

[images: activation histogram and spectrum of the saturated model]

In contrast, the following figures come from an OK model.

[images: activation histogram and spectrum of the OK model]

Most of the activation values are close to 0, so the learned representation may be considered sparse. I think there are three possible solutions:

  1. Use activation functions that do not saturate, e.g. leaky_relu as in ClariNet.
  2. When teacher forcing is used, the model can predict the original waveform conditioned entirely on the teacher-forcing input, since that input contains the complete information. So we can use dropout to make the teacher-forcing input incomplete; the model is then forced to use the additional mel condition to predict the original waveform. I'm not sure which layers dropout should be applied to; I am working on this (see the sketch after this list).
  3. Add noise to the teacher-forcing inputs. The previous noise implementation was buggy because I added noise to both the inputs and the targets, so the predicted wave was noisy. I will fix this.
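A minimal sketch of options 2 and 3 (not the repository's code; where to apply these and the rates are assumptions):

import tensorflow as tf

def dropout_teacher_forcing(wave_input, keep_prob=0.9):
    # Option 2: randomly zero out teacher-forcing samples so the model must
    # also rely on the mel condition to reconstruct the waveform.
    return tf.nn.dropout(wave_input, keep_prob=keep_prob)

def noisy_teacher_forcing(wave_input, noise_std=0.01):
    # Option 3: perturb only the teacher-forcing input, never the target.
    noise = tf.random_normal(tf.shape(wave_input), stddev=noise_std)
    return wave_input + noise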

EdisonChen726 commented 6 years ago

Hi, I am running eval_parallel_wavenet.py. After 60k training steps it can generate audio with content, but the sound is very quiet. Is this problem related to the power loss? Besides, the config does not include contrastive loss; how should I set this parameter?

bfs18 commented 6 years ago

Hi @EdisonChen726 I uploaded the model with contrastive loss. You can find the configuration json in the package. https://drive.google.com/open?id=1AtofQdXbSutb-_ZWFeA_I17NR2i8nUC7

EdisonChen726 commented 6 years ago

@bfs18 thank you for the fast reply, I will try it asap

bfs18 commented 6 years ago

Updated ClariNet vocoder results. The ClariNet results have similar noise compared to the pwn results, so I think the noise comes from the power loss term. Compared to the teacher result, the student result does not have clear formants between 1000 Hz and 3000 Hz. This may be the source of the noise in the waves generated by the student.

[spectrograms: teacher (wn1) vs. student (pwn1)]

The priority frequency loss implemented in keithito/tacotron may alleviate the problem.
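Adapted to a spectral loss, that idea could look roughly like this (a sketch only; shapes and hyper-parameters are assumptions, and keithito/tacotron's exact formulation may differ):

import tensorflow as tf

def priority_freq_loss(mag_pred, mag_target, sample_rate=16000,
                       n_freq=1025, priority_hz=3000):
    # L1 loss on magnitude spectrograms shaped [batch, frames, n_freq],
    # with extra weight on the bins below priority_hz.
    l1 = tf.abs(mag_pred - mag_target)
    n_priority = int(priority_hz / (sample_rate * 0.5) * n_freq)
    return 0.5 * tf.reduce_mean(l1) + 0.5 * tf.reduce_mean(l1[:, :, :n_priority])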

EdisonChen726 commented 6 years ago

@bfs18 Hi, have you run into the problem of very quiet audio results? I need to turn the volume up very high to hear the voice. Do you have any idea why this happens? The volume of the teacher model's result is good, but the pwn result is not.

bfs18 commented 6 years ago

Hi @EdisonChen726 Setting use_mu_law=True causes low volume when training parallel wavenet.
It is caused by the clip_quant_scale function at L13 in wavenet/parallelgen.py. I don't know how to solve the problem yet. You can test it with the following code.

import numpy as np
import librosa

def inv_mu_law_numpy(x, mu=255.0):
    # Inverse mu-law expansion: maps integer codes back to values in [-1, 1).
    x = np.array(x).astype(np.float32)
    out = (x + 0.5) * 2. / (mu + 1)
    out = np.sign(out) / mu * ((1 + mu) ** np.abs(out) - 1)
    out = np.where(np.equal(x, 0), x, out)
    return out

def cast_quantize_numpy(x, quant_chann):
    # Linear quantization: maps x in [-1, 1) to integers in [-quant_chann/2, quant_chann/2).
    x_quantized = x * quant_chann / 2
    return x_quantized.astype(np.int32)

# Quantize raw audio and run it through the inverse mu-law expansion;
# the resulting wave is much quieter than the input.
audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_int = cast_quantize_numpy(audio, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)

The volume of the output wave becomes very low.

EdisonChen726 commented 6 years ago

@bfs18 Got it! Thank you so much. Now I have another problem: when I add the contrastive loss with weight 0.3, I get the following error.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4,64,1,7680] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Do you know how to solve it?

bfs18 commented 6 years ago

@EdisonChen726 Just try a smaller batch size.

switchzts commented 6 years ago

@bfs18 Hi, I used your model (wavenet_mol, without pwn) to synthesize speech. The silent parts become a murmur, while the non-silent parts are normal. Do you know why? Is it because of the trimming at training time?

bfs18 commented 6 years ago

Hi @switchzts The use_mu_law + mol waves are much cleaner in the silent parts. However, the no_mu_law + mol waves are just as you say. So 200k steps may not be enough to train a good no_mu_law + mol model. I am not sure whether trimming is a problem.

EdisonChen726 commented 6 years ago

@bfs18 Hi, I tried setting the batch size to 1, but the same error happened.

HallidayReadyOne commented 6 years ago

Hi @bfs18 , I have some questions about initialization.

  1. Why is the scale_params bias initialized to -0.3? Is this an empirical value? And why not use a log scale in the student net? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L243 https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L92

  2. In the readme, you mentioned: "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." But in parallel_wavenet.py, mean_tot and scale_tot are initialized to 0 and 1. Which initializer is modified to achieve the proper initial values for mean_tot and scale_tot (0.0 and 0.05)? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/wavenet/parallel_wavenet.py#L276

Thank you!

bfs18 commented 6 years ago

Hi @EdisonChen726 What's your GPU memory size? I only run the code on GPUs with 12 GB of memory or more.

bfs18 commented 6 years ago

Hi @HallidayReadyOne -- "Why is the scale_params bias initialized to -0.3? Is this an empirical value?" Yes. I wrote some memos on why I chose this value, and why not to use a log scale, in the comments in test_scale. Let me know if you need further explanation.

HallidayReadyOne commented 6 years ago

Hi @bfs18, thanks for the kind reply. I still need some guidance. In test_scale.py, you set the input data to be normally distributed with mean 0.0 and std 1.0 because, after data-dependent initialization for weight normalization, the output of conv/deconv is approximately normally distributed with mean 0.0 and std 1.0? https://github.com/bfs18/nsynth_wavenet/blob/4370294c8c088d3bc9e8b8486e75af9fe7f845cb/tests/test_scale.py#L137 However, you also set use_weight_normalization = False for both wn & pwn. If use_weight_normalization = False, is this assumption still true (that the output of conv/deconv is approximately normally distributed with mean 0.0 and std 1.0)?

bfs18 commented 6 years ago

Hi @HallidayReadyOne You are right, this value was picked when use_weight_norm=True. Since it is chosen empirically, it is not that strict. When setting use_weight_norm=False, the initial scale is still small enough, so I keep this value.

HallidayReadyOne commented 6 years ago

Thanks @bfs18. Another question about initialization: in the readme, you mentioned "Proper initial mean_tot and scale_tot values have positive impact on model convergence and numerical stability. According to the LJSpeech data distribution, proper initial values for mean_tot and scale_tot should be 0.0 and 0.05. I modified the initializer to achieve it." Could you please explain a little about how this is achieved?

EdisonChen726 commented 6 years ago

@bfs18 my gpu memory size is also 12GB

HallidayReadyOne commented 6 years ago

Hi @bfs18 , you mentioned earlier that setting use_mu_law=True would cause low volume when training parallel wavenet (I also encountered the same problem).

However, maybe this is not caused by the clip_quant_scale function at L13 in wavenet/parallelgen.py?

I tested it with the following code:

audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_mu = mu_law_numpy(audio)                                      # added
audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 ** 8 / 2.)  # added
audio_int = cast_quantize_numpy(audio_mu_scaled, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)

Because when use_mu_law is set to true, the output of the student model is scaled mu-law data, the two lines marked above need to be added to test the correctness of the clip_quant_scale function?

The volume of the output does not become lower, which shows that the low volume may be caused by the output of the student model rather than by the post-processing in wavenet/parallelgen.py?

bfs18 commented 6 years ago

@HallidayReadyOne

  1. The output distribution of the student should be close to that of the teacher. So if mu-law encoding is used in the teacher, the student should also predict a mu-law encoded signal.
  2. Since "the output of the student model is scaled mu-law data", you should not apply mu-law encoding once again to that data. So the two added lines are not correct. If you would like to use those two lines, you should make sure that your signal is a raw audio signal in the range [-1, 1) (see the sketch after this list).
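For reference, a mu-law companding function consistent with the inv_mu_law_numpy above might look like this; it is only a sketch, and the repository's actual mu_law implementation may differ in details:

import numpy as np

def mu_law_numpy(x, mu=255.0):
    # Compress raw audio in [-1, 1) and return integer codes in [-128, 128),
    # the counterpart of inv_mu_law_numpy above.
    out = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor(out * (mu + 1) / 2.).astype(np.int32)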

bfs18 commented 6 years ago

Hi @EdisonChen726

  1. When using contrastive loss, it needs more GPU memory to calculate the gradients for the contrastive loss term.
  2. I run the X server on the integrated Intel GPU, so all 12 GB of the NVIDIA card's memory can be used for running TF programs. If you don't have a GPU with more memory, you can try this. I can run the experiment with contrastive loss and batch size 1; the program consumes 9643 MB of GPU memory.

HallidayReadyOne commented 6 years ago

Hi @bfs18 ,

Since "the output of the student model is scaled mu-law data", in the best case, the output of student model equal to " audio, _ = librosa.load('test_data/test.wav', sr=16000) audio_mu = mu_law_numpy(audio) audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 8 / 2.) " So, as long as the student model is well trained, its output will be close to audio_mu_scaled? Then execute the following code, " audio_int = cast_quantize_numpy(audio_mu_scaled, 2 8) audio_ = inv_mu_law_numpy(audio_int) librosa.output.write_wav('test_data/testinv.wav', audio, sr=16000) " the volume of audio_ should be kept?

bfs18 commented 6 years ago

Hi @HallidayReadyOne

audio, _ = librosa.load('test_data/test.wav', sr=16000)
audio_mu = mu_law_numpy(audio)
audio_mu_scaled = np.asarray(audio_mu, np.float32) / (2 ** 8 / 2.)

is the input preprocessing.

audio_int = cast_quantize_numpy(audio_mu_scaled, 2 ** 8)
audio_ = inv_mu_law_numpy(audio_int)
librosa.output.write_wav('test_data/test_inv.wav', audio_, sr=16000)

is the output postprocessing. As far as I can tell, the encoding/decoding logic in the code is correct. The volume should be preserved if everything goes well, so obviously something is wrong.

HallidayReadyOne commented 6 years ago

@bfs18, I totally agree with you. My view was that the inaccurate output of the student model, rather than the post-processing in wavenet/parallelgen.py, is the main cause of the volume drop. I will try to solve it and will keep you informed of my progress. Please let me know if you have any new ideas about this problem. Thanks.

bfs18 commented 6 years ago

A shared deconvolution stack can reconstruct the formants between 1000 and 3000 Hz.

[spectrogram: result with shared deconv stack]

When using separate deconvolution stacks, the tanh activations of the first two deconvolution stacks tend to saturate, and the condition becomes meaningless.

[images: activations of the separate deconv stacks]

A shared deconvolution stack eliminates this problem; at least every IAF stack receives a meaningful conditioning representation.

[image: activations of the shared deconv stack]

I also tried copying the teacher's deconv stack weights into the student and not updating them while training the student. The formants are much clearer than with separate deconv stacks.
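A rough sketch of the shared-stack idea in TF 1.x; the filter sizes, strides, and number of flows below are illustrative, not the repository's actual hyper-parameters:

import tensorflow as tf

def upsample_cond(mel, reuse):
    # One transposed-conv stack that upsamples the mel condition in time.
    with tf.variable_scope('shared_deconv', reuse=reuse):
        h = tf.layers.conv2d_transpose(
            mel, filters=64, kernel_size=(1, 32), strides=(1, 16),
            padding='same', activation=tf.nn.leaky_relu, name='deconv1')
        h = tf.layers.conv2d_transpose(
            h, filters=64, kernel_size=(1, 32), strides=(1, 20),
            padding='same', activation=tf.nn.leaky_relu, name='deconv2')
    return h

mel = tf.placeholder(tf.float32, [None, 1, 100, 80])  # [batch, 1, frames, n_mels]
n_flows = 4
# Every IAF flow reads the same deconv weights instead of building its own stack.
conds = [upsample_cond(mel, reuse=(i > 0)) for i in range(n_flows)]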

switchzts commented 6 years ago

"A shared deconvolution stack can reconstruct the formants between 1000 and 3000 Hz."

Why is my output meaningless vocalization instead of a sentence? Do you have any idea about it?

switchzts commented 6 years ago

@bfs18 Should I change "use_weight_norm": true to false?

bfs18 commented 6 years ago

@switchzts Hi, set it to False and use leaky_relu as the act_func; then the net will be easier to train. I will update the default config json to one that is stabler and easier to train.

WendongGan commented 6 years ago

@bfs18 Thanks very much for sharing! After you committed your latest code (commit 3fa872b) on Oct 22, I tried to reproduce it. I trained the teacher model, and the result is better! But there are a few problems. I have trained for 608k steps with batch size 4; the loss is as follows:

[images: loss curves]

The result is as follows:

[image: spectrogram of the generated audio]

My audio: my-result-and-ori.zip

Where the mel spectrum is very dense, there's some noise. Have you ever had a problem like this? Do you have any idea? Will this affect the student network? Looking forward to your reply and help! Thank you! I also hope other experienced friends can join the discussion. @switchzts