barronalex / Tacotron

Implementation of Google's Tacotron in TensorFlow

different results output during training compared to test.py #15

Open mrelich opened 7 years ago

mrelich commented 7 years ago

I'm trying to reproduce some of the results I obtained during training by using the test.py script. Continuing to dig into this, but wondering if anyone else has come across the same issue?

mrelich commented 7 years ago

So I think I've tracked it down to the helper function (decoder_helper) in the decoder part of tacotron, but I'm still a bit of a novice so I don't really understand how to fix it. I've isolated it by running test.py with the training data and selectively turning off the specific areas that use the train bool. Going to keep digging, but I think this will need to be fixed for everyone who wants to actually use the models they train 😄

onyedikilo commented 7 years ago

I opened an issue about the same problem and was told that in evaluation, unlike training, the output of the decoder is fed back in as the decoder input at the next time step, so the results can differ. I closed the issue after that.

mrelich commented 7 years ago

My concern is that the InferenceHelper might not be doing what we think... Maybe the author can explain why the EmbeddingHelper (either greedy or sampling) wasn't used?

Essentially I'm struggling to understand how it can go from sounding very good during training to unintelligible during inference...

Edit: Of course it's not using an embedding layer... it's a float 😝

mrelich commented 7 years ago

@onyedikilo Ok, now I see your closed issue and how you ran into the same problem. This big a drop in audio quality just feels wrong. If you (or anyone else) looking into this find anything, I would be super interested to know.

jpdz commented 7 years ago

Hi, I ran into the same problem. Does anyone have any suggestions? Thanks a lot.

mrelich commented 7 years ago

Hi @jpdz, I have stepped away from this for the moment, but will return in a few weeks. I don't yet fully understand what the CustomHelper used in the decoder actually does. The paper says it passes the information at timestep t to timestep t+1, but to me it looks like at timestep t+1 it can see the entire input... I'm probably missing something, though.

One idea I had is to use the standard decoder, which has an embedding layer. One could multiply the mel spectrogram coefficients by a large number and then cast them to ints. This would allow the use of an embedding layer, which is implemented in TensorFlow and has some documentation. I will report back after trying this, but it will likely take me a month to get to it.
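Roughly, the quantization would look like this (a sketch only; the scale factor and clipping range are made-up values, and I haven't tried it):

```python
import numpy as np

# Quantize continuous mel coefficients into integer bins so that a
# standard embedding-based decoder could, in principle, be used.
SCALE = 100                      # made-up quantization factor
mel_frame = np.random.uniform(-4.0, 4.0, size=80).astype(np.float32)

ids = np.round(mel_frame * SCALE).astype(np.int64)
ids = np.clip(ids, -4 * SCALE, 4 * SCALE) + 4 * SCALE  # shift to non-negative indices
vocab_size = 8 * SCALE + 1       # number of distinct embedding entries needed
```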

barronalex commented 7 years ago

Hey, very sorry I'm only just getting back to you all on this.

Although I agree it's annoying, it does make sense that there's a big drop-off in quality, even on the same prompt, when running train.py vs test.py.

As in the original paper, the repo does not use scheduled sampling (although it is a configurable parameter in tacotron.py).

This means that in training, the decoder is given the ground-truth input at every time step. When we test, we don't have access to the ground truth, so the next input at each time step is the output of the previous time step. This is much noisier, since we are unlikely to have synthesized the previous time step perfectly.
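Schematically, the two feeding regimes look like this (illustrative pseudo-code, not the repo's decoder; step_fn stands in for one decoder step):

```python
def decode(step_fn, go_frame, num_steps, ground_truth=None):
    """Run the decoder loop; step_fn maps the previous frame to a prediction."""
    prev, outputs = go_frame, []
    for t in range(num_steps):
        out = step_fn(prev)
        outputs.append(out)
        # Training (teacher forcing): next input is the ground-truth frame.
        # Inference: next input is our own, noisier, prediction.
        prev = ground_truth[t] if ground_truth is not None else out
    return outputs
```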

I highly encourage you to try changing the scheduled sampling parameter and see if it improves performance, particularly on smaller datasets. I ran a few experiments with it but didn't have the compute to explore it properly. With scheduled sampling probability 1, the output is fed to the input at every time step in both training and testing, so the above problem should go away. The downside is that training will be more difficult, so you may want to reduce the dropout value concurrently.
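For reference, TF 1.x's contrib seq2seq API ships a helper for exactly this regime with continuous outputs. A minimal sketch of wiring it up (the shapes are assumptions, and this is not necessarily how tacotron.py implements its parameter):

```python
import tensorflow as tf  # TF 1.x contrib API, matching the repo's era

mel_dim = 80  # assumed number of mel filters per decoder step
decoder_inputs = tf.placeholder(tf.float32, [None, None, mel_dim])
lengths = tf.placeholder(tf.int32, [None])

# sampling_probability is the chance of feeding back the model's own
# previous output instead of the ground truth at each decoder step.
helper = tf.contrib.seq2seq.ScheduledOutputTrainingHelper(
    inputs=decoder_inputs,
    sequence_length=lengths,
    sampling_probability=0.5)
```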

I wrote InferenceHelper because the TensorFlow seq2seq API is geared towards NLP and so only provides inference helpers that sample from an embedding -- they pick the most likely next discrete word. Here the decoder outputs continuous values (the mel filters), so we directly pass the previous output into the next input. That's all the InferenceHelper class does in next_inputs_fn.
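A stripped-down sketch of that behaviour built on tf.contrib.seq2seq.CustomHelper (the batch size, frame size, and fixed-length stopping rule are assumptions; the repo's actual InferenceHelper differs in detail):

```python
import tensorflow as tf  # TF 1.x contrib API

batch_size, mel_dim, max_steps = 32, 80, 200  # assumed values

def initialize_fn():
    # Start decoding from an all-zero <GO> frame.
    return tf.fill([batch_size], False), tf.zeros([batch_size, mel_dim])

def sample_fn(time, outputs, state):
    # Continuous outputs: there is nothing discrete to sample.
    return tf.zeros([batch_size], dtype=tf.int32)

def next_inputs_fn(time, outputs, state, sample_ids):
    # The key line: the previous output becomes the next input as-is.
    finished = tf.fill([batch_size], time >= max_steps)
    return finished, outputs, state

helper = tf.contrib.seq2seq.CustomHelper(initialize_fn, sample_fn, next_inputs_fn)
```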

Each time step does have access to the whole sequence, but that logic is handled in the attention mechanism (AttentionWrapper).
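A bare-bones illustration of where that access lives (the cell and attention sizes are assumed values, not the repo's exact configuration):

```python
import tensorflow as tf  # TF 1.x contrib API

encoder_outputs = tf.placeholder(tf.float32, [None, None, 256])

# At every decoder step the AttentionWrapper lets the cell attend over the
# whole encoder output sequence; this is where "seeing the entire input" lives.
mechanism = tf.contrib.seq2seq.BahdanauAttention(
    num_units=256, memory=encoder_outputs)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(
    tf.contrib.rnn.GRUCell(256), mechanism)
```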

I'll put this in the README this week, but together with the above, the best way to tell whether your model is generalizing is to look for monotonicity in the attention plots. You can see these in TensorBoard under the images tab.

jpdz commented 7 years ago

@barronalex Hi, I tried changing the scheduled sampling parameter to 0.5 and trained for three days on one GPU; however, the results are not good. BTW, I am a little confused about the sampling parameter: what's the difference between 0.5 and 1? Thanks a lot.

barronalex commented 7 years ago

Which dataset are you training on? I just uploaded some weights trained on Nancy with r=2 and scheduled sampling 0.5, which might be a good starting point.

With scheduled sampling 0.5, we use the ground truth as the next decoder input half the time and the previous output the other half. With scheduled sampling 1, we always use the previous output and never the ground truth. This means you should get the same results for training and testing on the same input with scheduled sampling 1, but the model will be harder to train.
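In loop form, the per-step choice looks roughly like this (illustrative only, not the repo's code):

```python
import random

def next_decoder_input(ground_truth_frame, predicted_frame, sampling_prob=0.5):
    # sampling_prob=0.5: feed back the model's own prediction half the time.
    # sampling_prob=1.0: always feed back the prediction, never the truth,
    # so training matches what happens at test time.
    if random.random() < sampling_prob:
        return predicted_frame
    return ground_truth_frame
```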

jpdz commented 7 years ago

@barronalex I also used the Nancy dataset, with your previous code and scheduled sampling 0.5. It has been training for two weeks and still hasn't converged. Did you get some nice results? Thanks a lot!

barronalex commented 7 years ago

So on the training set it still sounds poor and there's no alignment?

I ended up getting better results with r=2 rather than r=5, so maybe try that, or just pull the repo, restore my weights, and continue training?

The alignment with the weights I posted is quite good but it could use more training to remove some of the noise.

The samples have been updated too so you can get a sense of their quality from that.

jpdz commented 7 years ago

@barronalex Thank you so much! I will have a look at it!

mrelich commented 7 years ago

Hi @barronalex, the audio clips do indeed sound much better. Are these from inference or from during training?

I look forward to getting back to this in a few more weeks after wrapping up some other projects. Thanks again for the extra work and for uploading your examples.

barronalex commented 7 years ago

No worries at all! Sorry it's been a while.

Those clips are from inference on unseen examples (mostly taken from Arctic and the paper examples). It sounds much better during training.

jpdz commented 7 years ago

@barronalex I listened to your updated results; they're good, I think. I am now using your model and have begun to continue training it. However, when I run test.py with your weights, I hit this problem:

```
Traceback (most recent call last):
  File "test.py", line 89, in <module>
    test(model, config, prompts)
  File "test.py", line 31, in test
    model = model(config, batch_inputs, train=False)
  File "/disk5/tacotron2/models/tacotron.py", line 190, in __init__
    self.seq2seq_output, self.output = self.inference(inputs, train)
  File "/disk5/tacotron2/models/tacotron.py", line 131, in inference
    encoded = ops.CBHG(pre_out, speaker_embed, K=16, c=[128,128,128], gru_units=128)
  File "/disk5/tacotron2/models/ops.py", line 60, in CBHG
    ) for k in range(1, K+1)]
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/convolutional.py", line 376, in conv1d
    return layer.apply(inputs)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 492, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 428, in __call__
    self._assert_input_compatibility(inputs)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 540, in _assert_input_compatibility
    str(x.get_shape().as_list()))
ValueError: Input 0 of layer conv1d_1 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 128]
```

Any suggestions? Thanks a lot~

barronalex commented 7 years ago

Which version of TensorFlow are you running?

jpdz commented 7 years ago

I have solved that problem: when I ran test.py, I had merged the two commands into one. However, I still have a problem when continuing to train the model from your weights. It fails to load your pretrained model:

```
Caused by op u'save/Assign_13', defined at:
  File "train.py", line 126, in <module>
    train(model, config)
  File "train.py", line 47, in train
    saver = tf.train.Saver(max_to_keep=3, keep_checkpoint_every_n_hours=3)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 419, in _AddRestoreOps
    assign_ops.append(saveable.restore(tensors, shapes))
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 155, in restore
    self.op.get_shape().is_fully_defined())
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 271, in assign
    validate_shape=validate_shape)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
    use_locking=use_locking, name=name)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/disk5/vir/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [400] rhs shape= [160]
	 [[Node: save/Assign_13 = Assign[T=DT_FLOAT, _class=["loc:@decoder/decoder/attention_wrapper/output_projection_wrapper/bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/decoder/attention_wrapper/output_projection_wrapper/bias/Adam_1, save/RestoreV2_13/_233)]]
```

Secondly, I can successfully retrain the model from global step 0, but when it comes time to save samples during training, a problem occurs:

```
saving weights
saving sample
Traceback (most recent call last):
  File "train.py", line 126, in <module>
    train(model, config)
  File "train.py", line 94, in train
    ideal = audio.invert_spectrogram(inputs['stft'][0]*stft_std + stft_mean)
  File "/disk5/tacotron2/audio.py", line 68, in invert_spectrogram
    spec = reshape_frames(spec, forward=False)
  File "/disk5/tacotron2/audio.py", line 30, in reshape_frames
    signal = np.reshape(signal, (-1, int(signal.shape[1]/r)))
  File "/disk5/vir/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/disk5/vir/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 348500 into shape (2562)
```

Did I make a mistake somewhere? Thanks a lot.

barronalex commented 7 years ago

It seems like you might have spectrograms saved with r=5. It should work if you rerun 'preprocess.py nancy' with r=2 (which is now the default in audio.py) and then try training again.
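For intuition, here's roughly why frames grouped with one r can't be un-grouped with another (the numbers mirror the error above, but treat this as an approximation of audio.py's reshape_frames):

```python
import numpy as np

n_bins = 1025                       # assumed STFT bins per frame
saved = np.zeros((68, n_bins * 5))  # frames grouped and saved with r=5

r = 2                               # the new default in audio.py
width = int(saved.shape[1] / r)     # 2562 -- not a whole frame width
# 68 * 5125 = 348500 is not divisible by 2562, so this raises
# "ValueError: cannot reshape array of size 348500 into shape (2562)".
np.reshape(saved, (-1, width))
```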

It's not the best design that you currently have to rerun it; I'll try to fix that soon.

jpdz commented 7 years ago

@barronalex That's the problem, thanks a lot!