buriburisuri / speech-to-text-wavenet

Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow
Apache License 2.0

Missing tensor in pre-trained model #19

Open missxa opened 7 years ago

missxa commented 7 years ago

I'm trying to use the pre-trained model provided in the README. When I run recognize.py, it throws the following error:

Traceback (most recent call last):
  File "recognize.py", line 103, in <module>
    saver.restore(sess, tf.train.latest_checkpoint('asset/train/ckpt'))
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Tensor name "lyr-conv1d_5/mean" not found in checkpoint files asset/train/ckpt/model-020-45480
     [[Node: save/RestoreV2_217 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_217/tensor_names, save/RestoreV2_217/shape_and_slices)]]

Caused by op u'save/RestoreV2_217', defined at:
  File "recognize.py", line 102, in <module>
    saver = tf.train.Saver()
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Library/Python/2.7/lib/python/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Tensor name "lyr-conv1d_5/mean" not found in checkpoint files asset/train/ckpt/model-020-45480
     [[Node: save/RestoreV2_217 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_217/tensor_names, save/RestoreV2_217/shape_and_slices)]]

I'm using tensorflow-0.12.1. Any help would be much appreciated.
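
In case it helps with debugging, here is a minimal sketch (TF 0.12 APIs only; the checkpoint path is the one from the traceback) that prints the tensor names stored in the checkpoint next to the variable names in the graph, to see which side is missing the lyr- prefix:

import tensorflow as tf

# List what the checkpoint file actually contains.
ckpt_path = tf.train.latest_checkpoint('asset/train/ckpt')
reader = tf.train.NewCheckpointReader(ckpt_path)
for name in sorted(reader.get_variable_to_shape_map()):
    print('checkpoint:', name)

# List what the graph expects. Run this after the model-building part of
# recognize.py, i.e. just before saver.restore.
for v in tf.global_variables():
    print('graph:     ', v.op.name)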

a00achild1 commented 7 years ago

I have a similar problem here, but the missing layer is different:

Traceback (most recent call last):
  File "recognize.py", line 103, in <module>
    saver.restore(sess, tf.train.latest_checkpoint('asset/train/ckpt'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1439, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "lyr-aconv1d_20/W" not found in checkpoint files asset/train/ckpt/model-020-45480
     [[Node: save/RestoreV2_62 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_62/tensor_names, save/RestoreV2_62/shape_and_slices)]]
     [[Node: save/RestoreV2_157/_211 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_961_save/RestoreV2_157", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/RestoreV2_62', defined at:
  File "recognize.py", line 102, in <module>
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1051, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2392, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Tensor name "lyr-aconv1d_20/W" not found in checkpoint files asset/train/ckpt/model-020-45480
     [[Node: save/RestoreV2_62 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_62/tensor_names, save/RestoreV2_62/shape_and_slices)]]
     [[Node: save/RestoreV2_157/_211 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_961_save/RestoreV2_157", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

a00achild1 commented 7 years ago

Hi @missxa, did you check your GPU RAM while running the code? I found that the GPU needs more than 4 GB of RAM while initializing the model. I guess the error is caused by the model being loaded incompletely. Could you run nvidia-smi -l 1 in another terminal and see what's happening?
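
If memory does turn out to be the problem, one thing to try (just a sketch; it assumes you edit the tf.Session() call in recognize.py) is letting TensorFlow grow its GPU allocation on demand instead of reserving the whole card up front:

import tensorflow as tf

# Sketch: allocate GPU memory on demand rather than grabbing everything at
# start-up, which also makes genuine out-of-memory failures easier to spot.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)  # use in place of the plain tf.Session()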

Isaac-1010 commented 7 years ago

Having the same problem, running with tf 0.12.1.

buriburisuri commented 7 years ago

I'm completely confused by the tf and sugartensor versions, because tf updates so fast and Google changes library function names without backward compatibility.

I'll make a Docker image that includes the VCTK corpus and pre-trained weights and share it.

ryanfb commented 7 years ago

I ran into this issue as well (though I am running on a GPU with 4 GB of RAM). I found that by re-running the training myself under my setup, I was able to produce a training checkpoint that I could use to run recognize.py successfully. I reduced the batch size to 4 as suggested, and after 20 epochs training terminated with a loss of 8.72. I'm running tensorflow 0.12.1 and sugartensor 0.0.2.3.

I've uploaded the resulting checkpoint to Figshare in case it's useful to anyone else, as training can take quite a long time on a less powerful GPU: https://figshare.com/articles/speech-to-text-wavenet_VCTK_training_checkpoint/4555483
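
A quick sanity check after unpacking (a sketch; it assumes the files end up directly under asset/train/ckpt/, the directory recognize.py reads from):

import tensorflow as tf

# Should print a path like asset/train/ckpt/model-...; None means the
# 'checkpoint' index file or the model files are not where recognize.py looks.
print(tf.train.latest_checkpoint('asset/train/ckpt'))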

With it, I get the following for asset/data/wav48/p225/p225_003.wav:

six spoons of fresh snow peas five thick slabs of blue ceese and maybe a snack for he brother bob

buriburisuri commented 7 years ago

@ryanfb Thanks for your nice work.

jmiller656 commented 7 years ago

@ryanfb Hey, I downloaded your weights and had the same problem. All of my variables seem to have the same names, but are missing the "lyr-" prefix. Do you know how I can fix this?

ryanfb commented 7 years ago

@jmiller656 What's the output when you run pip freeze | grep tensor? If it doesn't match the versions I used to make my training checkpoint, that may be the problem. If it does match the versions I used, then I'm not sure what's causing this…
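
If pinning the versions isn't an option, a remapped Saver might also work around the prefix mismatch. This is only a sketch, not something from this repo, and it assumes the checkpoint names differ from the graph names only by the lyr- prefix:

import tensorflow as tf

# Map each graph variable to the name it would have in the checkpoint,
# here by stripping the "lyr-" prefix. If the prefix is on the other side,
# add it instead of removing it. Shapes still have to match exactly.
var_map = {}
for v in tf.global_variables():
    ckpt_name = v.op.name.replace('lyr-', '', 1)
    var_map[ckpt_name] = v

saver = tf.train.Saver(var_map)
# saver.restore(sess, tf.train.latest_checkpoint('asset/train/ckpt'))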

jmiller656 commented 7 years ago

Here's my output:

sugartensor==0.0.2.4
tensorflow-gpu==0.12.1

fazalWahid56 commented 7 years ago

@jmiller656 It worked for me when I downgraded sugartensor from 0.0.2.4 to 0.0.2.3:

sugartensor==0.0.2.3
tensorflow==0.12.1

Thanks @ryanfb for this new model.

jmiller656 commented 7 years ago

Cool, downgrading seemed to work. Thanks!

giovannirescia commented 7 years ago

@ryanfb Is the transcription you mentioned for asset/data/wav48/p225/p225_003.wav from the training dataset? Did you do any train/test split? I would like to compute the WER on the test dataset, but I don't know how the split was made.

ryanfb commented 7 years ago

@giovannirescia My checkpoint was built on an earlier commit of the code, which didn't seem to use held-out validation/test sets for evaluation after training on VCTK (I used that training wav for my example since it was also the example in the earlier README). The current version uses different corpora for validation/test. You're probably better off using the latest version so you can easily pass those corpora to test.py for validation/testing.