Open missxa opened 7 years ago
I have a similar problem here, but the missing layer is different.
Traceback (most recent call last):
  File "recognize.py", line 103, in <module>
    saver.restore(sess, tf.train.latest_checkpoint('asset/train/ckpt'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1439, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Tensor name "lyr-aconv1d_20/W" not found in checkpoint files asset/train/ckpt/model-020-45480
  [[Node: save/RestoreV2_62 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_62/tensor_names, save/RestoreV2_62/shape_and_slices)]]
  [[Node: save/RestoreV2_157/_211 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_961_save/RestoreV2_157", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/RestoreV2_62', defined at:
  File "recognize.py", line 102, in <module>
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1051, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2392, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

NotFoundError(see above for traceback): Tensor name "lyr-aconv1d_20/W" not found in checkpoint files asset/train/ckpt/model-020-45480
  [[Node: save/RestoreV2_62 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_62/tensor_names, save/RestoreV2_62/shape_and_slices)]]
  [[Node: save/RestoreV2_157/_211 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_961_save/RestoreV2_157", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Hi, @missxa
Did you check your GPU RAM while running the code?
I found that initializing the model needs more than 4G of GPU RAM.
I guess the error is caused by the model being loaded incompletely.
Could you run nvidia-smi -l 1 in another terminal and see what's happening?
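If you would rather log the numbers than watch the terminal, here is a minimal sketch that parses the CSV output of nvidia-smi's query mode. The function names and the sample values are made up for illustration; only the more-than-4G observation comes from this thread.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse lines of 'used, total' (in MiB) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (needs an NVIDIA driver installed)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_memory(out.decode("utf-8"))


# Hypothetical sample output for a single 4 GB card that is nearly full.
sample = "3912, 4096\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample)):
    print("GPU %d: %d / %d MiB used" % (index, used, total))
```

Calling query_gpu_memory() in a loop while recognize.py starts up would show whether usage hits the card's limit during model initialization.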
Having the same problem, running with tf 0.12.1.
I'm completely confused by the tf and sugartensor versions, because tf is updated so fast and Google changes library function names without keeping compatibility with past versions.
I'll make a Docker image that includes the VCTK corpus and pre-trained weights and share it.
I ran into this issue as well (though I am running on a GPU with 4GB of RAM). I found that by re-running the training myself under my setup, I was able to produce a training checkpoint that I could use to run recognize.py successfully. I reduced the batch size to 4 as suggested, and after 20 epochs training terminated with a loss of 8.72. Running tensorflow 0.12.1 and sugartensor 0.0.2.3.
I've uploaded the resulting checkpoint to Figshare in case it's usable to anyone else, as training can take quite a long time on a less-powerful GPU: https://figshare.com/articles/speech-to-text-wavenet_VCTK_training_checkpoint/4555483
With it, I get the following for asset/data/wav48/p225/p225_003.wav:
six spoons of fresh snow peas five thick slabs of blue ceese and maybe a snack for he brother bob
@ryanfb Thanks for your nice work.
@ryanfb Hey, I downloaded your weights and had the same problem. All of my variables seem to have the same names, but are missing the "lyr-" prefix. Do you know how I can fix this?
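One way to confirm that a mismatch is only a naming prefix is to diff the variable names the graph expects against the names stored in the checkpoint. The sketch below is plain Python and assumes you have already collected both name lists; I believe tf.train.NewCheckpointReader(path).get_variable_to_shape_map() can list the checkpoint side, but check that against your TensorFlow version. The helper names and the "other/W" variable are hypothetical; "lyr-aconv1d_20/W" is taken from the traceback above.

```python
def strip_prefix(name, prefix):
    """Return name without the leading prefix, or unchanged if it is absent."""
    return name[len(prefix):] if name.startswith(prefix) else name


def compare_names(graph_names, checkpoint_names, prefix="lyr-"):
    """Return (missing, prefix_only).

    missing:     graph variables with no identically named checkpoint entry,
                 i.e. the ones a restore would fail on.
    prefix_only: the subset of missing that would match if the prefix were
                 ignored on both sides.
    """
    graph = set(graph_names)
    checkpoint = set(checkpoint_names)
    missing = sorted(graph - checkpoint)
    checkpoint_stripped = set(strip_prefix(n, prefix) for n in checkpoint)
    prefix_only = sorted(
        n for n in missing if strip_prefix(n, prefix) in checkpoint_stripped
    )
    return missing, prefix_only


# Illustrative input: one name from the traceback, one hypothetical name.
graph_side = ["lyr-aconv1d_20/W", "other/W"]
checkpoint_side = ["aconv1d_20/W"]
missing, prefix_only = compare_names(graph_side, checkpoint_side)
print("missing from checkpoint:", missing)
print("differ only by prefix:  ", prefix_only)
```

If every missing name lands in prefix_only, the checkpoint and the code were most likely produced by library versions that name layers differently, rather than by a truncated or corrupt checkpoint.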
@jmiller656 What's the output when you run pip freeze | grep tensor? If it doesn't match the versions I used to make my training checkpoint, that may be the problem. If it does match the versions I used, then I'm not sure what's causing this…
Here's my output:
sugartensor==0.0.2.4
tensorflow-gpu==0.12.1
@jmiller656 It worked for me when I downgraded sugartensor from 0.0.2.4 to 0.0.2.3.
sugartensor==0.0.2.3
tensorflow==0.12.1
Thanks @ryanfb for this new model.
Cool, downgrading seemed to work. Thanks!
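To catch this kind of mismatch before loading a checkpoint, the pip freeze output can be compared against the versions the checkpoint was trained with. This is a minimal sketch: the pins are the ones reported in this thread for ryanfb's checkpoint, the function names are invented, and treating tensorflow-gpu as interchangeable with tensorflow is an assumption based on the downgrade having worked above.

```python
# Versions reported in this thread for the shared checkpoint.
EXPECTED = {"sugartensor": "0.0.2.3", "tensorflow": "0.12.1"}


def parse_freeze(text):
    """Turn 'name==version' tokens (space or newline separated) into a dict."""
    pins = {}
    for token in text.split():
        if "==" in token:
            name, version = token.split("==", 1)
            pins[name.lower()] = version
    return pins


def check_versions(freeze_text, expected=EXPECTED):
    """Return a list of (package, wanted, found) for every mismatch."""
    pins = parse_freeze(freeze_text)
    problems = []
    for name, wanted in sorted(expected.items()):
        # Assumption: the -gpu build counts as the same package.
        found = pins.get(name) or pins.get(name + "-gpu")
        if found != wanted:
            problems.append((name, wanted, found))
    return problems


# The output posted earlier in the thread.
for name, wanted, found in check_versions(
        "sugartensor==0.0.2.4 tensorflow-gpu==0.12.1"):
    print("%s: want %s, found %s" % (name, wanted, found))
```

Feeding it the real output of pip freeze | grep tensor gives an empty list when the environment matches.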
@ryanfb Is the transcription you mentioned for asset/data/wav48/p225/p225_003.wav from the training dataset? Did you do any train / test split? I would like to get the WER for the test dataset, but I don't know how the split was made.
@giovannirescia My checkpoint was built on an earlier commit of the code which didn't seem to use held-out validation/test sets for evaluating after training on VCTK (I used that training wav for my example since it was also the example in the earlier README). The current version uses different corpora for validation/test. You're probably better off just using the latest version so you can easily pass those corpora in to test.py for validation/test.
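For reference, WER itself is simple to compute once you have reference and hypothesis transcripts: it is the word-level edit distance divided by the number of reference words. Below is a minimal sketch, not the project's own evaluation code. The hypothesis is the transcription posted above; the reference sentence is my assumption of the prompt text with punctuation removed, so verify it against the corpus transcript before relying on the number.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the ref words seen so far into hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1] / float(len(ref))


# Assumed reference text (not taken from this thread).
reference = ("six spoons of fresh snow peas five thick slabs of blue cheese "
             "and maybe a snack for her brother bob")
# Transcription posted earlier in the thread.
hypothesis = ("six spoons of fresh snow peas five thick slabs of blue ceese "
              "and maybe a snack for he brother bob")
print("WER: %.2f" % word_error_rate(reference, hypothesis))
```

Under that assumed reference there are two substituted words out of twenty. Since this file was part of the training data, the figure says nothing about held-out performance.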
I'm trying to use the pre-trained model provided in the readme. When I run recognise.py it throws the following error. I'm using tensorflow-0.12.1. Any help will be much appreciated.