NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Restoring from checkpoint failed - w2lplus_large_8gpus_mp #324

Closed GabrielLin closed 5 years ago

GabrielLin commented 5 years ago

The pre-trained model for w2lplus_large_8gpus_mp cannot be restored. When restoring, the following error is shown:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key ForwardPass/w2l_encoder/conv12/bn/beta not found in checkpoint [[node save/RestoreV2 (defined at /home2/nlp/NVIDIA-OpenSeq2Seq/open_seq2seq/utils/funcs.py:198) = RestoreV2[dtypes=[DT_HALF, DT_HALF, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_HALF], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
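The NotFoundError means the graph built from the config expects a variable name that the checkpoint file does not contain. A toy sketch of that mismatch in plain Python (only the `conv12/bn/beta` name comes from the error above; the other variable names are made up for illustration):

```python
# Restore fails when the graph expects variables the checkpoint lacks.
graph_vars = {
    "ForwardPass/w2l_encoder/conv11/kernel",      # hypothetical name
    "ForwardPass/w2l_encoder/conv12/bn/beta",     # only in the w2l_plus graph
}
checkpoint_vars = {
    "ForwardPass/w2l_encoder/conv11/kernel",      # hypothetical name
}

# Any graph variable missing from the checkpoint triggers NotFoundError.
missing = sorted(graph_vars - checkpoint_vars)
print(missing)  # → ['ForwardPass/w2l_encoder/conv12/bn/beta']
```

In TensorFlow, the checkpoint side of this set can be listed with `tf.train.list_variables(checkpoint_path)` to see which keys the file actually stores.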

The same issue also appears with a w2lplus_large_8gpus model that I trained myself.

Restoring the jasper_10x5_8gpus_mp model works fine.

By the way, Merry Christmas!

vsl9 commented 5 years ago

Can you please let us know what you are trying to do (restore the model for inference/evaluation, continue training, or transfer learning)? I tried to load the pre-trained checkpoint to evaluate it on LibriSpeech dev-clean. It works fine.

Merry Christmas and Happy New Year!

GabrielLin commented 5 years ago

Hi @vsl9, Happy New Year! I would like to run evaluation.

Are you using the latest repo?

I downloaded the latest repo: https://github.com/NVIDIA/OpenSeq2Seq/tree/1d46bfe47a0a7c3cf8256b6b0735440f76ab2a87

And used the w2l_plus_large.tar.gz I downloaded one month ago. Its MD5 is 8D50C5D5D87ECEC122C31ACE47CF8E9C.
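As a side note, the archive's MD5 can be recomputed with Python's standard hashlib to rule out a corrupted download — a minimal sketch (the filename and expected digest are taken from this comment):

```python
import hashlib

def md5_of(path, chunk=1 << 20):
    """Compute the MD5 of a file by streaming it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest().upper()

# Compare against the digest quoted in this thread:
# assert md5_of("w2l_plus_large.tar.gz") == "8D50C5D5D87ECEC122C31ACE47CF8E9C"
```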

I ran the following command:

python run.py --config_file=example_configs/speech2text/w2l_large_8gpus_mp.py --mode=eval --decoder_params/use_language_model=False --use_horovod=False --num_gpus=4

The same error was shown. Thanks.

blisc commented 5 years ago

Please use the w2lplus_large_8gpus_mp.py example config, not w2l_large_8gpus_mp.py.

Additionally, I recommend using a batch size of 1 when doing eval.
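With that correction, the invocation from earlier in the thread would become something like the following (this is just the earlier command with the config name swapped, not a line verified by the maintainers; the `--batch_size_per_gpu=1` override assumes top-level config keys can be overridden from the command line the same way the nested `--decoder_params/use_language_model` flag is above):

```shell
python run.py \
  --config_file=example_configs/speech2text/w2lplus_large_8gpus_mp.py \
  --mode=eval \
  --decoder_params/use_language_model=False \
  --use_horovod=False \
  --num_gpus=4 \
  --batch_size_per_gpu=1
```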

GabrielLin commented 5 years ago

@blisc Many thanks. How stupid I am.