GeorgeSeif / Semantic-Segmentation-Suite

Semantic Segmentation Suite in TensorFlow. Implement, train, and test new Semantic Segmentation models easily!
2.51k stars, 880 forks

Fail to load model checkpoint weights when using other frontends than ResNet101. #148

Open akhaloo opened 5 years ago

akhaloo commented 5 years ago

I trained the GCN model with the ResNet152 frontend. However, when I tried to test the model, I got the following error. I had the same issue with MobileNetV2.

Loading model checkpoint weights ...
2018-11-19 11:19:02.259938: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key resnet_v2_101/block1/unit_1/bottleneck_v2/conv1/BatchNorm/beta not found in checkpoint
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1292, in _do_call
    return fn(*args)
  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1277, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1367, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key resnet_v2_101/block1/unit_1/bottleneck_v2/conv1/BatchNorm/beta not found in checkpoint
    [[{{node save_1/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]
    [[{{node save_1/RestoreV2/_307}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_312_save_1/RestoreV2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
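
The NotFoundError above means the graph being restored contains resnet_v2_101/... variables that the checkpoint does not have. One way to check which frontend a checkpoint was actually trained with is to list its variable names and look at their top-level scopes. The following is a minimal sketch, not code from this repository: top_level_scopes and checkpoint_scopes are hypothetical helper names, the example names are illustrative rather than read from a real checkpoint, and the real names would come from tf.train.list_variables(checkpoint_prefix), which returns (name, shape) pairs.

```python
from collections import Counter


def top_level_scopes(var_names):
    """Count variables under each top-level scope (the text before the first '/')."""
    return Counter(name.split("/", 1)[0] for name in var_names)


def checkpoint_scopes(checkpoint_prefix):
    """Same count, but reading the names from a checkpoint on disk (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so top_level_scopes works without TF

    names = [name for name, _shape in tf.train.list_variables(checkpoint_prefix)]
    return top_level_scopes(names)


# Illustrative variable names only; a real run would call checkpoint_scopes(...).
example = [
    "resnet_v2_152/block1/unit_1/bottleneck_v2/conv1/BatchNorm/beta",
    "resnet_v2_152/block1/unit_1/bottleneck_v2/conv1/weights",
    "logits/weights",
]
print(top_level_scopes(example))
```

If the scopes in the checkpoint name one frontend while the error message asks for another, the test or predict script is probably building the graph with a different frontend than the one used for training.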

GeorgeSeif commented 5 years ago

Hi @akhaloo Are you using the latest code?

akhaloo commented 5 years ago

Hi George, yes, I am using the latest one. However, I have this problem whenever I use a frontend other than ResNet101.


kirgal commented 5 years ago

I had the same problem today with DeepLabV3+ and ResNet152. The problem is in predict.py: the frontend is not passed to model_builder.build_model(), so ResNet101 is used as the default. Here is how I fixed it in predict.py, and it now works well:

network, _ = model_builder.build_model(args.model, net_input=net_input,
                                        frontend="ResNet152",
                                        num_classes=num_classes,
                                        crop_width=args.crop_width,
                                        crop_height=args.crop_height,
                                        is_training=False)

kirgal commented 5 years ago

You can try the same fix in test.py.
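
Rather than hard-coding the frontend name, it could be read from the command line, the way train.py already passes frontend=args.frontend in the traceback further down this thread. The following is only a sketch under that assumption: the --frontend flag for predict.py and test.py, the defaults, and the helper names make_parser and build_model_kwargs are hypothetical, and the actual model_builder.build_model call is left out so the snippet runs without TensorFlow.

```python
import argparse


def make_parser():
    """Argument parser with a hypothetical --frontend flag for predict.py / test.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="FC-DenseNet56")
    # Assumed default: ResNet101, which this thread reports as the builder's default.
    parser.add_argument("--frontend", type=str, default="ResNet101")
    return parser


def build_model_kwargs(args, num_classes):
    """Collect the keyword arguments that would be passed to model_builder.build_model."""
    return dict(
        model_name=args.model,
        frontend=args.frontend,  # must match the frontend used at training time
        num_classes=num_classes,
        is_training=False,
    )


args = make_parser().parse_args(["--model", "GCN", "--frontend", "ResNet152"])
kwargs = build_model_kwargs(args, num_classes=32)
print(kwargs["frontend"])
```

With a flag like this, the same script works for any frontend as long as the value given at test time matches the one given at training time.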

developpeur3d commented 5 years ago

I have the same issue. I changed the size of the images and it keeps giving me errors because of the checkpoints.

Preparing the model ...
2019-03-01 17:25:06.884561: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/resnet_v2_101.ckpt: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/opt/pycharm-edu-2018.1.2/helpers/pydev/pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/opt/pycharm-edu-2018.1.2/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/deepseasegm/projet/codes/contrib/Semantic-Segmentation-Suite-master/train.py", line 97, in <module>
    network, init_fn = model_builder.build_model(model_name=args.model, frontend=args.frontend, net_input=net_input, num_classes=num_classes, crop_width=args.crop_width, crop_height=args.crop_height, is_training=True)
  File "/home/deepseasegm/projet/codes/contrib/Semantic-Segmentation-Suite-master/builders/model_builder.py", line 87, in build_model
    num_classes=num_classes, is_training=is_training)
  File "/home/deepseasegm/projet/codes/contrib/Semantic-Segmentation-Suite-master/models/DeepLabV3.py", line 76, in build_deeplabv3
    logits, end_points, frontend_scope, init_fn = frontend_builder.build_frontend(inputs, frontend, pretrained_dir=pretrained_dir, is_training=is_training)
  File "/home/deepseasegm/projet/codes/contrib/Semantic-Segmentation-Suite-master/builders/frontend_builder.py", line 19, in build_frontend
    init_fn = slim.assign_from_checkpoint_fn(model_path=os.path.join(pretrained_dir, 'resnet_v2_101.ckpt'), var_list=slim.get_model_variables('resnet_v2_101'), ignore_missing_vars=True)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 680, in assign_from_checkpoint_fn
    reader = pywrap_tensorflow.NewCheckpointReader(model_path)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 302, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file models/resnet_v2_101.ckpt: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
PyDev console: using IPython 6.5.0
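
The "not an sstable (bad magic number)" message says that models/resnet_v2_101.ckpt could not be read as a checkpoint. One thing worth ruling out is that the file is something else entirely, for example an archive that was never extracted or an incomplete download. That is only a guess from the error text, not something confirmed in this thread. A minimal sketch that inspects the first bytes of a file (sniff_file is a hypothetical helper, and the category labels are made up for illustration):

```python
import os


def sniff_file(path):
    """Return a rough guess at what kind of file `path` is, based on its first bytes."""
    if not os.path.isfile(path):
        return "missing"
    with open(path, "rb") as f:
        head = f.read(262)  # enough to reach the tar 'ustar' marker at offset 257
    if not head:
        return "empty"
    if head[:2] == b"\x1f\x8b":
        return "gzip archive"
    if head[:4] == b"PK\x03\x04":
        return "zip archive"
    if head[257:262] == b"ustar":
        return "tar archive"
    if head.lstrip()[:1] == b"<":
        return "HTML or XML (possibly a download error page)"
    return "unknown (could be a valid checkpoint)"


print(sniff_file("models/resnet_v2_101.ckpt"))
```

If this reports an archive, an empty file, or an HTML page, fetching and extracting the pretrained weights again is a reasonable next step.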

mengxia1994 commented 5 years ago

I hit the same problem. Is it still not solved?

mengxia1994 commented 5 years ago

I found the reason. After training, the model checkpoint is stored as several files (for example ones ending in .index, .meta, and .data-*). The most important name, the one that ends in 'ckpt', does not show up as a file on disk by default, but it is still the path you have to pass. This behaviour comes from the TensorFlow version; the author wrote the code correctly.

What we need to do, for example, is to use predict.py like below:

python predict.py --image parking_img/B2.jpg --checkpoint_path checkpoints/latest_model_FC-DenseNet56_CamVid.ckpt --model FC-DenseNet56

I think most people write the checkpoint_path wrong.
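
To illustrate the point about checkpoint_path: with the V2 checkpoint format, the value to pass is the shared prefix of the saved files, not the name of any single file. The sketch below is an assumption-laden illustration, not part of this repository. checkpoint_prefix and prefix_is_restorable are hypothetical helpers, and the files are empty stand-ins created in a temporary directory just to show the naming.

```python
import os
import re
import tempfile

# Suffixes of the files that a V2-format Saver writes next to each other.
_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Strip a checkpoint file suffix so the result is the prefix to pass as checkpoint_path."""
    return _SUFFIX.sub("", path)


def prefix_is_restorable(prefix):
    """Rough check: a V2 checkpoint prefix should have a matching .index file."""
    return os.path.isfile(prefix + ".index")


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "latest_model_FC-DenseNet56_CamVid.ckpt")
    # Empty stand-in files that only mimic the names found in a checkpoints folder.
    for suffix in (".index", ".meta", ".data-00000-of-00001"):
        open(prefix + suffix, "wb").close()

    print(os.path.exists(prefix))  # False: no file is named exactly *.ckpt
    print(prefix_is_restorable(checkpoint_prefix(prefix + ".index")))  # True
```

So a path ending in .ckpt can be correct even though no file with exactly that name is visible in the folder.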

For reference, see: https://stackoverflow.com/questions/41048819/how-to-restore-a-model-by-filename-in-tensorflow-r12

https://votec.top/2016/12/24/tensorflow-r12-tf-train-Saver/