Closed: mychina75 closed this issue 4 years ago
@mychina75 The original paper has training details for the COCO-Stuff dataset :)
The paper uses 150K, 10K, and 20K iterations for the Cityscapes, CamVid, and COCO-Stuff datasets respectively. But COCO has far more images than Cityscapes, so why are the iteration counts so small? Could something be wrong?
@mychina75 That is a question you are more likely to get a satisfying answer to at https://github.com/ycszen/BiSeNet (the original author's repo) :)
Thank you, I will check. There is also an error when resuming training; please take a look.
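One way to reason about the iteration question is to convert iterations into approximate passes over the training set. The helper below is only a sketch: the function name is made up for illustration, and the batch size and image count in the example call are placeholder assumptions, not values taken from the paper or from this repository.

```python
def epochs_seen(iterations: int, batch_size: int, num_train_images: int) -> float:
    """Approximate number of passes over the training set.

    Each iteration consumes one batch, so the total number of images
    processed is iterations * batch_size.
    """
    return iterations * batch_size / num_train_images


# Placeholder values for illustration only (assumptions, not from the paper).
print(epochs_seen(iterations=20_000, batch_size=16, num_train_images=9_000))
```

Comparing this number across datasets says more than comparing raw iteration counts, because it accounts for how many images each dataset's training split actually contains.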
##################
2020-05-25 16:52:38.994 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:init:229 - Initialize human bisenetv2 multi gpu trainner complete
2020-05-25 16:52:41.706 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/ ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-25 16:52:42.368599: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-25 16:52:42.376 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_37]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'loader_and_saver/save/RestoreV2':
File "tools/train_bisenetv2_human.py", line 40, in
2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:333 - => Can not load pretrained model weights: ./model/coco_human/bisenetv2/
2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:334 - => Now it starts to train BiseNetV2 from scratch ...
@mychina75 Which ckpt file did you use to resume training?
I set model_checkpoint_path to "./model/coco_human/bisenetv2/" and changed the restore code as follows:
ckpt = tf.train.get_checkpoint_state(os.path.dirname(self._initial_weight))
self._loader.restore(self._sess, ckpt.model_checkpoint_path)  # was: self._initial_weight
The original code, 'self._loader.restore(self._sess, self._initial_weight)', does not work with SNAPSHOT_PATH set to './model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index' either...
@mychina75 The snapshot file path should be ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 instead:)
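The point about the snapshot path is that a TensorFlow 1.x checkpoint is addressed by its prefix, not by one of the files on disk (`.index`, `.meta`, `.data-00000-of-00001`). A minimal sketch of normalizing such a path is shown here; `checkpoint_prefix` is a hypothetical helper for illustration and is not part of this repository.

```python
import re

# File suffixes that TensorFlow 1.x writes next to a checkpoint prefix.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path: str) -> str:
    """Strip a checkpoint file suffix so the result is the bare prefix.

    Paths that are already a prefix are returned unchanged.
    """
    return _CKPT_SUFFIX.sub("", path)


print(checkpoint_prefix(
    "./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index"
))
```

The resulting prefix is the form that a call such as `self._loader.restore(self._sess, ...)` expects as its second argument.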
Uh... still the same error: Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
################
2020-05-26 09:39:59.213 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 ...
WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2020-05-26 09:39:59.880135: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
2020-05-26 09:39:59.928 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
(1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
[[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
[[loader_and_saver/save/RestoreV2/_223]]
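@mychina75 How was your ckpt file generated? Are you sure the ckpt file path was entered correctly? This error means the ckpt model file does not match the current computation graph :)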
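A mismatch between a checkpoint and the current graph can be narrowed down by comparing the two sets of variable names. The sketch below works on plain lists of names; `diff_variable_names` is a hypothetical helper, and the names in the example are shortened stand-ins rather than real keys from this model. In TensorFlow 1.x the inputs could plausibly come from `tf.train.list_variables(ckpt_prefix)` (which returns `(name, shape)` pairs) and from `[v.op.name for v in tf.global_variables()]`.

```python
def diff_variable_names(graph_names, checkpoint_names):
    """Compare variable names expected by the graph with checkpoint keys.

    Returns (missing, unused):
      missing - names the graph wants but the checkpoint lacks
      unused  - checkpoint keys the graph does not ask for
    A trailing ':0' style output suffix on graph names is ignored.
    """
    graph = {name.split(":")[0] for name in graph_names}
    ckpt = set(checkpoint_names)
    return sorted(graph - ckpt), sorted(ckpt - graph)


# Shortened, made-up names for illustration only.
graph = ["BiseNetV2/bn/beta:0", "BiseNetV2/bn/beta/Momentum:0"]
ckpt = ["BiseNetV2/bn/beta", "global_step"]
missing, unused = diff_variable_names(graph, ckpt)
print("missing from checkpoint:", missing)
print("unused checkpoint keys:", unused)
```

A `missing` list consisting only of names ending in `/Momentum` would suggest that the optimizer slot variables were never written to the checkpoint, which matches the key named in the error above.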
I did not change how the model is saved. It is in xxx_gpu_trainner.py:
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
The restore happens here:
if CFG.TRAIN.RESTORE_FROM_SNAPSHOT.ENABLE:
    try:
        LOG.info('=> Restoring weights from: {:s} ... '.format(self._initial_weight))
        self._loader.restore(self._sess, self._initial_weight)
        ...
Could this be related to the FREEZE_BN setting? The default is ENABLE: False, and the code has this check:
with tf.variable_scope(name_or_scope='moving_avg'):
    if CFG.TRAIN.FREEZE_BN.ENABLE:
        train_var_list = [
            v for v in tf.trainable_variables() if 'beta' not in v.name and 'gamma' not in v.name
        ]
    else:
        train_var_list = tf.trainable_variables()
Does this parameter need to be saved separately?
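@mychina75 BN is not frozen by default. If you are using a ckpt file saved during training, this problem should not occur. If you are using a ckpt file saved during inference, it will occur. I have tried this myself before without problems; I will test it again when I have time :)
@mychina75 Also, could you provide more detailed steps to reproduce your problem? For example, which parts of the code you modified, how you started training, how you saved the parameters, and how you restored the weights :)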
Solved. I changed this part of *_gpu_trainner.py. It seems some variables were not being saved; after the change the .meta file grew from 7.35MB to 9.09MB. It should not affect the pb file.
Before:
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(max_to_keep=5)
After:
with tf.variable_scope('loader_and_saver'):
    self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
    self._loader = tf.train.Saver(self._net_var)
    self._saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)
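The fix in this thread makes the saver write every global variable. A different option, which is not what the thread does, would be to leave optimizer slot variables out of the list given to the loader so that a checkpoint without them can still be restored. The sketch below only shows the name-based split; `split_optimizer_slots` is a hypothetical helper, the `Momentum` suffix is taken from the error message, and the example names are shortened stand-ins.

```python
def split_optimizer_slots(names, slot_suffix="Momentum"):
    """Separate model weights from optimizer slot variables by name.

    A name counts as a slot variable when its last path component
    (ignoring a trailing ':0' style suffix) equals slot_suffix.
    """
    weights, slots = [], []
    for name in names:
        base = name.split(":")[0]
        if base.endswith("/" + slot_suffix):
            slots.append(name)
        else:
            weights.append(name)
    return weights, slots


# Shortened, made-up names for illustration only.
names = ["BiseNetV2/bn/beta:0", "BiseNetV2/bn/beta/Momentum:0", "BiseNetV2/conv/w:0"]
weights, slots = split_optimizer_slots(names)
print(weights)
print(slots)
```

With this approach the optimizer state would start fresh after a restore, so it trades exact resumption for tolerance of checkpoints that lack slot variables.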
@mychina75 OK, sounds good :)
Hi, thank you for your great work. I'd like to train your model on the COCO dataset. Any suggestions for dataset preparation or training tips? Thanks~