mychina75 commented 4 years ago

Hi, thank you for your great work. I'd like to train your model on the COCO dataset. Do you have any suggestions for dataset preparation or training tips? Thanks~

MaybeShewill-CV commented 4 years ago

@mychina75 The original paper has training details for the COCO-Stuff dataset :)

mychina75 commented 4 years ago

The paper uses 150K, 10K, and 20K iterations for the Cityscapes, CamVid, and COCO-Stuff datasets respectively. But COCO has far more images than Cityscapes, so why are the iteration counts so small? Could something be wrong?

MaybeShewill-CV commented 4 years ago

@mychina75 That's a question you may get a satisfying answer to at https://github.com/ycszen/BiSeNet (the original author's repo) :)

mychina75 commented 4 years ago

Thank you, I will check. Also, there is an error when resuming training. Please take a look.

##################

    2020-05-25 16:52:38.994 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:__init__:229 - Initialize human bisenetv2 multi gpu trainner complete
    2020-05-25 16:52:41.706 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/ ...
    WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use standard file APIs to check for files with this prefix.
    2020-05-25 16:52:42.368599: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
    2020-05-25 16:52:42.376 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

    2 root error(s) found.
      (0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
         [[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
      (1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
         [[node loader_and_saver/save/RestoreV2 (defined at /opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
         [[loader_and_saver/save/RestoreV2/_37]]
    0 successful operations.
    0 derived errors ignored.

    Original stack trace for 'loader_and_saver/save/RestoreV2':
      File "tools/train_bisenetv2_human.py", line 40, in <module>
        train_model()
      File "tools/train_bisenetv2_human.py", line 27, in train_model
        worker = multi_gpu_trainner.BiseNetV2HumanMultiTrainer() #MultiTrainer()
      File "/opt/project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py", line 201, in __init__
        self._loader = tf.train.Saver(self._net_var)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 825, in __init__
        self.build()
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 837, in build
        self._build(self._filename, build_save=True, build_restore=True)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 875, in _build
        build_restore=build_restore)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 508, in _build_internal
        restore_sequentially, reshape)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
        restore_sequentially)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
        return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
        name=name)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
        op_def=op_def)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
        return func(*args, **kwargs)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
        op_def=op_def)
      File "/opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
        self._traceback = tf_stack.extract_stack()

    2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:333 - => Can not load pretrained model weights: ./model/coco_human/bisenetv2/
    2020-05-25 16:52:42.377 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:334 - => Now it starts to train BiseNetV2 from scratch ...

MaybeShewill-CV commented 4 years ago

@mychina75 Which ckpt file did you use to resume training?

mychina75 commented 4 years ago

I set the model_checkpoint_path to "./model/coco_human/bisenetv2/" and made some changes to the restore code:

    ckpt = tf.train.get_checkpoint_state(os.path.dirname(self._initial_weight))
    self._loader.restore(self._sess, ckpt.model_checkpoint_path)  # originally self._initial_weight

The original code, 'self._loader.restore(self._sess, self._initial_weight)', does not work with SNAPSHOT_PATH set to './model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index' either...

MaybeShewill-CV commented 4 years ago

@mychina75 The snapshot file path should be ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 instead:)
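As a rough illustration of the path advice above: a TensorFlow checkpoint is addressed by its prefix, not by one of the files written next to it (`.index`, `.meta`, `.data-…`). The sketch below uses a hypothetical helper, `checkpoint_prefix`, which is not part of this repo, to strip such a suffix before the path is handed to `restore`.

```python
import os

# File suffixes that TensorFlow writes alongside a checkpoint prefix.
_CKPT_SUFFIXES = (".index", ".meta")


def checkpoint_prefix(path):
    """Return the checkpoint prefix for a path (hypothetical helper).

    Strips a trailing .index / .meta / .data-XXXXX-of-XXXXX suffix and
    leaves any other path unchanged.
    """
    for suffix in _CKPT_SUFFIXES:
        if path.endswith(suffix):
            return path[: -len(suffix)]
    base, ext = os.path.splitext(path)
    if ext.startswith(".data-"):
        return base
    return path


print(checkpoint_prefix(
    "./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1.index"
))
```

The returned prefix is the form that `self._loader.restore(self._sess, ...)` expects in the original code.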

mychina75 commented 4 years ago

Hmm... still the same error: Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint

################

    2020-05-26 09:39:59.213 | INFO | trainner.human_bisenetv2_multi_gpu_trainner:train:319 - => Restoring weights from: ./model/coco_human/bisenetv2/human_train_miou=0.4369.ckpt-1 ...
    WARNING:tensorflow:From /opt/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use standard file APIs to check for files with this prefix.
    2020-05-26 09:39:59.880135: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
    2020-05-26 09:39:59.928 | ERROR | trainner.human_bisenetv2_multi_gpu_trainner:train:332 - Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

    2 root error(s) found.
      (0) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
         [[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
      (1) Not found: Key BiseNetV2/aggregation_branch/guided_aggregation_block/aggregation_features/aggregation_feature_output/bn/beta/Momentum not found in checkpoint
         [[node loader_and_saver/save/RestoreV2 (defined at /project/semantic_segmentation/bisenetv2-tensorflow-master/trainner/human_bisenetv2_multi_gpu_trainner.py:201) ]]
         [[loader_and_saver/save/RestoreV2/_223]]

MaybeShewill-CV commented 4 years ago

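@mychina75 How was your ckpt file generated? Are you sure the path to the ckpt file was entered correctly? This error means that the ckpt model file does not match the current computation graph :)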
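One way to narrow down such a mismatch is to list which variable names the graph expects but the checkpoint lacks. The following is only a sketch: `missing_checkpoint_keys` is a hypothetical helper, and the short names in the example are made up for illustration. With TF 1.x the real inputs could come from `[v.name for v in tf.global_variables()]` and `[name for name, _ in tf.train.list_variables(ckpt_path)]`.

```python
def missing_checkpoint_keys(graph_var_names, ckpt_var_names):
    """Return graph variable names that have no matching checkpoint key.

    Graph variable names carry an output suffix such as ":0";
    checkpoint keys do not, so the suffix is stripped before comparing.
    """
    ckpt_keys = set(ckpt_var_names)
    stripped = (name.split(":")[0] for name in graph_var_names)
    return sorted(name for name in stripped if name not in ckpt_keys)


# Hypothetical names, for illustration only.
graph_names = ["net/bn/beta:0", "net/bn/beta/Momentum:0", "net/conv/w:0"]
ckpt_names = ["net/bn/beta", "net/conv/w"]
print(missing_checkpoint_keys(graph_names, ckpt_names))
```

Any name this reports is a key that `Saver.restore` would fail on with a "not found in checkpoint" error like the one in the log.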

mychina75 commented 4 years ago

I didn't change the model saving code. It's in xxx_gpu_trainner.py:

    # define saver and loader
    with tf.variable_scope('loader_and_saver'):
        self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
        self._loader = tf.train.Saver(self._net_var)
        self._saver = tf.train.Saver(max_to_keep=5)

The restore happens here:

    if CFG.TRAIN.RESTORE_FROM_SNAPSHOT.ENABLE:
        try:
            LOG.info('=> Restoring weights from: {:s} ... '.format(self._initial_weight))
            self._loader.restore(self._sess, self._initial_weight)
    ...

Could this be related to the FREEZE_BN setting? The default is ENABLE: False, and the code checks it:

    # define moving average op
    with tf.variable_scope(name_or_scope='moving_avg'):
        if CFG.TRAIN.FREEZE_BN.ENABLE:
            train_var_list = [
                v for v in tf.trainable_variables() if 'beta' not in v.name and 'gamma' not in v.name
            ]
        else:
            train_var_list = tf.trainable_variables()

Does this parameter need to be saved separately?

MaybeShewill-CV commented 4 years ago

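@mychina75 BN is not frozen by default. If you are using a ckpt file saved during training, this problem should not occur. If you are using a ckpt file saved during inference, it will. I have tried this myself before without problems. I'll test it again when I have time :)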

MaybeShewill-CV commented 4 years ago

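@mychina75 Also, could you provide more detailed steps to reproduce your problem? For example, which parts of the code you modified, how you started training, how you saved the parameters, and how you restored the weights :)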

mychina75 commented 4 years ago

Solved. I changed this part of *_gpu_trainner.py. It seems some variables were not being saved: after the change, the .meta file grew from 7.35MB to 9.09MB. It should not affect the pb file.

After the change (the saver is given tf.global_variables() explicitly):

    # define saver and loader
    with tf.variable_scope('loader_and_saver'):
        self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
        self._loader = tf.train.Saver(self._net_var)
        self._saver = tf.train.Saver(tf.global_variables(), max_to_keep=5)

Original code:

    # define saver and loader
    with tf.variable_scope('loader_and_saver'):
        self._net_var = [vv for vv in tf.global_variables() if 'lr' not in vv.name]
        self._loader = tf.train.Saver(self._net_var)
        self._saver = tf.train.Saver(max_to_keep=5)
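To see why the loader asks for a `.../beta/Momentum` key at all, it may help to apply the same `'lr' not in vv.name` filter to plain name strings. The sketch below uses made-up short names and two hypothetical helpers. It suggests that the filter drops only names containing `lr`, so optimizer slot variables stay in the restore list. The second helper is one possible variation that skips Momentum slots. It is not the approach taken in this thread.

```python
def loader_var_names(all_var_names):
    """Mimic the thread's filter: keep names that do not contain 'lr'."""
    return [name for name in all_var_names if "lr" not in name]


def without_momentum_slots(var_names):
    """Hypothetical variation: also drop optimizer Momentum slot variables."""
    return [n for n in var_names if not n.split(":")[0].endswith("/Momentum")]


# Hypothetical variable names, for illustration only.
names = ["net/bn/beta:0", "net/bn/beta/Momentum:0", "learning_rate/lr:0"]

kept = loader_var_names(names)
print(kept)                          # Momentum slot is still in the list
print(without_momentum_slots(kept))  # weights only
```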

MaybeShewill-CV commented 4 years ago

@mychina75 OK, sounds good :)

MaybeShewill-CV / bisenetv2-tensorflow

about coco db training #6
