error while training with VGGnet

hemavakade commented 7 years ago

Env details: GPU: TITAN X (Pascal); CUDA 8.0 + cuDNN 5.1; TensorFlow version: 0.11.0

I followed the README and ran the following command:

python ./faster_rcnn/train_net.py --gpu 0 --weights ./data/pretrain_model/VGG_imagenet.npy --imdb voc_2007_trainval --iters 70000 --cfg  ./experiments/cfgs/faster_rcnn_end2end.yml --network VGGnet_train --set EXP_DIR exp_dir

I get the following error:

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
kittivoc_train
kittivoc_val
kittivoc_trainval
kittivoc_test
kittivoc_train
kittivoc_val
kittivoc_trainval
kittivoc_test
kittivoc_train
kittivoc_val
kittivoc_trainval
kittivoc_test
nthu_71
nthu_370
Called with args:
Namespace(cfg_file='./experiments/cfgs/faster_rcnn_end2end.yml', gpu_id=0, imdb_name='voc_2007_trainval', max_iters=70000, network_name='VGGnet_train', pretrained_model='./data/pretrain_model/VGG_imagenet.npy', randomize=False, restore=1, set_cfgs=['EXP_DIR', 'exp_dir'], solver=None)
Using config:
{'ANCHOR_SCALES': [8, 16, 32],
 'DATA_DIR': '/home/hema/TFFRCNN/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'exp_dir',
 'GPU_ID': 0,
 'IS_EXTRAPOLATING': True,
 'IS_MULTISCALE': False,
 'IS_RPN': True,
 'LOG_DIR': 'faster_rcnn_voc',
 'MATLAB': 'matlab',
 'MODELS_DIR': '/home/hema/TFFRCNN/models/pascal_voc',
 'NCLASSES': 21,
 'NET_NAME': 'VGGnet',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'REGION_PROPOSAL': 'RPN',
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/hema/TFFRCNN',
 'SUBCLS_NAME': 'voxel_exemplars',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'selective_search',
          'RPN_MIN_SIZE': 16,
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECTS': [1],
           'ASPECT_GROUPING': True,
           'BATCH_SIZE': 300,
           'BBOX_INSIDE_WEIGHTS': [1, 1, 1, 1],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'DISPLAY': 10,
           'DONTCARE_AREA_INTERSECTION_HI': 0.5,
           'FG_FRACTION': 0.3,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'KERNEL_SIZE': 5,
           'LEARNING_RATE': 0.001,
           'LOG_IMAGE_ITERS': 100,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'OHEM': False,
           'PRECLUDE_HARD_SAMPLES': True,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1, 1, 1, 1],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_MIN_SIZE': 16,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SCALES_BASE': [0.25, 0.5, 1.0, 2.0, 3.0],
           'SNAPSHOT_INFIX': '',
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_PREFIX': 'VGGnet_fast_rcnn',
           'SOLVER': 'Momentum',
           'STEPSIZE': 60000,
           'USE_FLIPPED': True,
           'USE_PREFETCH': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
<bound method pascal_voc.default_roidb of <lib.datasets.pascal_voc.pascal_voc object at 0x7f8916ad2590>>
Loaded dataset `voc_2007_trainval` for training
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/hema/TFFRCNN/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
Output will be saved to `/home/hema/TFFRCNN/output/exp_dir/voc_2007_trainval`
Logs will be saved to `/home/hema/TFFRCNN/logs/faster_rcnn_voc/voc_2007_trainval/2017-03-07-10-29-54`
/gpu:0
Tensor("data:0", shape=(?, ?, ?, 3), dtype=float32)
Tensor("conv5_3/conv5_3:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_conv/3x3/rpn_conv/3x3:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_conv/3x3/rpn_conv/3x3:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("rpn_cls_score/rpn_cls_score:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("gt_boxes:0", shape=(?, 5), dtype=float32)
Tensor("gt_ishard:0", shape=(?,), dtype=int32)
Tensor("dontcare_areas:0", shape=(?, 4), dtype=float32)
Tensor("im_info:0", shape=(?, 3), dtype=float32)
Tensor("rpn_cls_score/rpn_cls_score:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_cls_prob:0", shape=(?, ?, ?, ?), dtype=float32)
Tensor("Reshape_2:0", shape=(?, ?, ?, 18), dtype=float32)
Tensor("rpn_bbox_pred/rpn_bbox_pred:0", shape=(?, ?, ?, 36), dtype=float32)
Tensor("im_info:0", shape=(?, 3), dtype=float32)
Tensor("rpn_rois:0", shape=(?, 5), dtype=float32)
Tensor("gt_boxes:0", shape=(?, 5), dtype=float32)
Tensor("gt_ishard:0", shape=(?,), dtype=int32)
Tensor("dontcare_areas:0", shape=(?, 4), dtype=float32)
Tensor("conv5_3/conv5_3:0", shape=(?, ?, ?, 512), dtype=float32)
Tensor("roi-data/rois:0", shape=(?, 5), dtype=float32)
[<tf.Tensor 'conv5_3/conv5_3:0' shape=(?, ?, ?, 512) dtype=float32>, <tf.Tensor 'roi-data/rois:0' shape=(?, 5) dtype=float32>]
Tensor("drop7/mul:0", shape=(?, 4096), dtype=float32)
Use network `VGGnet_train` in training
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:01:00.0
Total memory: 11.90GiB
Free memory: 11.22GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:01:00.0)
hema added filename /home/hema/TFFRCNN/output/exp_dir/voc_2007_trainval
Computing bounding-box regression targets...
bbox target means:
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
[ 0.  0.  0.  0.]
bbox target stdevs:
[[ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]
 [ 0.1  0.1  0.2  0.2]]
[ 0.1  0.1  0.2  0.2]
Normalizing targets
done
Solving...
/home/hema/anaconda2/envs/rcnn/lib/python2.7/site-packages/tensorflow/python/ops/gradients.py:90: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Traceback (most recent call last):
  File "./faster_rcnn/train_net.py", line 109, in <module>
    restore=bool(int(args.restore)))
  File "./faster_rcnn/../lib/fast_rcnn/train.py", line 398, in train_net
    sw.train_model(sess, max_iters, restore=restore)
  File "./faster_rcnn/../lib/fast_rcnn/train.py", line 169, in train_model
    raise 'Check your pretrained {:s}'.format(ckpt.model_checkpoint_path)
AttributeError: 'NoneType' object has no attribute 'model_checkpoint_path'

IMO, there is no checkpoint before the model runs the first iteration. How do I get around this error?
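For context, `tf.train.get_checkpoint_state` returns `None` when the directory holds no checkpoint, and the `except` branch at line 169 of train.py then dereferences that `None` again while building its message. The sketch below reproduces only that control flow in plain Python, without TensorFlow. The function names are made up for illustration, and `ckpt` stands in for the value returned by `get_checkpoint_state`.

```python
def failing_restore(ckpt):
    """Mimic the try/except shape seen in the traceback.

    `ckpt` is a stand-in for the result of tf.train.get_checkpoint_state;
    it is None when no checkpoint has been written yet.
    """
    try:
        return ckpt.model_checkpoint_path  # AttributeError when ckpt is None
    except Exception:
        # The handler touches ckpt again while formatting the message, so a
        # second AttributeError is raised here before RuntimeError is built.
        raise RuntimeError(
            'Check your pretrained {:s}'.format(ckpt.model_checkpoint_path))


def error_for_fresh_run():
    """Return the error text produced when no checkpoint exists."""
    try:
        failing_restore(None)
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `error_for_fresh_run()` yields an `AttributeError` message that mentions `model_checkpoint_path`, which matches the last line of the traceback above.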

hemavakade commented 7 years ago

I solved it by adding the following `if` condition to the train_model function in train.py:

if restore_iter > 0:  # Change added: skip the restore on a fresh run
    if restore:
        try:
            ckpt = tf.train.get_checkpoint_state(self.output_dir)
            print 'Restoring from {}...'.format(ckpt.model_checkpoint_path),
            self.saver.restore(sess, ckpt.model_checkpoint_path)
            # The checkpoint name ends in the iteration number
            stem = os.path.splitext(os.path.basename(ckpt.model_checkpoint_path))[0]
            restore_iter = int(stem.split('_')[-1])
            sess.run(global_step.assign(restore_iter))
            print 'done'
        except Exception:
            # ckpt may be None here, so do not dereference it in the message
            raise RuntimeError('Check your pretrained checkpoint in {:s}'.format(self.output_dir))
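Assuming `restore_iter` starts at 0 on every launch, the guard above also skips the restore when a checkpoint does exist. A possible alternative is to test the checkpoint state itself. The following is a minimal sketch of that idea in plain Python. `resume_iteration` and `FakeState` are hypothetical names, and the example file name only follows the `SNAPSHOT_PREFIX` shown in the config, it is not taken from an actual run.

```python
import os
from collections import namedtuple

# Hypothetical stand-in for the object returned by
# tf.train.get_checkpoint_state; only the one attribute used here is modelled.
FakeState = namedtuple('FakeState', ['model_checkpoint_path'])


def resume_iteration(ckpt):
    """Return the iteration to resume from, or 0 when nothing can be restored.

    `ckpt` is None when the output directory holds no checkpoint.
    """
    if ckpt is None or not ckpt.model_checkpoint_path:
        return 0
    # Same parsing as in the snippet above: the file stem ends in "_<iter>".
    stem = os.path.splitext(os.path.basename(ckpt.model_checkpoint_path))[0]
    return int(stem.split('_')[-1])


fresh = resume_iteration(None)
resumed = resume_iteration(FakeState('/tmp/out/VGGnet_fast_rcnn_iter_5000.ckpt'))
```

With this shape, `train_model` would call `self.saver.restore` only when the returned iteration is greater than 0, so a fresh run starts from the pretrained weights and a later run resumes from the snapshot.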

CharlesShang commented 7 years ago

Great to hear that, but it hasn't actually been solved. TF 1.0 changed the restore API, so there is no model_checkpoint_path.

You can restore a pretrained model with:

restorer = tf.train.Saver() # restoring all the vars
restorer.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
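Note that `tf.train.latest_checkpoint` returns `None` when the directory contains no checkpoint, so the two lines above may still fail on a first run. Below is a hedged sketch of a guarded version. `restore_latest` is a hypothetical helper, and the lookup function is passed in as an argument so the logic can be exercised without TensorFlow installed.

```python
def restore_latest(saver, sess, checkpoint_dir, latest_checkpoint):
    """Restore the newest checkpoint if one exists.

    saver             -- object with a restore(sess, path) method,
                         e.g. tf.train.Saver()
    latest_checkpoint -- callable mapping a directory to a checkpoint path
                         or None, e.g. tf.train.latest_checkpoint
    Returns the restored path, or None when there was nothing to restore.
    """
    path = latest_checkpoint(checkpoint_dir)
    if path is None:
        # Fresh run: keep the freshly initialised / pretrained variables.
        return None
    saver.restore(sess, path)
    return path
```

In the training script this might be called as `restore_latest(tf.train.Saver(), sess, checkpoint_dir, tf.train.latest_checkpoint)`, with a `None` result meaning training starts from iteration 0.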

CharlesShang / TFFRCNN

error while training with VGGnet #28