Bugs with 3d_detector.py

MuMuJun97 commented 4 years ago

Hi! Thank you very much for sharing the code for your paper!

I deployed your code in my own environment and wrote a program that reads the JRDB dataset and publishes it over ROS. When I run 3d_detector.py, execution reaches the get_3d_feature() callback function.

However, the program crashes and aborts almost immediately. Stepping through with breakpoints, I found that it aborts here:

# jpda_rospack/src/featurepointnet_model.py
try:
    batch_centers, \
    batch_heading_scores, batch_heading_residuals, \
    batch_size_scores, batch_size_residuals, batch_features = \
        self.sess.run([self.ops['center'],
                       ep['heading_scores'], ep['heading_residuals'],
                       ep['size_scores'], ep['size_residuals'], self.ops['depth_feature']],
                      feed_dict=feed_dict)
except Exception as e:
    print(e)

and the error message is as follows:

[INFO] [1600131005.769569]: 3D detector ready.
2020-09-15 08:50:08.476800: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 6021 (compatibility version 6000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2020-09-15 08:50:08.477254: W ./tensorflow/stream_executor/stream.h:1939] attempting to perform DNN operation using StreamExecutor without DNN support
cuDNN launch failure : input shape ([4,1024,1,64])
     [[Node: conv1/bn/cond/FusedBatchNorm_1 = FusedBatchNorm[T=DT_FLOAT, data_format="NHWC", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv1/bn/cond/FusedBatchNorm_1/Switch, conv1/bn/cond/FusedBatchNorm_1/Switch_1, conv1/bn/cond/FusedBatchNorm_1/Switch_2, conv1/bn/cond_1/AssignMovingAvg/sub/Switch, conv1/bn/cond_1/AssignMovingAvg_1/sub/Switch)]]
Caused by op 'conv1/bn/cond/FusedBatchNorm_1', defined at:
  File "/home/zlin/software/pycharm-2020-2/plugins/python/helpers/pydev/pydevd.py", line 2141, in <module>
    main()
  File "/home/zlin/software/pycharm-2020-2/plugins/python/helpers/pydev/pydevd.py", line 2132, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/zlin/software/pycharm-2020-2/plugins/python/helpers/pydev/pydevd.py", line 1441, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/home/zlin/software/pycharm-2020-2/plugins/python/helpers/pydev/pydevd.py", line 1448, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/zlin/software/pycharm-2020-2/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/3d_detector.py", line 203, in <module>
    main(sys.argv)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/3d_detector.py", line 197, in main
    Detector_3d()
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/3d_detector.py", line 51, in __init__
    self.depth_model = create_depth_model('FPointNet', fpointnet_config)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_model.py", line 339, in create_depth_model
    return FPointNet(config_path)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_model.py", line 28, in __init__
    end_points, depth_feature = self.get_model(pointclouds_pl, one_hot_vec_pl, is_training_pl)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_model.py", line 271, in get_model
    is_training, bn_decay, end_points)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_model.py", line 154, in get_instance_seg_v1_net
    scope='conv1', bn_decay=bn_decay)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_tf_util.py", line 181, in conv2d
    data_format=data_format)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_tf_util.py", line 577, in batch_norm_for_conv2d
    return batch_norm_template(inputs, is_training, scope, [0,1,2], bn_decay, data_format)
  File "/home/zlin/PycharmProjects/MOT/catkin_ws_jrmot/src/jpda_rospack/src/featurepointnet_tf_util.py", line 531, in batch_norm_template
    data_format=data_format)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 592, in batch_norm
    scope=scope)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 401, in _fused_batch_norm
    _fused_batch_norm_inference)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 217, in smart_cond
    return control_flow_ops.cond(pred, fn1, fn2, name)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 316, in new_func
    return func(*args, **kwargs)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1864, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1725, in BuildCondBranch
    original_result = fn()
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 398, in _fused_batch_norm_inference
    data_format=data_format)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 831, in fused_batch_norm
    name=name)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 2034, in _fused_batch_norm
    is_training=is_training, name=name)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/zlin/anaconda3/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InternalError (see above for traceback): cuDNN launch failure : input shape ([4,1024,1,64])
     [[Node: conv1/bn/cond/FusedBatchNorm_1 = FusedBatchNorm[T=DT_FLOAT, data_format="NHWC", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](conv1/bn/cond/FusedBatchNorm_1/Switch, conv1/bn/cond/FusedBatchNorm_1/Switch_1, conv1/bn/cond/FusedBatchNorm_1/Switch_2, conv1/bn/cond_1/AssignMovingAvg/sub/Switch, conv1/bn/cond_1/AssignMovingAvg_1/sub/Switch)]]

mvpatel2000 commented 4 years ago

This looks like a cuDNN / CUDA version issue, based on the "cuDNN launch failure" error. Can you please tell us which versions you are using? @abhijeetshenoi do you have any other ideas?

MuMuJun97 commented 4 years ago

@mvpatel2000 Thank you very much for the prompt response. This is my host environment:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

---------
cudnn: 
libcudnn.so.6.0.21

----------
tensorflow-gpu                1.4.0
tensorflow-tensorboard        0.4.0

torch                         1.0.0
torchvision                   0.2.2.post3

I am now going to reinstall cuDNN 7 to try to resolve it.
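For anyone collecting the same details, here is a minimal sketch of pulling the version numbers out of this kind of output. The helper names `parse_nvcc_release` and `parse_cudnn_soname` are hypothetical and exist only for illustration. The sketch assumes the `nvcc -V` output contains a "release X.Y" token and that the cuDNN library file is named like `libcudnn.so.<major>.<minor>.<patch>`, as in the listing above.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return the CUDA toolkit release (e.g. '8.0') from `nvcc -V` text, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def parse_cudnn_soname(filename):
    """Return (major, minor, patch) from a name like 'libcudnn.so.6.0.21', or None."""
    match = re.fullmatch(r"libcudnn\.so\.(\d+)\.(\d+)\.(\d+)", filename)
    return tuple(int(part) for part in match.groups()) if match else None


# Sample strings copied from the environment listing above.
print(parse_nvcc_release("Cuda compilation tools, release 8.0, V8.0.61"))
print(parse_cudnn_soname("libcudnn.so.6.0.21"))
```

Comparing these values with the versions in the TensorFlow error message shows which side of the mismatch needs to change.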

abhijeetshenoi commented 4 years ago

I assume the cuDNN reinstall will fix this error: "Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 6021 (compatibility version 6000)". Let us know if that doesn't fix it.
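As a rough illustration of how to read the integers in that message, the sketch below decodes them into major, minor, and patch parts. It assumes the encoding used by cuDNN releases of that era, major*1000 + minor*100 + patch. The function names are hypothetical, and `same_major` is a deliberately simplified check, not the exact rule TensorFlow applies.

```python
def decode_cudnn_version(code):
    """Split an integer such as 7102 into (major, minor, patch).

    Assumes the encoding major*1000 + minor*100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def same_major(runtime_code, compiled_code):
    """Simplified, hypothetical check: compare only the major versions."""
    return decode_cudnn_version(runtime_code)[0] == decode_cudnn_version(compiled_code)[0]


runtime_code, compiled_code = 7102, 6021  # values from the error message
print(decode_cudnn_version(runtime_code))      # (7, 1, 2)
print(decode_cudnn_version(compiled_code))     # (6, 0, 21)
print(same_major(runtime_code, compiled_code)) # False
```

Read this way, the message says the loaded library is cuDNN 7.1.2 while the installed TensorFlow binary was compiled against cuDNN 6.0.21.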

MuMuJun97 commented 4 years ago

@abhijeetshenoi

E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 6021 (compatibility version 6000).

The problem above is mainly caused by a TensorFlow version mismatch. I think your repository code was built in a TensorFlow 1.8 environment, while I had TensorFlow 1.4 installed, so running your model gave me this error. It was fixed when I installed TensorFlow 1.8.

StanfordVL / JRMOT_ROS

Bugs with 3d_detector.py #9