XuyangBai / D3Feat

[TensorFlow] Official implementation of CVPR'20 oral paper - D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features https://arxiv.org/abs/2003.03164
MIT License

Error during training on 3DMatch dataset #50

Open benjaminkelenyi opened 2 years ago

Hello, thank you very much for this nice work. I'm trying to train a model using the 3DMatch dataset, but after a while, I'm getting the following error:

[1059  530   38 ...  631  144  924]
Validation : 0.0% (timings : 58.95 0.00)
2022-02-07 16:05:30.380600: E tensorflow/stream_executor/dnn.cc:613] CUDNN_STATUS_NOT_SUPPORTED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(3935): 'cudnnBatchNormalizationForwardInference( cudnn.handle(), mode, &one, &zero, x_descriptor.handle(), x.opaque(), x_descriptor.handle(), y->opaque(), scale_offset_descriptor.handle(), scale.opaque(), offset.opaque(), estimated_mean.opaque(), maybe_inv_var, epsilon)'
Traceback (most recent call last):
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
  (0) Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
  (1) Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[optimizer/gradients/KernelPointNetwork/Sum_1_grad/Fill/value/_571]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'IteratorGetNext':
  File "training_3DMatch.py", line 175, in <module>
    dataset.init_input_pipeline(config)
  File "/home/rambo/ws_benji/D3Feat/datasets/common.py", line 770, in init_input_pipeline
    self.flat_inputs = iter.get_next()
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 429, in get_next
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
         [[loss/cdist/Sqrt/_1141]]
  (1) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "training_3DMatch.py", line 207, in <module>
    trainer.train(model, dataset)
  File "/home/rambo/ws_benji/D3Feat/utils/trainer.py", line 387, in train
    self.validation(model, dataset)
  File "/home/rambo/ws_benji/D3Feat/utils/trainer.py", line 441, in validation
    desc_loss, det_loss, accuracy, ave_d_pos, ave_d_neg, dists, scores, anc_key, pos_key = self.sess.run(ops, {model.dropout_prob: 1.0})
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1 (defined at /home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[loss/cdist/Sqrt/_1141]]
  (1) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1 (defined at /home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1':
  File "training_3DMatch.py", line 189, in <module>
    model = KernelPointFCNN(dataset.flat_inputs, config)
  File "/home/rambo/ws_benji/D3Feat/models/KPFCNN_model.py", line 130, in __init__
    self.out_features, self.out_scores = assemble_FCNN_blocks(self.anchor_inputs, self.config, self.dropout_prob)
  File "/home/rambo/ws_benji/D3Feat/models/D3Feat.py", line 15, in assemble_FCNN_blocks
    F = assemble_CNN_blocks(inputs, config, dropout_prob)
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 1099, in assemble_CNN_blocks
    training)
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 242, in simple_block
    training))
  File "/home/rambo/ws_benji/D3Feat/models/network_blocks.py", line 160, in batch_norm
    training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/normalization.py", line 327, in batch_normalization
    return layer.apply(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 330, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1700, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/layers/normalization.py", line 167, in call
    return super(BatchNormalization, self).call(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 710, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 565, in _fused_batch_norm
    training, _fused_batch_norm_training, _fused_batch_norm_inference)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/utils/tf_utils.py", line 59, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/smart_cond.py", line 59, in smart_cond
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1235, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1061, in BuildCondBranch
    original_result = fn()
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/normalization.py", line 562, in _fused_batch_norm_inference
    data_format=data_format)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_impl.py", line 1502, in fused_batch_norm
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 4620, in fused_batch_norm_v3
    name=name)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/rambo/anaconda3/envs/tf_n1.15/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2022-02-07 16:05:32.189426: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.

The error comes from this section of code: [image]

Do you have any idea why this is happening?

Thanks a lot, Benjamin
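
The log above contains two different root errors: an OutOfRangeError ("End of sequence") from IteratorGetNext, and an InternalError ("cuDNN launch failure") from the FusedBatchNormV3_1 node. When reading a trace this long it can help to pull out just the failing node and the input shape it was given. The following is a minimal sketch for doing that, not part of the D3Feat code base; the helper name `summarize_cudnn_failures` is hypothetical, and it assumes the failing node is printed on the line directly after the "input shape" line, as it is in this log.

```python
import re

# Matches lines such as:
#   (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
SHAPE_RE = re.compile(r"cuDNN launch failure\s*:\s*input shape \(\[([\d,\s]+)\]\)")

# Matches the node name in lines such as:
#   [[{{node KernelPointNetwork/.../FusedBatchNormV3_1}}]]
#   [[node KernelPointNetwork/.../FusedBatchNormV3_1 (defined at ...) ]]
NODE_RE = re.compile(r"\[\[\{*(?:node )?([\w/]+)")


def summarize_cudnn_failures(log_text):
    """Return unique (node_name, shape) pairs for cuDNN launch failures.

    Hypothetical helper: it assumes the node name appears on the line
    that immediately follows the "input shape" line.
    """
    lines = log_text.splitlines()
    found = []
    for i, line in enumerate(lines):
        shape_match = SHAPE_RE.search(line)
        if shape_match is None:
            continue
        shape = tuple(int(dim) for dim in shape_match.group(1).split(","))
        node = None
        if i + 1 < len(lines):
            node_match = NODE_RE.search(lines[i + 1])
            if node_match is not None:
                node = node_match.group(1)
        pair = (node, shape)
        if pair not in found:  # the same failure is reported several times
            found.append(pair)
    return found


# A few lines copied from the log in this issue.
sample_log = """\
  (0) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
         [[loss/cdist/Sqrt/_1141]]
  (1) Internal: cuDNN launch failure : input shape ([68418,64,1,1])
         [[{{node KernelPointNetwork/layer_0/simple_0/batch_normalization/cond/FusedBatchNormV3_1}}]]
"""

for node, shape in summarize_cudnn_failures(sample_log):
    print(node, shape)
```

For this log the summary reduces to a single failing node (the inference branch of the first batch normalization layer) with input shape (68418, 64, 1, 1), which is the piece of information worth quoting when searching for or reporting the problem.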