Training with custom dataset

ars359 commented 3 years ago

After modifying the kitti_dataset.py file to take custom data as input, I am facing this error while training. I am using 2080Ti, 2 stacks, for training.

Traceback (most recent call last):
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x == y did not hold element-wise:] [x (IsNan_9:0) = ] [1] [y (assert_equal_1/y:0) = ] [0]
    [[{{node assert_equal_1/Assert/AssertGuard/Assert}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 598, in <module>
    results = sess.run(fetches, feed_dict=total_feed_dict)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [] [Condition x == y did not hold element-wise:] [x (IsNan_9:0) = ] [1] [y (assert_equal_1/y:0) = ] [0]
    [[node assert_equal_1/Assert/AssertGuard/Assert (defined at /home/aeye/Point-GNN-master/models/models.py:309) ]]

Original stack trace for 'assert_equal_1/Assert/AssertGuard/Assert':
  File "train.py", line 251, in <module>
    t_loss_dict = model.loss(t_logits, t_class_labels, t_pred_box,t_encoded_gt_boxes, t_valid_gt_boxes, config['loss'])
  File "/home/aeye/Point-GNN-master/models/models.py", line 309, in loss
    False)]):
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/check_ops.py", line 557, in assert_equal
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 193, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 171, in Assert
    guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1988, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in BuildCondBranch
    original_result = fn()
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 169, in true_assert
    condition, data, summarize, name="Assert")
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 74, in _assert
    name=name)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/aeye/Documents/virtual_Aeye/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

WeijingShi commented 3 years ago

Hi @ars359, the assertion fired because the localization loss became NaN. Could you try a smaller learning rate and see whether it still occurs? If it does, another possible source is a custom sample that produces an empty point cloud. You might disable the random shuffle in the data loader and locate the sample_idx that is causing the problem. Just another thought: have you tried training on a single GPU?
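To make the "locate the sample_idx" suggestion concrete, here is a minimal sketch of a pre-training scan over a custom dataset. It is not part of Point-GNN: `find_bad_samples` and the `load_points` callable are hypothetical names, and the sketch assumes each sample can be loaded as an (N, C) float array. It walks the samples in a fixed order (no shuffling) and reports those that are empty or contain NaN/Inf values.

```python
import numpy as np


def find_bad_samples(load_points, num_samples):
    """Return (index, reason) pairs for samples that could yield a NaN loss.

    load_points is a hypothetical callable mapping a sample index to an
    (N, C) array of point coordinates and features.
    """
    bad = []
    for idx in range(num_samples):  # fixed order, so indices are reproducible
        points = np.asarray(load_points(idx), dtype=np.float32)
        if points.ndim != 2 or points.shape[0] == 0:
            bad.append((idx, "empty point cloud"))
        elif not np.isfinite(points).all():
            bad.append((idx, "NaN or Inf value"))
    return bad


# Toy stand-in for a custom dataset: one good, one empty, one with a NaN.
toy_dataset = [
    np.ones((100, 4)),
    np.zeros((0, 4)),
    np.array([[1.0, np.nan, 0.0, 0.5]]),
]
for idx, reason in find_bad_samples(lambda i: toy_dataset[i], len(toy_dataset)):
    print(idx, reason)
```

If the scan comes back clean, the learning rate or the multi-GPU setup become the more likely suspects.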

pytholic commented 3 years ago

Hi @ars359. I have some custom point clouds that I want to use for training. Can you guide me on how to proceed? I cannot find any proper guidance on how to work with custom datasets.

aastha3 commented 2 years ago

I am looking to train on a custom dataset as well, but I can't find any documentation on that. @WeijingShi, if you have outlined something in a doc/talk/gitpage/blog, anything that helps even remotely, please drop it in the comments. I will be forever grateful to you. Thanks.

WeijingShi commented 2 years ago

Hi @aastha3,

Sorry for the late reply.

The fastest way may be to mimic the KITTI dataset and prepare your data in the same way. Essentially, we need the point cloud bin files, images (used in visualization), label text files, and calibration files. The KITTI website has sample data, and I found that the readme file in the toolkit clarifies things a lot.
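As a rough illustration of the KITTI-style files mentioned above, the sketch below writes a point cloud as a flat float32 `.bin` file (x, y, z, reflectance per point) and formats one label line with the 15 fields described in the KITTI object devkit readme. The helper names are hypothetical, and the placeholder values for `truncated`, `occluded`, `alpha`, and the 2D `bbox` are assumptions; whether the training and evaluation code tolerates such placeholders should be checked against the repository.

```python
import os
import tempfile

import numpy as np


def write_kitti_bin(path, points_xyzr):
    """Write an (N, 4) array of x, y, z, reflectance as raw float32."""
    pts = np.asarray(points_xyzr, dtype=np.float32)
    if pts.ndim != 2 or pts.shape[1] != 4:
        raise ValueError("expected an (N, 4) array")
    pts.tofile(path)


def read_kitti_bin(path):
    """Read a KITTI-style velodyne file back into an (N, 4) array."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)


def format_kitti_label(cls, h, w, l, x, y, z, yaw,
                       truncated=0.0, occluded=0, alpha=-10.0,
                       bbox=(0.0, 0.0, 0.0, 0.0)):
    """Format one object: type, truncated, occluded, alpha, 2D bbox (4),
    dimensions h w l, location x y z (camera frame), rotation_y."""
    fields = [cls, "%.2f" % truncated, "%d" % occluded, "%.2f" % alpha]
    fields += ["%.2f" % v for v in bbox]
    fields += ["%.2f" % v for v in (h, w, l, x, y, z, yaw)]
    return " ".join(fields)


# Round-trip a tiny point cloud and print one label line.
tmp_dir = tempfile.mkdtemp()
bin_path = os.path.join(tmp_dir, "000000.bin")
write_kitti_bin(bin_path, [[10.0, 2.0, 1.0, 0.3], [12.0, -1.0, 0.5, 0.1]])
print(read_kitti_bin(bin_path).shape)
print(format_kitti_label("Car", 1.5, 1.6, 3.9, 2.0, 1.7, 20.0, -1.57))
```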

For a slightly deeper look, we just need the point cloud and the labels in the same coordinate frame. In the KITTI dataset, we read the points here: https://github.com/WeijingShi/Point-GNN/blob/48f3d79d5b101d3a4b8439ba74c92fcad4f7cab0/dataset/kitti_dataset.py#L666
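For readers unfamiliar with the "same coordinate frame" requirement, here is a minimal sketch of the usual KITTI convention, not the repository's code: lidar points are mapped into the camera frame, where KITTI labels are expressed, using the 3x4 `Tr_velo_to_cam` matrix and optionally the 3x3 `R0_rect` matrix from the calibration file. The function name `velo_to_cam` and the toy calibration matrix are assumptions for illustration only.

```python
import numpy as np


def velo_to_cam(points_xyz, tr_velo_to_cam, r0_rect=None):
    """Map (N, 3) lidar points into the camera frame.

    tr_velo_to_cam: (3, 4) rigid transform from the calibration file.
    r0_rect: optional (3, 3) rectifying rotation applied afterwards.
    """
    pts = np.asarray(points_xyz, dtype=np.float64)
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N, 4)
    cam = homogeneous @ np.asarray(tr_velo_to_cam, dtype=np.float64).T
    if r0_rect is not None:
        cam = cam @ np.asarray(r0_rect, dtype=np.float64).T
    return cam


# Toy calibration: a pure axis swap with no translation.
# Lidar: x forward, y left, z up. Camera: x right, y down, z forward.
toy_tr = np.array([
    [0.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])
print(velo_to_cam([[10.0, 2.0, 1.0]], toy_tr))
```

If a custom dataset already stores points and boxes in one shared frame, this step can reduce to an identity transform.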

The function linked above reads the point cloud file and applies a coordinate transformation so that the points are in the same coordinate system as the labels. Later, we read the label file: https://github.com/WeijingShi/Point-GNN/blob/48f3d79d5b101d3a4b8439ba74c92fcad4f7cab0/train.py#L81

and use the labels to annotate the points: https://github.com/WeijingShi/Point-GNN/blob/48f3d79d5b101d3a4b8439ba74c92fcad4f7cab0/train.py#L110
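The idea of annotating points from labels can be sketched roughly as follows. This is a simplified illustration and not the code at the link: it assumes KITTI-style boxes in the camera frame (location at the bottom center, y axis pointing down, rotation_y about the y axis, length along the box's local x axis) and simply gives each point the class id of the box that contains it. The names `points_in_kitti_box`, `label_points`, and the box dictionary layout are hypothetical.

```python
import numpy as np


def points_in_kitti_box(points_cam, h, w, l, x, y, z, yaw):
    """Boolean mask of (N, 3) camera-frame points inside one 3D box."""
    pts = np.asarray(points_cam, dtype=np.float64)
    dx, dy, dz = pts[:, 0] - x, pts[:, 1] - y, pts[:, 2] - z
    c, s = np.cos(yaw), np.sin(yaw)
    # Undo the yaw rotation to express offsets in the box's local frame.
    local_x = c * dx - s * dz
    local_z = s * dx + c * dz
    inside_xz = (np.abs(local_x) <= l / 2.0) & (np.abs(local_z) <= w / 2.0)
    inside_y = (dy <= 0.0) & (dy >= -h)  # box extends upward (negative y)
    return inside_xz & inside_y


def label_points(points_cam, boxes, background=0):
    """Assign each point the class id of a containing box, else background."""
    pts = np.asarray(points_cam, dtype=np.float64)
    labels = np.full(pts.shape[0], background, dtype=np.int64)
    for box in boxes:
        mask = points_in_kitti_box(pts, *box["hwlxyz_yaw"])
        labels[mask] = box["class_id"]
    return labels


# One toy box (h, w, l, x, y, z, yaw) and three toy points.
toy_boxes = [{"class_id": 1, "hwlxyz_yaw": (1.5, 1.6, 4.0, 0.0, 1.5, 10.0, 0.0)}]
toy_points = [[0.0, 1.0, 10.0], [1.9, 1.0, 10.0], [0.0, 1.0, 11.9]]
print(label_points(toy_points, toy_boxes))
```

The actual training code also encodes box regression targets per point, which this sketch leaves out.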

Hope it helps, Weijing

WeijingShi / Point-GNN

Training with custom dataset #56