[Bug] ValueError: Plane vertices are not coplanar. (box3d_overlap)

mrsempress commented 6 months ago

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

System environment:
sys.platform: linux
Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1551893665
GPU 0,1: NVIDIA A100-SXM4-80GB
CUDA_HOME: /mnt/lustre/share/cuda-11.0
NVCC: Cuda compilation tools, release 11.0, V11.0.221
GCC: gcc (GCC) 5.4.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.3
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
CuDNN 8.3.2 (built against CUDA 11.5)
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSEKINETO -DUSE$ BGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unuse$ -parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostic$ -color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.$ , USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.9.0
MMEngine: 0.10.3

Reproduces the problem - code sample

In embodiedscan/structures/bbox_3d/euler_box3d.py#L134

_, iou3d = box3d_overlap(corners1, corners2, eps=eps)

Reproduces the problem - command or script

sh tools/mv-grounding.sh

Reproduces the problem - error message

04/01 21:32:20 - mmengine - INFO - Epoch(train)  [5][1300/2001]  base_lr: 5.0000e-04 lr: 5.0000e-04  eta: 10:10:27  time: 2.5333  data_time: 0.1731  memory: 29459  grad_norm: 35.8510  loss: 8.8971  loss_cls: 1.0159  loss_bbox: 0.4671  d0.loss_cls: 1.0471  d0.loss_bbox: 0.46$
5  d1.loss_cls: 1.0285  d1.loss_bbox: 0.4565  d2.loss_cls: 1.0182  d2.loss_bbox: 0.4596  d3.loss_cls: 1.0093  d3.loss_bbox: 0.4628  d4.loss_cls: 1.0077  d4.loss_bbox: 0.4638
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 666, in forward
    return self.loss(inputs, data_samples, **kwargs)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 507, in loss
    losses = self.bbox_head.loss(**head_inputs_dict,
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 637, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 668, in loss_by_feat
    losses_cls, losses_bbox = multi_apply(
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 711, in loss_by_feat_single
    cls_reg_targets = self.get_targets(cls_scores_list,
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 258, in get_targets
    pos_inds_list, neg_inds_list) = multi_apply(self._get_targets_single,
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 398, in _get_targets_single
    assign_result = self.assigner.assign(
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/task_modules/assigners/hungarian_assigner.py", line 113, in assign
    cost = match_cost(pred_instances=pred_instances_3d,
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/losses/match_cost.py", line 108, in __call__
    overlaps = pred_bboxes.overlaps(pred_bboxes, gt_bboxes)
  File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/structures/bbox_3d/euler_box3d.py", line 134, in overlaps
    _, iou3d = box3d_overlap(corners1, corners2, eps=eps)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 159, in box3d_overlap
    _check_coplanar(boxes1, eps)
  File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 66, in _check_coplanar
    raise ValueError(msg)
ValueError: Plane vertices are not coplanar
srun: error: SH-IDC1-10-140-24-25: task 1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=3342784.0
slurmstepd: error: *** STEP 3342784.0 ON SH-IDC1-10-140-24-25 CANCELLED AT 2024-04-01T21:32:52 ***

Additional information

In https://github.com/facebookresearch/pytorch3d/issues/992, they suggest increasing EPS. Can this problem occur under your default setting of 1e-4? If so, how should I adjust the EPS value? Also, this happened in my 5th epoch and does not occur deterministically. What is the reason for this?

mrsempress commented 6 months ago

But when I increase the value of eps, the error "Planes have zero areas" is reported instead. In other words, _check_coplanar() and _check_nonzero() conflict.
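The conflict described above can be illustrated with a minimal pure-Python sketch. It is a simplified re-implementation written for this thread, not PyTorch3D's actual code: it assumes the coplanarity check requires the fourth vertex of a face to lie within eps of the plane through the other three, and that the zero-area check requires a face triangle's area to be at least eps. The `face` coordinates and all helper names are hypothetical.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def unit(a):
    n = norm(a)
    return tuple(x / n for x in a)

def passes_coplanar(v0, v1, v2, v3, eps):
    # Assumed form of the check: distance of v3 from the plane
    # through v0, v1, v2 must be below eps.
    normal = unit(cross(unit(sub(v1, v0)), unit(sub(v2, v0))))
    return abs(dot(sub(v3, v0), normal)) < eps

def passes_nonzero(v0, v1, v2, eps):
    # Assumed form of the check: triangle area must be at least eps.
    area = norm(cross(sub(v1, v0), sub(v2, v0))) / 2
    return area >= eps

# Hypothetical face of a small, slightly warped predicted box:
# 3 cm sides, with the fourth vertex lifted 0.5 mm out of the plane.
face = ((0.0, 0.0, 0.0), (0.03, 0.0, 0.0),
        (0.03, 0.03, 0.0), (0.0, 0.03, 0.0005))

def check(eps):
    v0, v1, v2, v3 = face
    return (passes_coplanar(v0, v1, v2, v3, eps),
            passes_nonzero(v0, v1, v2, eps))

for eps in (1e-4, 1e-3):
    print(eps, check(eps))
```

For this face the out-of-plane deviation (5e-4) is larger than the triangle area (4.5e-4), so under these assumptions no single eps passes both checks: a small eps trips the coplanarity check and a large eps trips the zero-area check.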

Tai-Wang commented 6 months ago

Yes, that is the case. The two checks you mention place conflicting requirements on EPS. You may need to tune the optimizer settings to make training more stable. I notice you are using only two GPUs, so you may need to reduce the learning rate by a factor of 2 or 4 accordingly.
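As a rough sketch of that suggestion, the snippet below applies a linear scaling rule to the base_lr of 5.0e-4 shown in the training log. The reference GPU counts (4 and 8) are assumptions chosen only to reproduce the factors of 2 and 4 mentioned above; the thread does not state how many GPUs the default config was tuned for, and `scaled_lr` is a hypothetical helper.

```python
def scaled_lr(base_lr: float, num_gpus: int, reference_gpus: int) -> float:
    """Scale the learning rate linearly with the number of GPUs in use."""
    return base_lr * num_gpus / reference_gpus

base_lr = 5.0e-4   # value reported in the training log above
num_gpus = 2       # GPUs used in this issue

# Assumed reference setups; 4 and 8 GPUs give reduction factors of 2 and 4.
candidates = {ref: scaled_lr(base_lr, num_gpus, ref) for ref in (4, 8)}

for ref, lr in candidates.items():
    print(f"reference_gpus={ref}: lr={lr:.2e}")
```

The resulting value would replace the learning rate wherever the training config sets it for the optimizer.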

Tai-Wang commented 6 months ago

Please see #22 for more explanations about this bug.

mrsempress commented 6 months ago

Thanks, I will try it again.

EricLee0224 commented 5 months ago

Hi @mrsempress, did you solve this issue?

mrsempress commented 5 months ago

@EricLee0224 No, I have not solved this issue. I just kept the original eps value and retrained from the beginning. Sometimes training runs to completion.

Tai-Wang commented 5 months ago

Hi all, we have just sorted out the occupancy prediction baseline recently. While open-sourcing those parts, we will have a closer look at this problem, particularly for the visual grounding baseline. We will try to address it in two weeks.

iris0329 commented 5 months ago

Hi, I solved this issue in https://github.com/OpenRobotLab/EmbodiedScan/issues/40#issuecomment-2058598322; please take a look. I am now using this strategy to train the model, and it has run successfully for 100 iterations.
