OpenRobotLab / EmbodiedScan

[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

[Bug] UPD - ValueError: Plane vertices are not coplanar. #40

Closed EricLee0224 closed 5 months ago

EricLee0224 commented 5 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

System environment: sys.platform: linux Python: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0] CUDA available: True MUSA available: False numpy_random_seed: 545726448 GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000 CUDA_HOME: /usr/local/cuda NVCC: Cuda compilation tools, release 11.3, V11.3.58 GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 PyTorch: 1.11.0 PyTorch compiling details: PyTorch built with:

Runtime environment: cudnn_benchmark: False mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0} dist_cfg: {'backend': 'nccl'} seed: 545726448 Distributed launcher: pytorch Distributed training: True GPU number: 8

Reproduces the problem - code sample

-

Reproduces the problem - command or script

3D mv-Det: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet --launcher="pytorch"

3D mv-VG: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py --work-dir=work_dirs/mv-3dground --launcher="pytorch"

Reproduces the problem - error message

04/15 13:56:37 - mmengine - INFO - Checkpoints will be saved to /data/zyp/code/EmbodiedScan/work_dirs/mv-3dground.

/data/zyp/code/EmbodiedScan/embodiedscan/models/layers/fusion_layers/point_fusion.py:48: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
pcd_rotate_mat = (torch.tensor(img_meta['pcd_rotation'],
/data/zyp/code/EmbodiedScan/embodiedscan/models/layers/fusion_layers/point_fusion.py:48: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
pcd_rotate_mat = (torch.tensor(img_meta['pcd_rotation'],
/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py:524: UserWarning: position encoding of key ismissing in MultiheadAttention.
warnings.warn(f'position encoding of key is'
Traceback (most recent call last):
  File "tools/train.py", line 133, in <module>
    main()
  File "tools/train.py", line 129, in main
    runner.train()
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
    model = self.train_loop.run()  # type: ignore
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
    self.run_iter(idx, data_batch)
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
    outputs = self.runner.model.train_step(
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 666, in forward
    return self.loss(inputs, data_samples, **kwargs)
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 507, in loss
    losses = self.bbox_head.loss(**head_inputs_dict,
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 637, in loss
    losses = self.loss_by_feat(*loss_inputs)
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 668, in loss_by_feat
    losses_cls, losses_bbox = multi_apply(
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 711, in loss_by_feat_single
    cls_reg_targets = self.get_targets(cls_scores_list,
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 258, in get_targets
    pos_inds_list, neg_inds_list) = multi_apply(self._get_targets_single,
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 398, in _get_targets_single
    assign_result = self.assigner.assign(
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/task_modules/assigners/hungarian_assigner.py", line 113, in assign
    cost = match_cost(pred_instances=pred_instances_3d,
  File "/data/zyp/code/EmbodiedScan/embodiedscan/models/losses/match_cost.py", line 108, in __call__
    overlaps = pred_bboxes.overlaps(pred_bboxes, gt_bboxes)
  File "/data/zyp/code/EmbodiedScan/embodiedscan/structures/bbox_3d/euler_box3d.py", line 134, in overlaps
    _, iou3d = box3d_overlap(corners1, corners2, eps=eps)
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 160, in box3d_overlap
    _check_coplanar(boxes2, eps)
  File "/data/zyp/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/pytorch3d/ops/iou_box3d.py", line 66, in _check_coplanar
    raise ValueError(msg)
ValueError: Plane vertices are not coplanar

Additional information

I can run the 3D mv-det task very smoothly in both training and testing. However, when I run the 3D mv-VG task in the same environment with 8*A6000 (48G), it always encounters a ValueError: Plane vertices are not coplanar in the first epoch.

I have checked the related issues #22, #32, #30, facebookresearch/pytorch3d/issues/992, and facebookresearch/pytorch3d/issues/1771.

I have also tried the following solutions:

  1. Modifying eps in box3d_overlap with values like 1e-2, 1e-3, 1e-4, and 1e-5.
  2. Changing the learning rate (lr) in the training script to values like 5e-2 and 5e-4.
  3. Training both with and without the detection checkpoint.
  4. Using 2xA6000, 4xA6000, and 8xA6000.
  5. Using --resume and --resume auto.

However, none of these solutions have worked so far. Could anyone please share how to solve this issue or provide a successful environment setup? Will the team look into this matter? Many thanks.

mxh1999 commented 5 months ago

I completely understand your frustration with this situation.

Based on my understanding, this issue often arises when one of the predicted boxes has a side length that is too short.

Here are some possible solutions:

  1. Adjust the 'val_interval' in the configuration so the model is not evaluated while it is still unstable.
  2. Clamp or enlarge any box whose side length falls below a certain threshold during evaluation (see the sketch below).
  3. Simply remove the '_check_coplanar' and '_check_nonzero' checks.

I hope this helps!
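As a rough illustration of option 2, here is a minimal sketch (not code from this repository) that clamps degenerate box dimensions before the corners are passed to box3d_overlap. The MIN_SIZE value, the clamp_small_boxes helper, and the assumed (x, y, z, dx, dy, dz, alpha, beta, gamma) box layout are all assumptions for illustration:

import torch

# Hypothetical minimum side length; tune for your data. Not part of EmbodiedScan.
MIN_SIZE = 1e-2

def clamp_small_boxes(bboxes_9dof: torch.Tensor) -> torch.Tensor:
    """Clamp the (dx, dy, dz) columns of 9-DoF boxes assumed to be laid out as
    (x, y, z, dx, dy, dz, alpha, beta, gamma), so that no side is shorter than
    MIN_SIZE. This keeps near-degenerate boxes from tripping pytorch3d's
    _check_coplanar / _check_nonzero checks."""
    bboxes = bboxes_9dof.clone()
    bboxes[:, 3:6] = bboxes[:, 3:6].clamp(min=MIN_SIZE)
    return bboxes

Something along these lines would have to be applied to the predicted (and possibly ground-truth) boxes right before the overlaps/IoU computation used in evaluation or assignment.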

iris0329 commented 5 months ago

I met the same issue when training 3D-VG on A6000s; it occurs while computing the cost for the Hungarian assignment during training.

I think it's inappropriate to simply modify eps in box3d_overlap, because in the '_check_coplanar' and '_check_nonzero' checks, eps limits:

  1. the maximum of the dot product of the edge vector with the surface normal vector
  2. the minimum area of each face of the box

respectively.

So the following settings fix my issue:

# file: 
# site-packages/pytorch3d/ops/iou_box3d.py

_check_coplanar(boxes1, 1e-2)
_check_coplanar(boxes2, 1e-2)
_check_nonzero(boxes1, 1e-4)
_check_nonzero(boxes2, 1e-4)

Basically, this is similar to the idea of simply removing the '_check_coplanar' and '_check_nonzero' checks suggested above, but I'm not sure whether relaxing these limits has any serious consequences.
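If editing the installed pytorch3d file directly is undesirable, the same relaxation can probably be applied from your own code by patching the module-level check functions before training starts. This is only a sketch; it assumes box3d_overlap resolves _check_coplanar/_check_nonzero from pytorch3d.ops.iou_box3d at call time (the functions themselves do appear in the traceback above):

import pytorch3d.ops.iou_box3d as iou_box3d

# Keep references to the original checks and re-call them with looser
# thresholds (1e-2 for coplanarity, 1e-4 for minimum face area), mirroring
# the in-place edit shown above. Run this once before training/evaluation.
_orig_check_coplanar = iou_box3d._check_coplanar
_orig_check_nonzero = iou_box3d._check_nonzero

iou_box3d._check_coplanar = lambda boxes, eps=1e-4: _orig_check_coplanar(boxes, 1e-2)
iou_box3d._check_nonzero = lambda boxes, eps=1e-4: _orig_check_nonzero(boxes, 1e-4)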

EricLee0224 commented 5 months ago

@mxh1999 @iris0329 Thanks for your further solutions! I will try them and update this issue if there is any progress.

EricLee0224 commented 5 months ago

> So the following settings fix my issue:
>
> # file:
> # site-packages/pytorch3d/ops/iou_box3d.py
>
> _check_coplanar(boxes1, 1e-2)
> _check_coplanar(boxes2, 1e-2)
> _check_nonzero(boxes1, 1e-4)
> _check_nonzero(boxes2, 1e-4)

NB: it really works. Even with these checks relaxed, the val/test results are not significantly different from the official reports. This seems to be an acceptable solution.

iris0329 commented 5 months ago

Hi @EricLee0224, can you share your reproduced results? The val/test results I got are lower than the official ones.

EricLee0224 commented 5 months ago

Sure. I think minor fluctuations in the results of the 3D-VG task are normal. You can run multiple experiments to observe the spread, but I feel these fluctuations are negligible and not worth focusing on. Since the baseline values are not very high, it implies that we need to explore better model approaches to significantly improve performance (maybe +5~10 points?).

(screenshot of reproduced results)

iris0329 commented 5 months ago

I see, thank you! I got similar results.