OpenRobotLab / EmbodiedScan

[CVPR 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

[Bug] Low reproducibility? Limit gpus? #32

Closed: mrsempress closed this issue 3 months ago

mrsempress commented 3 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

System environment:

- sys.platform: linux
- Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
- CUDA available: True
- MUSA available: False
- numpy_random_seed: 1551893665
- GPU 0,1: NVIDIA A100-SXM4-80GB
- CUDA_HOME: /mnt/lustre/share/cuda-11.0
- NVCC: Cuda compilation tools, release 11.0, V11.0.221
- GCC: gcc (GCC) 5.4.0
- PyTorch: 1.12.1

PyTorch compiling details: PyTorch built with:

- GCC 9.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS=-fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF

- TorchVision: 0.13.1
- OpenCV: 4.9.0
- MMEngine: 0.10.3

Reproduces the problem - code sample

N/A

Reproduces the problem - command or script

sh tools/mv-grounding.sh

Reproduces the problem - error message

The reproduced results are:

AP25:

| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.2093 | 0.1840 | 0.1966 | 0.2129 | 0.0000 | 0.2073 | 0.2073 |

AP50:

| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.0535 | 0.0452 | 0.0581 | 0.0501 | 0.0000 | 0.0528 | 0.0528 |

But the results in the paper are:

AP25:

| Type | Easy | Hard | View-Dep | View-Indep | Overall |
| --- | --- | --- | --- | --- | --- |
| results | 0.2711 | 0.2012 | 0.2342 | 0.2637 | 0.2572 |

In addition, training can only be completed when the number of GPUs is 8. When the number of GPUs is 2 or 4, the error from issue #30 sometimes occurs, and the error from issue #26 sometimes occurs.

Additional information

  1. Is there a limit on the number of GPUs, or is the problem random and it just happens to run through when gpu=8?
  2. Were the visual grounding results reported in the paper obtained with the default config in tools/mv_grounding.sh, or did you add fcaf_coder or modify other parameters?
Tai-Wang commented 3 months ago

You need to reduce the learning rate by a factor of 2 or 4 accordingly, because your actual batch size is only 1/4 of that in our experiments. It should yield a comparable result once you adjust the optimizer setting, although we have not tried this ourselves.
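For concreteness, a minimal sketch of that adjustment as an MMEngine-style override config, assuming the released schedule was tuned for 8 GPUs; the base config path and learning-rate value below are illustrative placeholders, not the repo's actual settings:

```python
# Linear scaling rule: the effective batch size shrinks by
# num_gpus / default_num_gpus, so scale the learning rate by the same factor.
_base_ = ['./mv-grounding.py']  # placeholder path to the default config

num_gpus = 2          # GPUs you actually train on
default_num_gpus = 8  # GPUs the released schedule assumes
base_lr = 5e-4        # illustrative value; use the LR from the base config

optim_wrapper = dict(
    optimizer=dict(lr=base_lr * num_gpus / default_num_gpus),
)
```

If the config defines an `auto_scale_lr` field with a `base_batch_size`, MMEngine can apply this scaling automatically (for example via a `--auto-scale-lr` training flag); otherwise a manual override like the one above is the safe route.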

mrsempress commented 3 months ago

When I reproduced it, it was still with 8 GPUs, as written in your mv_grounding.sh, but the result was not good. When the number of GPUs changes, an error message appears and the run does not complete.

Tai-Wang commented 3 months ago

Did you remove the pretrained checkpoint from the config? I find your result is lower than the performance we report here. You could first reproduce the performance reported in our repo, because we re-split the training/val/test sets for the challenge, as explained here.

mrsempress commented 3 months ago

I removed the pretrained checkpoint from the config because I didn't know the pretrained weights were necessary, and I didn't see what role the detection branch plays for the visual grounding branch in the pipeline. I will fetch the pretrained weights and redo the visual grounding task. Thank you for your reply.

Tai-Wang commented 3 months ago

OK. We found loading the pretrained detection checkpoint to be a helpful trick, as mentioned in BUTD-DETR. We look forward to your further feedback.

ZCMax commented 3 months ago

> I removed the pretrained checkpoint from the config because I didn't know the pretrained weights were necessary, and I didn't see what role the detection branch plays for the visual grounding branch in the pipeline. I will fetch the pretrained weights and redo the visual grounding task. Thank you for your reply.

Since the feature extraction pipeline can be shared by the detection and visual grounding tasks, we can use the 3D detection pretrained checkpoint for weight initialization. It helps grounding performance and accelerates training convergence to some extent.
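In MMEngine-based configs this kind of initialization is usually a single `load_from` line; a minimal sketch, with a placeholder checkpoint path:

```python
# Initialize the grounding model from a 3D detection checkpoint.
# `load_from` loads weights non-strictly: parameters shared with the
# detector (e.g. the feature extraction backbone) are restored, while
# grounding-only modules (e.g. the text branch) keep their fresh
# initialization and are only reported as missing keys in the log.
load_from = 'work_dirs/mv-3ddet/epoch_12.pth'  # placeholder path
```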

Tai-Wang commented 3 months ago

Closing due to inactivity. Please feel free to reopen this issue if you have any further questions.

mrsempress commented 3 months ago

After loading your checkpoint, the performance exceeded what the paper reported (+7.95%).

The results in the paper are:

AP25:

| Type | Easy | Hard | View-Dep | View-Indep | Overall |
| --- | --- | --- | --- | --- | --- |
| results | 0.2711 | 0.2012 | 0.2342 | 0.2637 | 0.2572 |

The reproduced results (with your checkpoint loaded) are:

AP25:

| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.3489 | 0.3018 | 0.3567 | 0.3277 | 0.0000 | 0.3377 | 0.3377 |

AP50:

| Type | Easy | Hard | View-Dep | View-Indep | Unique | Multi | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results | 0.1168 | 0.0925 | 0.1127 | 0.1159 | 0.0000 | 0.1148 | 0.1148 |
  1. Another question: why is the Overall result exactly the same as the Multi result?
  2. In addition, you mentioned that using the detection checkpoint is important; in my experiment it brought a 13.04% gain. If the grounding checkpoint is in turn used to initialize detection, will there be an improvement? If we keep looping this initialization, can we get better results?
ZCMax commented 3 months ago

> 1. Another question: why is the Overall result exactly the same as the Multi result?
>
> 2. If the grounding checkpoint is in turn used to initialize detection, will there be an improvement? If we keep looping this initialization, can we get better results?
  1. Since all the prompts belong to the "multiple" type, the Overall performance is exactly the same as Multi.
  2. Actually, a further exploration could be joint grounding and detection training, as illustrated in BUTD-DETR, which reformulates the detection task as category-prompt grounding (see the toy sketch below). It may boost both the detection and grounding performance at the same time.
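As a toy illustration of that reformulation (not code from this repo): the detector's category list is concatenated into one text prompt, and each class's character span in the prompt becomes the phrase the grounding model must localize, so detection and grounding share one training objective. The class names and separator below are arbitrary:

```python
# Toy sketch of detection-as-grounding (BUTD-DETR style): build a category
# prompt and record each class's character span so box-to-text alignment
# can be supervised exactly like free-form grounding.
CLASSES = ['bed', 'chair', 'table']  # arbitrary subset of categories


def build_category_prompt(classes):
    """Return the prompt string and each class's (start, end) span in it."""
    spans, pieces, cursor = {}, [], 0
    for name in classes:
        pieces.append(name)
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len('. ')  # advance past the separator
    return '. '.join(pieces) + '.', spans


prompt, spans = build_category_prompt(CLASSES)
print(prompt)          # bed. chair. table.
print(spans['chair'])  # (5, 10)
```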