OpenRobotLab / EmbodiedScan

[CVPR 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

Cannot train with 8 GPUs, but works with 4 GPUs #37

Closed iris0329 closed 3 months ago

iris0329 commented 3 months ago

Issue

Hi, thanks for your work!

When running the code with the command below, I hit a strange error: training fails with 8 GPUs even after reducing the batch size to 1 per GPU, but runs fine with 4 GPUs.

python -m torch.distributed.launch --nproc_per_node=4 --master_port=25622 tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --launcher='pytorch' --work-dir='logs/det'

The machine I use has A6000 GPUs with 48 GB of memory each.

Here are the logs:

04/12 14:46:55 - mmengine - INFO - Epoch(train)  [1][  50/1946]  lr: 1.0000e-03  eta: 21:08:51  time: 3.2672  data_time: 0.3987  memory: 10248  grad_norm: 0.9408  loss: 2.3556  loss_center: 0.6253  loss_bbox: 0.7773  loss_cls: 0.9531
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119878 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119879 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119880 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119881 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119882 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119883 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 119885 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 119877) of binary: /root/anaconda3/envs/embodiedscan/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/embodiedscan/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
tools/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-12_14:48:53
  host      : b70316d392fe
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 119877)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 119877
=======================================================

Environment

System environment:
    sys.platform: linux
    Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1791069987
    GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.8, V11.8.89
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    PyTorch: 1.11.0
    PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

    TorchVision: 0.12.0
    OpenCV: 4.9.0
    MMEngine: 0.10.3

Have you encountered similar errors, or could you give me some ideas about this one?

EricLee0224 commented 3 months ago

For training the 3D detection task, my environment is almost the same as yours (8 × A6000 with 48 GB each), and I didn't encounter the same error. However, for the 3D grounding task, I encountered the error "Plane vertices are not coplanar.", which is quite strange...

iris0329 commented 3 months ago

Hi @EricLee0224, I also encountered the "Plane vertices are not coplanar." error when running the grounding task, but that issue has already been discussed in this repo; you could check there. Could you please share your environment settings for the 3D detection task? I would appreciate it a lot!

EricLee0224 commented 3 months ago

> Hi @EricLee0224, I also encountered the "Plane vertices are not coplanar." error when running the grounding task, but that issue has already been discussed in this repo; you could check there. Could you please share your environment settings for the 3D detection task? I would appreciate it a lot!

1) I have indeed noticed the discussions in issues #30 and #22. They suggest adjusting the value of eps (I have tried 1e-5/1e-4/1e-3/1e-2, but it still doesn't work) and using the --resume parameter (the error message tells me that no available checkpoint is found), but neither is effective. Have you successfully resolved this issue? (See the coplanarity sketch after the log below for what I understand eps to control.)

2) Sure. I run:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet --launcher="pytorch"

and you can find the environment in the log:

2024/04/13 18:02:50 - mmengine - INFO -

System environment:
    sys.platform: linux
    Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 430976289
    GPU 0,1,2,3,4,5,6,7: NVIDIA RTX A6000
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.3, V11.3.58
    GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
    PyTorch: 1.11.0
    PyTorch compiling details: PyTorch built with:

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 430976289
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8
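
For reference, my understanding is that the eps tuned in those issues is the coplanarity tolerance used when computing 3D box overlap: the error fires when a face's fourth vertex is farther from the plane of the first three than eps allows. Below is a minimal sketch of that kind of check, purely illustrative and not the repo's (or pytorch3d's) actual code; the function name and shapes are my own.

```python
# Illustrative sketch only: a box face is given by 4 vertices; the check
# verifies the 4th vertex lies on the plane spanned by the first 3,
# within a tolerance `eps`.
import numpy as np

def face_is_coplanar(face: np.ndarray, eps: float = 1e-4) -> bool:
    """face: (4, 3) array holding the corners of one box face."""
    v0, v1, v2, v3 = face
    normal = np.cross(v1 - v0, v2 - v0)        # plane normal from three corners
    normal /= np.linalg.norm(normal) + 1e-12   # normalize (guard against zero length)
    deviation = abs(np.dot(v3 - v0, normal))   # distance of the 4th corner to the plane
    return deviation < eps

# Example: a planar unit square passes, a warped one does not.
square = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
print(face_is_coplanar(square))   # True
warped = square.copy()
warped[3, 2] = 0.05               # lift one corner off the plane
print(face_is_coplanar(warped))   # False
```

If the predicted boxes themselves are degenerate (near-zero extent), the face normal becomes ill-conditioned, so raising eps alone may not help, which could explain why 1e-5 through 1e-2 all fail for me.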

iris0329 commented 3 months ago

Thank you, @EricLee0224. I guess the reason I cannot run the 3D detection training is the RAM limitation on my side (500 GB). Could you please share your RAM size (the Mem part when using htop)? Ah, you have seen the issues; those are the ones I referred to. I am still trying to figure it out, and I will tell you if I have any ideas!
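
In case it is easier than watching htop, here is a minimal sketch of how I log host memory during training; it assumes psutil is installed and just prints the same figures as the Mem row in htop.

```python
# Minimal host-RAM logger (assumes `psutil` is installed); prints the same
# numbers as the Mem row in htop so growth during training is easy to spot.
import time
import psutil

def log_host_memory(interval_s: float = 30.0, iterations: int = 120) -> None:
    for _ in range(iterations):
        mem = psutil.virtual_memory()
        print(f"host RAM used: {mem.used / 2**30:.1f} GiB "
              f"/ {mem.total / 2**30:.1f} GiB ({mem.percent:.1f}%)")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_host_memory()
```

Running this in a separate terminal while training makes it easy to see whether RAM climbs with the number of ranks and dataloader workers.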

iris0329 commented 3 months ago

I figured it out; it is indeed due to the RAM limitation (500 GB).
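
For anyone who hits the same SIGKILL: host RAM usage grows roughly with the number of ranks times the number of dataloader workers, since each of the 8 processes spawns its own workers. One possible mitigation is lowering the worker count; the sketch below shows the kind of mmengine-style override I mean. The exact keys and default values in this repo's config are my assumption, so please double-check against the actual file.

```python
# Hypothetical override config (mmengine style) -- the exact keys and defaults
# in mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py may differ.
_base_ = ['./mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py']

# Each of the 8 ranks spawns its own dataloader workers, so total host RAM
# scales roughly with num_ranks * num_workers; trimming both reduces pressure.
train_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=False,
)
```

Alternatively, training on fewer GPUs per node, as in my 4-GPU run above, keeps the total worker count within the 500 GB limit.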