OpenRobotLab / EmbodiedScan

[CVPR 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
https://tai-wang.github.io/embodiedscan/
Apache License 2.0

[Bug] Why does mvdet use so much RAM during evaluation? It shoots above 500+ GB and causes the program to crash #29

Closed: yxchng closed this issue 3 weeks ago

yxchng commented 3 months ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA H100 80GB HBM3
CUDA_HOME: /fs/applications/cuda/12.1.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu121
OpenCV: 4.9.0
MMEngine: 0.10.3
MMDetection: 3.3.0
MMDetection3D: 1.4.0+
spconv2.0: False

Reproduces the problem - code sample

-

Reproduces the problem - command or script

python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py

Reproduces the problem - error message

The job scheduler indicates TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.

Additional information

Evaluation shouldn't use that much memory; 500+ GB is crazy!

mxh1999 commented 3 months ago

Can you please describe the problem in more detail? Do you mean CUDA OOM?

yxchng commented 3 months ago

Not CUDA memory. It is RAM usage that goes above 500 GB. (The process uses a lot of main memory, more than 500 GB, not GPU memory.)

mxh1999 commented 3 months ago

That's interesting; the entire EmbodiedScan dataset only takes up about 300 GB. Are you sure there are no other programs taking up RAM?

yxchng commented 3 months ago

@mxh1999 The model itself has to use some RAM too. I am quite sure your machine has more than 500 GB of memory, so you may not have paid attention to memory usage. Could you check the peak memory usage? It shoots up to more than 500 GB during the evaluation phase.
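
For reference, a minimal sketch for logging the peak RSS of the current process (assuming Linux, where ru_maxrss from the standard resource module is reported in kilobytes; log_peak_rss is a hypothetical helper and the call placement is only illustrative):

import resource

def log_peak_rss(tag):
    """Print the peak resident set size (RSS) of the current process.

    On Linux, ru_maxrss is reported in kilobytes. With DDP, call this in
    every rank, since each process holds its own copy of the dataset.
    """
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
    print(f'[{tag}] peak RSS: {peak_gb:.1f} GB')

# e.g. log_peak_rss('before eval') / log_peak_rss('after eval') around the
# evaluation loop to see where the memory jumps.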

Tai-Wang commented 3 months ago

I have rarely encountered such cases myself, but we will take a closer look at this issue. In the meantime, we welcome more cues/information about this problem to help us locate it more quickly.

iris0329 commented 3 months ago

I met this RAM issue as well when using 8 GPUs with DDP to train the 3D detector.

It costs more than 500 GB of RAM; 500 GB is my RAM limit, so the DDP processes fail. When training with 4 GPUs and a 4x4 batch size, it costs nearly 400 GB of RAM.

I am not sure what the situation is when launching with Slurm.

I am looking forward to any progress on this issue.

henryzhengr commented 3 months ago

> I have rarely encountered such cases myself, but we will take a closer look at this issue. In the meantime, we welcome more cues/information about this problem to help us locate it more quickly.

The issue seems to be that the MMEngine dataset saves a copy of data_list per GPU rank in RAM during dataset initialization. A quick patch was to use shared memory for this data info list. However, data-loading time would be affected by this. Looking forward to a better fix.
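
As a rough way to confirm this, a sketch that prints the size of the serialized annotation buffer on each rank (report_serialized_size is a hypothetical helper; it assumes an MMEngine BaseDataset built with serialize_data=True, which stores the pickled annotations in data_bytes / data_address):

from mmengine import dist

def report_serialized_size(dataset):
    """Print how much RAM this rank's serialized data_list occupies.

    Every DDP rank keeps its own copy of data_bytes, so total usage
    scales linearly with the number of GPU processes.
    """
    size_gb = dataset.data_bytes.nbytes / 1024 ** 3
    print(f'rank {dist.get_rank()}: serialized data_list ~ {size_gb:.2f} GB')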

Outlying3720 commented 2 months ago

I have met the same problem too. When I train the model on a server with 700 GB of memory, everything is fine. When I move to a server with 200 GB of memory, it always triggers a kernel OOM kill before finishing epoch 1.

Outlying3720 commented 2 months ago

> I have rarely encountered such cases myself, but we will take a closer look at this issue. In the meantime, we welcome more cues/information about this problem to help us locate it more quickly.
>
> The issue seems to be that the MMEngine dataset saves a copy of data_list per GPU rank in RAM during dataset initialization. A quick patch was to use shared memory for this data info list. However, data-loading time would be affected by this. Looking forward to a better fix.

@henryzhengr Hi, could you share your solution code? That would help me a lot. Thank you.

henryzhengr commented 1 month ago

Solution

For a quick solution, just replace the following lines with the code below: https://github.com/OpenRobotLab/EmbodiedScan/blob/67110231a8759009ca822ff3f2b3ed577674903b/embodiedscan/datasets/embodiedscan_dataset.py#L59-L64

Make sure to install SharedArray in your environment (pip install SharedArray).

Code:

# Add these imports near the top of embodiedscan_dataset.py:
import os

import SharedArray  # pip install SharedArray
from mmengine import dist

# In EmbodiedScanDataset.__init__, replace the original super().__init__ call.
# serialize_data is turned off here so that BaseDataset does not build a
# per-rank copy; the shared copy is set up right afterwards.
    super().__init__(ann_file=ann_file,
                     metainfo=metainfo,
                     data_root=data_root,
                     pipeline=pipeline,
                     test_mode=test_mode,
                     serialize_data=False,
                     **kwargs)
    self.share_serialize_data()

# New method on EmbodiedScanDataset:
def share_serialize_data(self):
    cur_rank = dist.get_rank()
    if cur_rank == 0:
        if os.path.exists('/dev/shm/embodiedscan_data_bytes'):
            # Segments left over from a previous run: just reuse them.
            print('Rank 0 attaching to existing shared data')
            self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
            self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
        else:
            # Serialize the annotations once and copy them to shared memory.
            self.data_bytes, self.data_address = self._serialize_data()
            print('Loading training data to shared memory (file limit not set)')
            data_bytes_shm_arr = SharedArray.create(
                'shm://embodiedscan_data_bytes',
                self.data_bytes.shape, dtype=self.data_bytes.dtype)
            data_bytes_shm_arr[...] = self.data_bytes[...]
            data_bytes_shm_arr.flags.writeable = False

            data_address_shm_arr = SharedArray.create(
                'shm://embodiedscan_data_address',
                self.data_address.shape, dtype=self.data_address.dtype)
            data_address_shm_arr[...] = self.data_address[...]
            data_address_shm_arr.flags.writeable = False
            print('Training data list has been saved to shared memory')
        dist.barrier()
    else:
        # Other ranks wait for rank 0 to finish, then attach to the shared
        # segments instead of building their own copy of the data list.
        dist.barrier()
        print(f'Reading training data from shm. rank {cur_rank}')
        self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
        self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
        print(f'Done reading training data. rank {cur_rank}')

    # Let BaseDataset index samples through data_bytes/data_address again.
    self.serialize_data = True
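
As a quick sanity check that the segments were actually created (a sketch assuming the SharedArray package from PyPI, which backs each array with a file under /dev/shm):

import os

import SharedArray

# List all shared arrays currently registered by SharedArray on this machine.
for entry in SharedArray.list():
    print(entry)

# The two segments created above should also appear as files under /dev/shm.
print(os.path.exists('/dev/shm/embodiedscan_data_bytes'))
print(os.path.exists('/dev/shm/embodiedscan_data_address'))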

Advantage

In the original code, the more GPUs you use, the more RAM the program consumes, because every rank keeps its own copy of the serialized data list. This patch therefore saves RAM for distributed training on multiple GPUs, but not for single-GPU training.

Disadvantages and things to take note of

  1. Cleanup: Make sure to unlink the shared arrays upon program termination to avoid leaking shared memory (see the cleanup sketch after this list).
  2. Data-loading bottlenecks: In some cases I have encountered slowdowns in data loading, though I am unsure of the cause yet; restarting the program solves the issue for me.
  3. Timeout issue: Currently, only the rank 0 process serializes and transfers the data to shared memory. This sometimes results in exceptions when the other ranks wait longer than the predefined timeout. A more efficient approach might be to distribute the data processing across all GPUs, allowing each rank to handle and transfer a portion of the data independently. (Or, as a lazy solution, just set a longer timeout. :-) )
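
For point 1, a minimal cleanup sketch (cleanup_shared_data is a hypothetical helper; it assumes the same shm names as in the patch above, and SharedArray.delete only unlinks the segment, so the memory is released once every process has detached):

import SharedArray
from mmengine import dist

def cleanup_shared_data():
    """Unlink the shared-memory segments created by share_serialize_data().

    Call this once at the end of training (e.g. from an atexit handler on
    rank 0); otherwise the segments stay in /dev/shm until reboot.
    """
    if dist.get_rank() != 0:
        return
    for name in ('shm://embodiedscan_data_bytes',
                 'shm://embodiedscan_data_address'):
        try:
            SharedArray.delete(name)
        except OSError:
            pass  # segment already removed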

Tai-Wang commented 3 weeks ago

Thanks to @henryzhengr for providing a workaround solution. I will close this issue for now and welcome further discussion if there are new observations.