Walter0807 / MotionBERT

[ICCV 2023] PyTorch Implementation of "MotionBERT: A Unified Perspective on Learning Human Motion Representations"
Apache License 2.0

Cuda Out of Memory and Minimum System Specification for running 3D human Pose Estimation #54

Closed · leilaUEA closed this issue 1 year ago

leilaUEA commented 1 year ago

Hi, I was trying to run the 3d pose estimation part in: https://github.com/Walter0807/MotionBERT/blob/main/docs/pose3d.md

on both my laptop and the university cluster, and I get a CUDA out of memory error. My questions are:

1) How can I solve the CUDA out of memory error?
2) What is the minimum system specification for 3D human pose estimation? (If you have tested it on a specific computer, could you share its specification? That would be great.)

Thank you very much in advance. I look forward to hearing from you.

Walter0807 commented 1 year ago

Hi, are you running training or inference? You could consider reducing the batch size.
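
For reference, a quick way to see how much GPU memory is total, allocated, and reserved (the same numbers reported in the OOM message) is a few standard PyTorch calls. This is a generic check, not part of this repo:

# Generic PyTorch memory check (not MotionBERT-specific): prints the totals
# that also appear in the "CUDA out of memory" message.
import torch

device = 0
total = torch.cuda.get_device_properties(device).total_memory
allocated = torch.cuda.memory_allocated(device)
reserved = torch.cuda.memory_reserved(device)
print(f"total:     {total / 1024**3:.2f} GiB")
print(f"allocated: {allocated / 1024**3:.2f} GiB")
print(f"reserved:  {reserved / 1024**3:.2f} GiB")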

leilaUEA commented 1 year ago

Thanks very much for your response. I was trying to run train.py according to the following instructions: https://github.com/Walter0807/MotionBERT/blob/main/docs/pose3d.md

starting from the following command: python train.py --config configs/pose3d/MB_train_h36m.yaml --checkpoint checkpoint/pose3d/MB_train_h36m

I changed "args.batch_size" from its default value of 32 to 16, 8, 4, and 2, and I am still getting the "CUDA out of memory" error:


batch size = 32

CUDA out of memory. Tried to allocate 518.00 MiB (GPU 0; 3.82 GiB total capacity; 2.75 GiB already allocated; 103.38 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


batch size = 16

CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 3.82 GiB total capacity; 2.60 GiB already allocated; 246.25 MiB free; 2.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


batch size = 8

CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 3.82 GiB total capacity; 2.62 GiB already allocated; 215.44 MiB free; 2.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


batch size = 4

CUDA out of memory. Tried to allocate 124.00 MiB (GPU 0; 3.82 GiB total capacity; 2.75 GiB already allocated; 24.19 MiB free; 2.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


batch size = 2

CUDA out of memory. Tried to allocate 62.00 MiB (GPU 0; 3.82 GiB total capacity; 2.58 GiB already allocated; 59.81 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


I was wondering whether this or similar programs would work if I changed the computer, or whether I need to change something else in the code. What GPU did you use to run the training? I would appreciate it if you could let me know.

Thank you very much in advance

Walter0807 commented 1 year ago

I think 3GB of CUDA memory might not be enough for training.

Best,
Wentao


leilaUEA commented 1 year ago

Thank you very much for your response.

I am now running on an RTX 6000 with batch_size=8 on the High Performance Computing (HPC) cluster, and so far it is running without errors. For the personal PC, we might be able to upgrade the GPU. Since we cannot rely on trial and error to find the required specification, do you know the minimum GPU specification that can run the training code? If the training can run on a PC with a better GPU, which GPU did you use when you ran the training?

Many thanks,
Leila

AlexAbades commented 1 year ago

@leilaUEA @Walter0807 I am facing the same issue. I have an HPC available to run the code, and I am just trying to run the training from scratch, but I always get the same error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 79.15 GiB total capacity; 75.23 GiB already allocated; 159.25 MiB free; 77.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Did you find which max_split_size_mb value avoids the fragmentation?

Thank you very much

Alex
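
For reference, the max_split_size_mb option mentioned in the error messages is passed to PyTorch through the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be set before CUDA is initialized. A minimal sketch follows (the value 128 is an arbitrary example, not a setting tested with MotionBERT):

# Minimal sketch: configure the CUDA caching allocator before the first
# CUDA allocation. The 128 MiB split size is an arbitrary example value,
# not a tuned setting for MotionBERT training.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator sees it

x = torch.zeros(1, device="cuda")  # first allocation uses the configured allocator

Equivalently, the variable can be exported in the shell before launching train.py.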