Cude memory error while Training 3D pose

Hello,

I am trying to replicate the results from Pose3D. I am running the code in an high performace computer (HPC), where I have multiple nodes with high RAM avaliable. My sh configuration is:

#!/bin/sh
#BSUB -q gpua100
#BSUB -J motionBERT
#BSUB -W 23:00
#BSUB -B
#BSUB -N
#BSUB -gpu "num=1:mode=exclusive_process"
#BSUB -n 4
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=4GB]"
#BSUB -o logs/%J.out
#BSUB -e logs/%J.err

module load cuda/11.6
module load gcc/10.3.0-binutils-2.36.1
source /zhome/c0/a/164613/miniconda3/etc/profile.d/conda.sh
conda activate /work3/s212784/conda/env/motionbert
python train.py --config configs/pose3d/MB_train_h36m.yaml --checkpoint checkpoint/pose3d/MB_train_h36m

Which gives me the following resources:

CPU time :                                   29.94 sec.
Max Memory :                              80 MB
Average Memory :                       80.00 MB
Total Requested Memory :           16384.00 MB
Delta Memory :                            16304.00 MB
Max Swap :                                  16 MB
Max Processes :                          4
Max Threads :                              5
Run time :                                    152 sec.
Turnaround time :                        119 sec.

The output error that I am facing is the following:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 79.15 GiB total capacity; 75.23 GiB already allocated; 159.25 MiB free; 77.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried to change the max_split_size_mb with the following line in the jobscript.sh (before python run.. ) : export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb=512"

Although I am reaching the same problem, I can see that the memory usage, average and mean are increasing: Resource usage summary:

CPU time :                                  30.63 sec.
Max Memory :                            2057 MB
Average Memory :                      1391.67 MB
Total Requested Memory :         16384.00 MB
Delta Memory :                           14327.00 MB
Max Swap :                                16 MB
Max Processes :                         4
Max Threads :                            8
Run time :                                   145 sec.
Turnaround time :                       122 sec.

Nevertheless, when increasing the value too much it drops again... export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb=1024" Resource usage summary:

CPU time :                                  30.63 sec.
Max Memory :                            2057 MB
Average Memory :                      1391.67 MB
Total Requested Memory :         16384.00 MB
Delta Memory :                           14327.00 MB
Max Swap :                                16 MB
Max Processes :                         4
Max Threads :                            8
Run time :                                   145 sec.
Turnaround time :                       122 sec.

Therefore, I am not sure if this might be the problem...

Aditionally, (I don't know if relevant) before changing the max_split_size_mb, I thought that it was a problem from resources in the HPC so tried to run it multiple times. One thing that I notieced it's that the allocated memory was not always the same:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 518.00 MiB (GPU 0; 15.77 GiB total capacity; 13.82 GiB already allocated; 275.06 MiB free; 14.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I am in the right path? Is the fragmentation causing the problem? Is there a way to know the exact number it should be there?

Thank you very much,

Best Regards,

Alex Abades

Walter0807 / MotionBERT

Cude memory error while Training 3D pose #98