Hi.
When I try to train a YOLOv9e model, the program terminates because it runs out of CUDA memory. It happens right when the first epoch starts.
I use 2 RTX2080 Ti: Ultralytics YOLOv8.1.23 🚀 Python-3.10.12 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11009MiB) CUDA:1 (NVIDIA GeForce RTX 2080 Ti, 11012MiB)
So I use 2 GPUs, a batch size of 16, and an imgsz of 640. Previously, we trained YOLOv8x models with a dataset of 500k images without any problems.
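For reference, this is roughly how the training is launched; the dataset YAML path and epoch count below are placeholders, not our exact values:

from ultralytics import YOLO

# Load the YOLOv9e pretrained weights
model = YOLO("yolov9e.pt")

# Train across both RTX 2080 Ti cards; batch=16 is the total batch,
# split over the two GPUs by the DDP run that Ultralytics spawns
results = model.train(
    data="data.yaml",   # placeholder dataset config
    imgsz=640,
    batch=16,
    device=[0, 1],      # both GPUs
    epochs=100,         # placeholder
)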
The traceback:

Traceback (most recent call last):
  File "/home/rrt/.config/Ultralytics/DDP/_temp_g6fcpx_t139978855048832.py", line 12, in <module>
    results = trainer.train()
  File "/home/rrt/.local/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 208, in train
    self._do_train(world_size)
  File "/home/rrt/.local/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 384, in _do_train
    self.scaler.scale(self.loss).backward()
  File "/home/rrt/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/rrt/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
[2024-03-10 10:01:19,658] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 314767 closing signal SIGTERM
[2024-03-10 10:01:19,872] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 314766) of binary: /usr/bin/python
Maybe also important: when I use the YOLOv9c model, everything works, with ~50% GPU memory usage.