Hi.
When I try to train a YOLOv9e model, the program terminates because it runs out of CUDA memory. It happens right when the first epoch starts.
I use 2 RTX2080 Ti: Ultralytics YOLOv8.1.23 🚀 Python-3.10.12 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11009MiB) CUDA:1 (NVIDIA GeForce RTX 2080 Ti, 11012MiB)
So I use 2 GPUs, a batch size of 16, and an imgsz of 640. Previously, we trained YOLOv8x models with a dataset of 500k images without any problems.
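For reference, this is roughly how the training is launched; the dataset YAML path and epoch count below are placeholders, not our exact values:

from ultralytics import YOLO

# Load the YOLOv9e pretrained weights
model = YOLO("yolov9e.pt")

# Train across both RTX 2080 Ti cards; batch=16 is the total batch,
# split over the two GPUs by the DDP run that Ultralytics spawns
results = model.train(
    data="data.yaml",   # placeholder dataset config
    imgsz=640,
    batch=16,
    device=[0, 1],      # both GPUs
    epochs=100,         # placeholder
)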
The traceback:

Traceback (most recent call last):
  File "/home/rrt/.config/Ultralytics/DDP/_temp_g6fcpx_t139978855048832.py", line 12, in <module>
    results = trainer.train()
  File "/home/rrt/.local/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 208, in train
    self._do_train(world_size)
  File "/home/rrt/.local/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 384, in _do_train
    self.scaler.scale(self.loss).backward()
  File "/home/rrt/.local/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/rrt/.local/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
[2024-03-10 10:01:19,658] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 314767 closing signal SIGTERM
[2024-03-10 10:01:19,872] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 314766) of binary: /usr/bin/python
Maybe also important: when I use the YOLOv9c model, everything works, with ~50% GPU memory usage.