WYL-Projects opened this issue 3 months ago
Hi, I ran into the same issue. I'm training yolov10-n and the GPU memory grows as the training epochs increase, finally leading to a CUDA OOM error at around the 280th of the total 500 epochs. I already removed the training and validation samples that have too many (>=500) objects. Any chance you have already found a solution? Thanks!
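In case it helps, here is a minimal sketch of how that filtering can be done, assuming standard YOLO-format `.txt` labels (one object per line); the directory paths and the `.jpg` extension are placeholders for my layout, not part of the repo:

```python
# Minimal sketch: drop label files (and their images) that contain >= 500 objects.
# Assumes standard YOLO-format labels: one "class x y w h" line per object.
from pathlib import Path

MAX_OBJECTS = 500                             # threshold I used
LABEL_DIR = Path("datasets/train/labels")     # placeholder path, adjust to your layout
IMAGE_DIR = Path("datasets/train/images")     # placeholder path

for label_file in LABEL_DIR.glob("*.txt"):
    with open(label_file) as f:
        num_objects = sum(1 for line in f if line.strip())
    if num_objects >= MAX_OBJECTS:
        image_file = IMAGE_DIR / (label_file.stem + ".jpg")  # adjust extension if needed
        print(f"Removing {label_file.name} ({num_objects} objects)")
        label_file.unlink()
        if image_file.exists():
            image_file.unlink()
```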
Hi, did you manage to get any details on this problem? I have a similar problem, and I work with an NVIDIA RTX 4090.
Setting workers to 2 helps avoid CUDA out of memory in my case.
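For reference, this is roughly how it looks through the Python API; a sketch assuming the ultralytics-style training API this repo is built on, with model and dataset names as placeholders:

```python
# Minimal sketch: lowering the number of dataloader workers via the Python API.
from ultralytics import YOLO

model = YOLO("yolov10n.pt")      # placeholder checkpoint
model.train(
    data="data.yaml",            # placeholder dataset config
    epochs=500,
    batch=16,
    workers=2,                   # fewer dataloader workers -> lower memory pressure
    device=0,
)
```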
Dear author,
I have attempted to train yolov10-s on object detection data on NVIDIA GeForce RTX 3090 and NVIDIA A100-SXM4-40GB graphics cards, with only two categories, but the program occasionally gets an out-of-memory error that interrupts training. Training is not stable unless I reduce the batch size to 16; then it becomes more stable. I don't know why yolov10-s training consumes so much GPU memory, and it is hard to choose a batch size when the memory usage swings between high and low. The details are as follows:
yolo detect train data=moca.yaml model=runs/detect/train_v4/train/weights/last.pt resume=True batch=24 device=0 workers=8 cache=False
The GPU 0 card is an NVIDIA A100-SXM4-40GB with 40960MB of memory. GPU usage is unstable during my training: the first few epochs do not hit torch.cuda.OutOfMemoryError: CUDA out of memory, but after training for a while I may get that error. The error does not occur consistently; the following picture shows the training process. Also, I have already reduced the batch size from 32 to 24 and it still runs out of memory after a while. I'm curious why the yolov10-s model takes up so much memory. Looking forward to the author's reply, thanks!
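To see whether memory usage is actually creeping up between epochs, I have been logging peak GPU memory per epoch. A minimal sketch of this, assuming the ultralytics-style callback hooks this repo inherits (the checkpoint path and arguments just mirror my command above and are not prescriptive):

```python
# Minimal sketch: log per-epoch peak GPU memory to see whether usage grows over epochs.
import torch
from ultralytics import YOLO

def log_gpu_memory(trainer):
    # Peak memory allocated by tensors since the last reset, in GiB.
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"epoch {trainer.epoch}: peak GPU memory {peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats()

model = YOLO("runs/detect/train_v4/train/weights/last.pt")
model.add_callback("on_train_epoch_end", log_gpu_memory)
model.train(data="moca.yaml", resume=True, batch=16, device=0, workers=2, cache=False)
```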