AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

CUDA out of memory error when training with -map (without it, it's fine) #8623

Closed holger-prause closed 2 years ago

holger-prause commented 2 years ago

Hello, I am training a model with yolov4-tiny-3l.cfg.

When measuring GPU consumption, nvidia-smi reports: 16681MiB / 24268MiB

So the training set fits into memory with room to spare. However, I get "CUDA Error: an illegal memory access was encountered: File exists" as soon as the mAP measurement starts.

I thought the memory would be cleared before the mAP measurement is performed, but that does not seem to be the case, or maybe something else is wrong (CUDA / cuDNN / OpenCV bug?). I built Darknet on Linux with OpenCV, CUDA, and cuDNN.
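For reference, this is roughly how I built it (the flag names are darknet's standard Makefile options; the exact ARCH line for the RTX 3090 is my assumption):

```
# Build on Linux with CUDA, cuDNN and OpenCV enabled:
make -j8 GPU=1 CUDNN=1 OPENCV=1
# Makefile ARCH entry for the RTX 3090 (compute capability 8.6):
#   ARCH= -gencode arch=compute_86,code=[sm_86,compute_86]
```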

Is this behaviour normal, or should I try updating cuDNN / CUDA or compiling OpenCV without CUDA support? For now all I can do is decrease the batch size, but then memory usage is only around 8000 MiB, which is a waste.

Please help me understand this problem a bit better.

awaisbajwaml commented 2 years ago

What are your batch size and subdivisions?

In my experience, sometimes the error goes away when I rebuild.

I agree: you have enough VRAM, so you should not be limited to only 8 GiB; that is a waste of resources.

holger-prause commented 2 years ago

Hello, I have not received many answers on this topic, so I want to try one more time, as this is somewhat critical for me. To simplify things, let me give an example.

Note: I am using the latest Darknet/YOLO version and 2 RTX 3090 GPUs for multi-GPU training.

When I run a regular training (no -map flag) with the cfg settings batch=64 and subdivisions=2:

The training works fine and nvidia-smi reports a memory usage of 16681MiB / 24268MiB.

As soon as I start training with the -map flag, I get "CUDA Error: an illegal memory access was encountered: File exists" when the mAP measurement starts.

When I change my cfg settings to batch=64 and subdivisions=4:

nvidia-smi reports: 8192MiB / 24268MiB

And there is no error during the mAP calculation. It's just that I waste a lot of GPU memory during training.
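To make the two runs concrete, the relevant cfg lines and launch command look roughly like this (the data/cfg paths and the pretrained weights file are placeholders, not my actual files):

```
# yolov4-tiny-3l.cfg, [net] section -- run that crashes during mAP:
#   batch=64
#   subdivisions=2     # mini_batch = batch/subdivisions = 32 images per step
# run that works (but only uses ~8 GiB):
#   batch=64
#   subdivisions=4     # mini_batch = 16

# Multi-GPU training with periodic mAP calculation on the validation set:
./darknet detector train data/obj.data cfg/yolov4-tiny-3l.cfg \
    yolov4-tiny.conv.29 -gpus 0,1 -map -dont_show
```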

@AlexeyAB Please kindly comment on whether the described behaviour is normal. I am not asking for a solution, just for clarification. The rest I can figure out myself (for example: don't train with the -map flag and measure mAP myself after training with a script, make some code changes, keep a lower batch size, etc.).

Thank you very much + greetings, Holger

stephanecharette commented 2 years ago

Just in case it is related, see other tickets such as issue #8308.

holger-prause commented 2 years ago

Hello, thank you for responding. I did some research in the issue tracker and came across your postings too, but in the end I couldn't come to a conclusion. Is this currently still happening for you?

I think the best I can do is train without the validation dataset, measure mAP afterwards, and then pick the best model version.
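For reference, the validation set is just the valid= entry in the .data file, so it should only be read when mAP is actually computed (the values below are placeholders, not my real paths):

```
# data/obj.data -- placeholder contents
classes = 3
train   = data/train.txt
valid   = data/valid.txt   # only read when mAP is computed (-map or "detector map")
names   = data/obj.names
backup  = backup/          # where checkpoints (*.weights) are written
```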

holger-prause commented 2 years ago

OK, this problem is still reproducible with the latest code version; single or multi GPU does not matter. My workaround is measuring the mAP after the training.
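A rough sketch of that workaround (the paths, and the grep pattern for darknet's console output, are assumptions to adapt):

```
#!/bin/bash
# Evaluate every checkpoint saved during training and print its mAP,
# then pick the best one by hand. Paths and the grep pattern for
# darknet's console output are assumptions -- adjust to your setup.
DATA=data/obj.data
CFG=cfg/yolov4-tiny-3l.cfg

for w in backup/*.weights; do
    echo "=== $w ==="
    ./darknet detector map "$DATA" "$CFG" "$w" 2>/dev/null \
        | grep -i "mean average precision"
done
```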

lrf19991230 commented 1 year ago

I have the same problem: when training with -map and the iteration count reaches 1000 (the point where mAP is measured), I get the error "cuDNN Error: CUDNN_STATUS_BAD_PARAM". My error message is different from yours, but everything else is the same.

stephanecharette commented 1 year ago

Did you try what I mentioned in the comment above? https://github.com/AlexeyAB/darknet/issues/8623#issuecomment-1207483387