Closed: holger-prause closed this issue 2 years ago.
What are your batch size and subdivisions?
In my experience, sometimes when I re-build it goes away.
I agree: you have enough RAM, so you shouldn't use only 8 GiB of it; that's a waste of resources.
Hello, I did not receive many answers on this topic, so I want to try one more time, as this is somewhat critical for me. To simplify things, let me give an example.
Note: I am using the latest YOLO version and two RTX 3090 GPUs for multi-GPU training.
When I do regular training (no -map flag) with these cfg settings: batch=64, subdivisions=2,
the training works fine and nvidia-smi reports a memory usage of 16681MiB / 24268MiB.
As soon as I start training with the -map flag, I get CUDA Error: an illegal memory access was encountered: File exists the moment the mAP measurement starts.
When I change my cfg settings to batch=64, subdivisions=4,
nvidia-smi reports 8192MiB / 24268MiB,
and there is also no error during the mAP calculation. It's just that I waste a lot of GPU memory during training.
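As I understand it, darknet only loads batch/subdivisions images onto the GPU per step, so raising subdivisions shrinks the resident mini-batch, which would explain the halved memory usage. A sketch of the two settings side by side, with the memory numbers from my runs:

```
# [net] section of the cfg (illustrative; values from my runs)
# darknet processes batch/subdivisions images per GPU step,
# so memory scales with this mini-batch size.

# Trains fine, but crashes as soon as -map measurement starts:
batch=64
subdivisions=2    # mini-batch = 32 -> ~16681 MiB used

# Also survives the mAP calculation, but underuses the GPU:
batch=64
subdivisions=4    # mini-batch = 16 -> ~8192 MiB used
```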
@AlexeyAB Please kindly comment on this issue, and on whether the described behaviour is normal. I am not asking for a solution, just for clarification. The rest I can figure out myself (e.g. don't train with the -map flag and instead measure mAP with a script after training, make some code changes, keep a low batch size, etc.).
Thank you very much + greetings, Holger
Just in case it is related, see other tickets such as issue #8308.
Hello, thank you for responding. I did some research in the issues and came across your posts too. In the end, I couldn't come to a conclusion: is this still happening for you at the moment?
I think the best I can do is train without the validation step, measure mAP afterwards, and then pick the best model version.
OK, this problem is still reproducible with the latest code version; single or multi GPU does not matter. My workaround is measuring the mAP after the training.
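In case it helps anyone, my post-training measurement is just a loop over the saved checkpoints using darknet's built-in map command (the data/cfg/backup paths are from my setup, adjust as needed):

```bash
#!/bin/bash
# Measure mAP for every checkpoint darknet saved to backup/
# so the best weights can be picked manually afterwards.
for w in backup/yolov4-tiny-3l_*.weights; do
    echo "=== $w ==="
    ./darknet detector map data/obj.data cfg/yolov4-tiny-3l.cfg "$w" \
        | grep "mean average precision"   # mAP line; exact wording may vary by version
done
```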
I have the same problem. When training with -map and the iteration reaches 1000 (the point at which mAP is measured), I get the error cuDNN Error: CUDNN_STATUS_BAD_PARAM. My error is different from yours, but everything else is the same.
Did you try what I mentioned in the comment above? https://github.com/AlexeyAB/darknet/issues/8623#issuecomment-1207483387
Hello, I am training a model with yolov4-tiny-3l.cfg.
When measuring the GPU memory consumption, nvidia-smi reports 16681MiB / 24268MiB.
So the training fits into memory with some headroom left. However, I get CUDA Error: an illegal memory access was encountered: File exists as soon as the mAP measurement starts.
I thought the memory would be freed before the mAP measurement is performed, but that does not seem to be the case, or maybe something else is wrong (a CUDA / cuDNN / OpenCV bug?). I built YOLO on Linux with OpenCV, CUDA, and cuDNN.
Is this behaviour normal, or should I try updating cuDNN / CUDA, or compiling OpenCV without CUDA support? For now, all I can do is decrease the batch size, but then memory usage is only around 8000 MiB, which is a waste.
Please help me understand this problem a bit more.
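For reference, the training command I run looks roughly like this (the data/cfg paths, the pretrained weights file, and the -gpus list are from my setup):

```bash
./darknet detector train data/obj.data cfg/yolov4-tiny-3l.cfg yolov4-tiny.conv.29 -map -gpus 0,1
```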