willbattel opened this issue 5 years ago
@willbattel Hi,

Do you use the flags `-map` and/or `-dont_show` in the training command?
Do you use `random=1` in the last `[yolo]` layer in the cfg-file?
What versions of CUDA, cuDNN and OpenCV do you use?
Can you compile with `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=0 AVX=0 OPENMP=0` and check whether `yolov3-tiny.cfg` training leads to `Killed`? And can you show a screenshot of this `Killed` error?
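For reference, a minimal shell sketch of that rebuild-and-retest; the `data/obj.data` file and the `yolov3-tiny.conv.15` weights below are placeholders for whatever the actual training setup uses:

```
# Hedged sketch: rebuild with the suggested flags (passing them on the make
# command line overrides the values at the top of the Makefile).
make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=0 AVX=0 OPENMP=0 -j"$(nproc)"

# Re-run yolov3-tiny training; the .data file and pretrained weights are placeholders.
./darknet detector train data/obj.data cfg/yolov3-tiny.cfg yolov3-tiny.conv.15

# If the process ends with only "Killed", the kernel OOM killer usually
# recorded the reason in the kernel log.
dmesg | grep -iE 'out of memory|killed process' | tail -n 5
```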
I'll get back to you on my next round of training.
@willbattel @AlexeyAB I've been training yolov3 with batch=64 on 8 GPUs; the free memory monotonically decreases while the buff/cache monotonically increases. At step 20,000 the memory is 309GB free, 117GB buff/cache, 76GB used, and each step takes about 2 seconds; by step 180,000 it is 1GB free, 419GB buff/cache, 82GB used, and each step takes about 8 seconds. I also trained yolov3 on 4 GPUs and the final memory was 172GB free, 285GB buff/cache, 45GB used, and training did not slow down. I don't understand why buff/cache keeps growing instead of being released.
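One way to capture that trend alongside the training log is to sample system memory periodically — a minimal sketch, with the 60-second interval and the `mem_trend.log` filename chosen only for illustration:

```
# Hedged sketch: append a timestamped memory snapshot once a minute while training runs.
while true; do
    date '+%F %T'
    free -g | sed -n '2p'   # the "Mem:" row: total / used / free / shared / buff-cache / available (GiB)
    sleep 60
done >> mem_trend.log
```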
@mushan09 Hi,

Did you use `-gpus 0,1,2,3,4,5,6,7`?
What versions of OpenCV, cuDNN and CUDA did you use?

> in step 180,000, the memory becomes: 1GB free, 419GB buff/cache, 82GB used, and each step takes about 8 seconds.

buff/cache:
cache - should not slow down processing
buff - can slow things down only for a short time. Did you see `Buffers` in `/proc/meminfo`?

Also, did you see how much `avail mem` there was?
Did you see what value was in `Loaded` when the training slowed down to 8 sec per iteration?
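A minimal sketch of those checks on a Linux host (`top`'s avail Mem corresponds to the kernel's MemAvailable estimate):

```
# Hedged sketch: inspect the fields asked about above.
grep -E '^(MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo

# MemAvailable is roughly the memory that can be handed to applications without
# swapping, i.e. free memory plus reclaimable page cache, so a large buff/cache
# by itself is not necessarily a problem.
```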
@AlexeyAB Thank you for your reply. I train the model on the cloud and only know that the CUDA version is 10.0. I get this memory info through the `top` command, and I guess the avail mem is the sum of free and buff/cache.

Some logs are as follows:
```
 120001: 0.452042, 0.477300 avg loss, 0.004000 rate, 1.722240 seconds, 61440512 images
Loaded: 0.000035 seconds
 120002: 0.429643, 0.472534 avg loss, 0.004000 rate, 1.833752 seconds, 61441024 images
Loaded: 0.000049 seconds
 120003: 0.491967, 0.474477 avg loss, 0.004000 rate, 1.820759 seconds, 61441536 images
Loaded: 0.000048 seconds
 120032: 0.473519, 0.474382 avg loss, 0.004000 rate, 2.202599 seconds, 61456384 images
Loaded: 0.000042 seconds
......
 186976: 0.376932, 0.377062 avg loss, 0.004000 rate, 14.710853 seconds, 95731712 images
Loaded: 8.299701 seconds
 186977: 0.389399, 0.378296 avg loss, 0.004000 rate, 7.189977 seconds, 95732224 images
Loaded: 0.000036 seconds
 186978: 0.350274, 0.375494 avg loss, 0.004000 rate, 8.750645 seconds, 95732736 images
Loaded: 10.670002 seconds
 186979: 0.392814, 0.377226 avg loss, 0.004000 rate, 13.566894 seconds, 95733248 images
Loaded: 5.126933 seconds
```
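A quick way to scan the full console output for these stalls, assuming it was redirected to a file (`train.log` is just an assumed name):

```
# Hedged sketch: print every "Loaded:" line where image loading took more than
# one second, i.e. where the data loader rather than the GPU was the bottleneck.
awk '$1 == "Loaded:" && $2 + 0 > 1 { print NR ": " $0 }' train.log
```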
@mushan09 Thanks! Did you use the params `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile?
@AlexeyAB My params are `GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=0`.
I've been training various models on Tesla V100s for about 500 combined hours. The models include `yolov3`, `yolov3-spp`, `yolov3-5l`, `yolov3-tiny`, and `yolov3-tiny-3l`. For every model, on every training run, I have noticed that the main memory usage steadily increases. For example, on `yolov3` with `width=576`, `height=1024`, `batch=64`, and `subdivisions=32` the memory usage starts out around 8GB, and by the 10,000th iteration it reaches over 16GB. On `yolov3-tiny` it starts around 4GB and by the 10,000th iteration it reaches over 8GB. Eventually the training process is killed because there is no remaining memory to use (the console just says `Killed` with no other errors/info).

I compiled with the following options on the latest commit: