willbattel opened this issue 5 years ago
@willbattel Hi,

Do you use the flags `-map` and/or `-dont_show` in the training command?
Do you use `random=1` in the last `[yolo]` layer in the cfg-file?
What versions of CUDA, cuDNN and OpenCV do you use?
Can you compile with `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=0 AVX=0 OPENMP=0` and check whether `yolov3-tiny.cfg` training leads to `Killed`? And can you show a screenshot of this `Killed` error?
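For reference, a minimal shell sketch of that rebuild-and-retest; the `data/obj.data` file and the `yolov3-tiny.conv.15` weights below are placeholders for whatever the actual training setup uses:

```
# Hedged sketch: rebuild with the suggested flags (passing them on the make
# command line overrides the values at the top of the Makefile).
make clean
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=0 AVX=0 OPENMP=0 -j"$(nproc)"

# Re-run yolov3-tiny training; the .data file and pretrained weights are placeholders.
./darknet detector train data/obj.data cfg/yolov3-tiny.cfg yolov3-tiny.conv.15

# If the process ends with only "Killed", the kernel OOM killer usually
# recorded the reason in the kernel log.
dmesg | grep -iE 'out of memory|killed process' | tail -n 5
```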
I'll get back to you on my next round of training.
@willbattel @AlexeyAB I've been training yolov3 with batch=64 on 8 GPUs; the free memory monotonically decreases while the buff/cache monotonically increases. At step 20,000 the memory is 309GB free, 117GB buff/cache, 76GB used, and each step takes about 2 seconds; by step 180,000 it is 1GB free, 419GB buff/cache, 82GB used, and each step takes about 8 seconds. I also trained yolov3 on 4 GPUs and the final memory was 172GB free, 285GB buff/cache, 45GB used, and training did not slow down. I don't understand why buff/cache keeps growing instead of being released.
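One way to capture that trend alongside the training log is to sample system memory periodically — a minimal sketch, with the 60-second interval and the `mem_trend.log` filename chosen only for illustration:

```
# Hedged sketch: append a timestamped memory snapshot once a minute while training runs.
while true; do
    date '+%F %T'
    free -g | sed -n '2p'   # the "Mem:" row: total / used / free / shared / buff-cache / available (GiB)
    sleep 60
done >> mem_trend.log
```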
@mushan09 Hi,

Did you use `-gpus 0,1,2,3,4,5,6,7`?
What versions of OpenCV, cuDNN and CUDA did you use?

> in step 180,000, the memory becomes: 1GB free, 419GB buff/cache, 82GB used, and each step takes about 8 seconds.

buff/cache:
cache - should not slow down processing
buff - can slow things down only for a short time. Did you see `Buffers` in `/proc/meminfo`?

Also, did you see how much `avail mem` there was?
Did you see what value was in `Loaded` when the training slowed down to 8 sec per iteration?
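A minimal sketch of those checks on a Linux host (`top`'s avail Mem corresponds to the kernel's MemAvailable estimate):

```
# Hedged sketch: inspect the fields asked about above.
grep -E '^(MemFree|MemAvailable|Buffers|Cached):' /proc/meminfo

# MemAvailable is roughly the memory that can be handed to applications without
# swapping, i.e. free memory plus reclaimable page cache, so a large buff/cache
# by itself is not necessarily a problem.
```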
@AlexeyAB Thank you for your reply. I train the model on the cloud and only know that the CUDA version is 10.0. I get this memory info through the `top` command, and I guess the avail mem is the sum of free and buff/cache.

Some logs are as follows:
```
 120001: 0.452042, 0.477300 avg loss, 0.004000 rate, 1.722240 seconds, 61440512 images
Loaded: 0.000035 seconds
 120002: 0.429643, 0.472534 avg loss, 0.004000 rate, 1.833752 seconds, 61441024 images
Loaded: 0.000049 seconds
 120003: 0.491967, 0.474477 avg loss, 0.004000 rate, 1.820759 seconds, 61441536 images
Loaded: 0.000048 seconds
 120032: 0.473519, 0.474382 avg loss, 0.004000 rate, 2.202599 seconds, 61456384 images
Loaded: 0.000042 seconds
......
 186976: 0.376932, 0.377062 avg loss, 0.004000 rate, 14.710853 seconds, 95731712 images
Loaded: 8.299701 seconds
 186977: 0.389399, 0.378296 avg loss, 0.004000 rate, 7.189977 seconds, 95732224 images
Loaded: 0.000036 seconds
 186978: 0.350274, 0.375494 avg loss, 0.004000 rate, 8.750645 seconds, 95732736 images
Loaded: 10.670002 seconds
 186979: 0.392814, 0.377226 avg loss, 0.004000 rate, 13.566894 seconds, 95733248 images
Loaded: 5.126933 seconds
```
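A quick way to scan the full console output for these stalls, assuming it was redirected to a file (`train.log` is just an assumed name):

```
# Hedged sketch: print every "Loaded:" line where image loading took more than
# one second, i.e. where the data loader rather than the GPU was the bottleneck.
awk '$1 == "Loaded:" && $2 + 0 > 1 { print NR ": " $0 }' train.log
```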
@mushan09 Thanks! Did you use the params `GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1` in the Makefile?
@AlexeyAB My params are `GPU=1 CUDNN=1 CUDNN_HALF=0 OPENCV=0`.
I've been training various models on Tesla V100s for about 500 combined hours. The models include `yolov3`, `yolov3-spp`, `yolov3-5l`, `yolov3-tiny`, and `yolov3-tiny-3l`. For every model, on every training run, I have noticed that the main memory usage steadily increases. For example, on `yolov3` with `width=576`, `height=1024`, `batch=64`, and `subdivisions=32` the memory usage starts out around 8GB, and by the 10,000th iteration it reaches over 16GB. On `yolov3-tiny` it starts around 4GB and by the 10,000th iteration it reaches over 8GB. Eventually the training process is killed because there is no remaining memory to use (the console just says `Killed` with no other errors/info).

I compiled with the following options on the latest commit: