@zrion Hi,

* What command do you use for training?
* What params do you use in the Makefile?
* Can you show your `obj.data` file?
* What is in the `bad.list` and `bad_label.list` files?
* Can you show a screenshot of the error?
@AlexeyAB Hi,
* What command do you use for training?
I used ./darknet detector train
* What params do you use in the Makefile?
GPU=1, LIBSO=1, others=0
* Can you show your `obj.data` file?
classes = 14
train = data/train_team_feb19.txt
valid = data/train_team_feb19.txt
names = data/obj_team_feb19.names
backup = backup/
* What is in the `bad.list` and `bad_label.list` files?
Nothing in bad.list, I couldn't find bad_label.list
* Can you show a screenshot of the error?
How much CPU-RAM do you have?
Do you get this error if you train without the -map flag?
Do you use these lines in your training cfg-file? (batch=1 commented out, batch=64 subdivisions=64 un-commented)
[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=64
Do you get this error if you train with the -map flag and with batch=32 subdivisions=32?
Try to train with CUDNN=1 and with the cuDNN library installed: https://github.com/AlexeyAB/darknet#requirements
CUDNN=1 builds with cuDNN v5-v7 to accelerate training on the GPU; cuDNN should be in /usr/local/cudnn.
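For example, a minimal rebuild sequence could look like this (a sketch; it assumes you edit the flags by hand at the top of the stock Makefile and that CUDA and cuDNN are already installed):

# after setting GPU=1 CUDNN=1 (and LIBSO=1 if you need the shared library) in the Makefile:
make clean
make -j"$(nproc)"
# sanity-check that the CUDA toolkit and cuDNN headers are visible (paths vary by install)
nvcc --version
ls /usr/local/cuda/include/cudnn*.h /usr/include/cudnn*.h 2>/dev/null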
The output of dmesg:
[6823238.306674] Out of memory: Kill process 2074 (darknet) score 949 or sacrifice child
However, I have 32 GB RAM, so that should be sufficient...
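(For reference, a quick way to confirm an OOM kill and check free CPU-RAM with standard tools; dmesg may need sudo on some systems:)

dmesg | grep -i -E "out of memory|oom"    # did the kernel OOM-killer terminate darknet?
free -h                                   # how much RAM and swap is actually free?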
I used batch=64 subdivisions=64 during training.
I have cuDNN 7.1 installed, but it's not in /usr/local/cudnn/. Does that affect anything? The conventional way to install cuDNN is not into a separate cuda directory: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
@zrion So just compile with GPU=1 CUDNN=1 in the Makefile. Does it help to avoid this error?
@AlexeyAB So I'm using cfg file with batch=64 subdivisions=64 without -map, and with CUDNN=1. It looks good so far, will let you know if the error still happens. Thanks.
@zrion I fixed a memory leak during training with the -map flag and random=1 in the cfg-file.
@AlexeyAB Cool! I didn't use the -map flag and it has worked fine. Could you let me know which particular part you fixed? I changed some portions in my own code, so I want to apply the fixes myself, thanks!
@AlexeyAB I hit the same problem with the latest repo. On a machine with 32 GB memory and 2x 1080Ti cards the process gets killed every ~10k iterations, and on another machine with 64 GB memory and 4x 2080Ti cards it gets killed about every 40k iterations.
The two-card machine trained with the -map flag; the four-card machine trained with -dont_show -map. Without -map it is the same.
@Kyuuki93
@AlexeyAB
- Do you use the latest Darknet version?
Yes, it was the latest on that day: commit 10c40551dcadec6805befa6a1cecc6f69049d0d
- What params did you use in the Makefile?
GPU = 1 CUDNN = 1 CUDNN_HALF = 0 OPENCV = 1 LIBSO = 0 ZED_CAMERA = 0
rest was default
@Kyuuki93 If you train with only 1 GPU, will the training be killed?
> @Kyuuki93 If you train with only 1 GPU, will the training be killed?

Let me try this, I will report back tomorrow.
> @Kyuuki93 If you train with only 1 GPU, will the training be killed?

@AlexeyAB 1 GPU works fine.
@Kyuuki93
Install valgrind
sudo apt update
sudo apt install valgrind
Try to set max_batches=2000 and run training with Valgrind on multi-GPU
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1
After 2000 iterations, attach the valgrind-out.txt file here.
> I hit the same problem with the latest repo. On a machine with 32 GB memory and 2x 1080Ti cards the process gets killed every ~10k iterations.

Also try to increase subdivisions= twice (2x) and run training with -gpus 0,0 (both numbers the same). Will it be killed after 10k iterations?
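For example (a sketch; the data/cfg/weights names are placeholders, and subdivisions= is doubled by hand in the cfg, e.g. 64 -> 128):

./darknet detector train data/obj.data yolov3.cfg darknet53.conv.74 -gpus 0,0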
> Try to set max_batches=2000 and run training with Valgrind on multi-GPU
> valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1
> After 2000 iterations, attach the valgrind-out.txt file here.

This process got killed before training started.
This is the valgrind log file: valgrind-out.txt
> Also try to increase subdivisions= twice (2x) and run training with -gpus 0,0 (both numbers the same). Will it be killed after 10k iterations?

@AlexeyAB Yes, the process got killed this way.
> This process got killed before training started.

Try to set width=320 height=320 and run multi-GPU training with valgrind again.

> Try to set width=320 height=320 and run multi-GPU training with valgrind again.

@AlexeyAB I realized there is a memory limit when running under valgrind, so I tried to double subdivisions and got this result; it passed about 10k iters. Valgrind log: valgrind-out2.zip

> I realized there is a memory limit when running under valgrind, so I tried to double subdivisions and got this result; it passed about 10k iters. Valgrind log: valgrind-out2.zip
Thanks! So it was not killed? Did the training end automatically upon reaching max_batches=10000?
The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).
==7914== LEAK SUMMARY:
==7914== definitely lost: 5,408 bytes in 55 blocks
==7914== indirectly lost: 512 bytes in 1 blocks
==7914== possibly lost: 3,764,580 bytes in 28,938 blocks
==7914== still reachable: 6,022,531,512 bytes in 767,536 blocks
==7914== of which reachable via heuristic:
==7914== length64 : 5,432 bytes in 77 blocks
==7914== newarray : 1,760 bytes in 30 blocks
==7914== suppressed: 0 bytes in 0 blocks
There is a memory leak of only ~5 KB (definitely lost), and possibly ~3.7 MB. So there is no noticeable memory leak.
Did you run it with -map -gpus 0,1?

@Kyuuki93 I fixed several minor bugs: https://github.com/AlexeyAB/darknet/commit/f1ffb09d8bffa82aa3cf0c2ca405763eaa591f59
So you can try to run training with multi-GPU and -map with valgrind again.
nvcc --version
nvidia-smi

The suspicious entries are gpuInfoRunsOn from libnvidia-fatbinaryloader.so.410.104 and cuDevicePrimaryCtxRetain from libcuda.so.410.104, plus the OpenCV library - maybe there is some bug in the nVidia library or OpenCV: https://devtalk.nvidia.com/default/topic/1066511/tensorrt-5-1-6-1-cuda10-0-jetson-nano-memory-leakage/

Ubuntu 18.04, CUDA 10.0, Nvidia driver 410.104, GeForce GTX 1080Ti.
Yes, darknet was compiled with GPU=1 CUDNN=1 OPENCV=1, and it does not get killed when subdivisions is doubled.
How do I check CPU-RAM? With System Monitor?

> System Monitor?

Yes, or with the top command in a separate terminal.
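For example (a sketch using standard tools, assuming a single running process named darknet):

# interactive view of darknet's CPU/RAM usage
top -p "$(pidof darknet)"
# or log its resident memory (in KiB) once a minute
while pidof darknet > /dev/null; do ps -o rss= -p "$(pidof darknet)"; sleep 60; done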
@AlexeyAB
I used the latest repo and trained overnight; it seems the problem still exists. The process stopped at 12737 iters, and this was on nvidia-drivers 410.104.
Training command: ./darknet detector train ... -gpus 0,1 -map
After this, I updated to nvidia-drivers 440.36, and the process got killed again.
System Monitor when training had just started:
System Monitor at 9226 iters:
@Kyuuki93 Can you try to run training with Valgrind again and post the output log?
@WongKinYiu You trained many models using this repository.
@AlexeyAB
@WongKinYiu
4 March 2019
Ok I will try to find the reason.
So for training lightweight models, I added the functions I need into the old repo.
What functions did you add?
I think the problem may have appeared after the group convolution function was added.
Do you mean that models without group-conv don't have this issue? Or do you mean that together with group-conv I added another bug in another place?
@AlexeyAB
No, all of the models have the issue. But the methods above do not have the issue. So I think it happens in the convolutional layer, after the group convolution function was added.
@WongKinYiu
So you didn't use CMake to compile Darknet on either Linux or Windows? And do you have the memory overflow issue both when training the Detector and the Classifier?
Yes, I have trained models on 3 servers (3x Ubuntu 16) and 9 PCs (2x Windows 10, 2x Ubuntu 18, 5x Ubuntu 16) to check it. Only 1 of the servers (Ubuntu 16) and 2 of the PCs (both Windows 10) do not have this problem. Although CPU-RAM usage doesn't grow on Windows 10, sometimes the CPU usage may decrease.
It is even stranger that this happens only on Linux, but not on Windows. So it seems the error is in the 3rd-party libraries: OpenCV, Pthread, STB, cuDNN, CUDA.
And it is completely unclear why this problem occurs on one Ubuntu 16 but not on another Ubuntu 16.
4 March 2019
From 03 March to 13 March 2019 only 3 C files were changed:
With only these changes:
There were many changes on 2 March 2019 and on 18 March 2019.
@AlexeyAB
I use make on Ubuntu16/Ubuntu18, and VS2015/2017/2019 on Windows10.
And after 4 March 2019: I downloaded the new repo on 15 May 2019, and then the problem occurs.
Yes, both classifier and detector. The memory usage grows faster when training a classifier: 3 GB to >128 GB (800k epochs); when training a detector: 4 GB to ~60 GB (500k epochs).
@WongKinYiu
Do you train using only 1 GPU and without the -map flag, and still get the memory overflow issue?
@AlexeyAB
Yes. If I use multiple GPUs, 6 hours can fill 256 GB of RAM, even when training a detector.
I use:
CUDA_VISIBLE_DEVICES=0 ./darknet detector train coco.data coco.cfg coco.conv -dont_show -gpus 0
@WongKinYiu
Can you check this issue on the latest Darknet version, with just 3 tests on any small model, on a PC where the memory overflow occurs?

Install Valgrind:
sudo apt update
sudo apt install valgrind

- Set max_batches=200 burn_in=100 (just for fast checking) and run:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out_detector.txt ./darknet detector train coco.data coco.cfg coco.conv -dont_show -gpus=0
and the same for the classifier (valgrind-out_classifier.txt).
- Set max_batches=100000 in the cfg-file, so a significant amount of memory will be occupied; add getchar(); there and recompile: https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/src/detector.c#L1534
- Then: measure CPU-memory usage before running Darknet; run training without valgrind; measure CPU-memory usage after the training completes (while darknet waits on getchar();).

It will help me to localize the problem and fix this issue (a sketch of this last test is shown below).
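A minimal sketch of that last test, assuming getchar(); has already been added at the end of training in src/detector.c and reusing the placeholder data/cfg names from above:

free -h    # CPU-memory usage before running Darknet
./darknet detector train coco.data coco.cfg coco.conv -dont_show -gpus 0
# darknet now waits on getchar(); in a second terminal:
free -h    # CPU-memory usage after the training completes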
@AlexeyAB
I will try to install Valgrind after I finish my breakfast.
@AlexeyAB
I also set this as priority: high; now examining:
TBD (OPENCV=1); TBD (OPENCV=0).
valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).
40k 10G, 160k 23.5G (OPENCV=1); 70k 4G (OPENCV=0).
> @Kyuuki93 Can you try to run training with Valgrind again and post the output log?

Doubling subdivisions, or setting the image (h,w) from (416,416) to (320,320), or doing both: the training was very slow, nearly stagnant.
When training with 1 GPU, CPU-MEM holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in syncing?
@AlexeyAB Is the reason for the "Killed" error the same as for "Segmentation fault (core dumped)"? I am racking my brain to find a solution for this error and I need your help, please. After a certain number of iterations I encounter this error:
(next mAP calculation at 1000 iterations)
6: 1277.488403, 1277.129028 avg loss, 0.000000 rate, 2.127515 seconds, 192 images
Loaded: 0.000046 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.460068, GIOU: 0.460068), Class: 0.394136, Obj: 0.560775, No Obj: 0.561543, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.287364, GIOU: 0.287364), Class: 0.568923, Obj: 0.385361, No Obj: 0.490695, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.314481, GIOU: 0.119691), Class: 0.339161, Obj: 0.574429, No Obj: 0.561120, .5R: 0.000000, .75R: 0.000000, count: 2
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.493843, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559551, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.490230, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559821, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.312913, GIOU: 0.195579), Class: 0.433343, Obj: 0.407917, No Obj: 0.491107, .5R: 0.000000, .75R: 0.000000, count: 2
Segmentation fault (core dumped)
I tried changing the width and height from 960 to 800, setting random from 1 to 0, and also changing batch and subdivisions from 64 to 32, but in vain. Please, can you help me out here?
@joelmatt
Do you train with cutmix=1 or other data augmentation methods? If yes, it will get a segmentation fault with the latest commit.
By the way, mosaic=1 works fine.
@Kyuuki93 Thanks!
> When training with 1 GPU, CPU-MEM holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in syncing?

So you don't have the memory overflow issue with 1x GPU?

> Doubling subdivisions, or setting the image (h,w) from (416,416) to (320,320), or doing both: the training was very slow, nearly stagnant.

Is the training slow? Or is the loss decreasing slowly? Or is memory usage increasing slowly?
Valgrind found no memory leaks:
==31553== LEAK SUMMARY:
==31553== definitely lost: 0 bytes in 0 blocks
==31553== indirectly lost: 0 bytes in 0 blocks
==31553== possibly lost: 3,922,392 bytes in 30,358 blocks
==31553== still reachable: 6,017,118,411 bytes in 713,328 blocks
==31553== of which reachable via heuristic:
==31553== length64 : 5,560 bytes in 79 blocks
==31553== newarray : 3,632 bytes in 31 blocks
==31553== suppressed: 0 bytes in 0 blocks
@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That is slow.
@WongKinYiu Thanks!
> I also set this as priority: high; now examining:
> TBD (OPENCV=1); TBD (OPENCV=0).
> valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).
It seems I fixed definitely lost & indirectly lost: https://github.com/AlexeyAB/darknet/commit/2207acd9c432669e3f4251107791a1df6c519d99
==185817== LEAK SUMMARY:
==185817== definitely lost: 24 bytes in 1 blocks
==185817== indirectly lost: 3,408 bytes in 20 blocks
==185817== possibly lost: 775,332 bytes in 5,985 blocks
==185817== still reachable: 3,394,423,999 bytes in 2,993,790 blocks
==185817== suppressed: 0 bytes in 0 blocks
- 40k 10G, 160k 23.5G (OPENCV=1); 70k 4G (OPENCV=0).
- OPENCV=1 has the issue.

So the problem seems related to OPENCV=1. Can you try to change #ifdef OPENCV to #ifdef OPENCV_DISABLED there, recompile with OPENCV=1, and check if there is a memory overflow here? https://github.com/AlexeyAB/darknet/blob/142fcdeb1e53ec78ec35d98503726075bd721a9b/src/image.c#L1424
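For example, one way to make that change and rebuild (a sketch; the line number comes from the link above and may differ on other commits):

# disable the suspected OpenCV code path around src/image.c line 1424, then rebuild with OPENCV=1
sed -i '1424s/#ifdef OPENCV/#ifdef OPENCV_DISABLED/' src/image.c
make clean
make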
@Kyuuki93
> @AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That is slow.

I know that DEBUG=1 can slow down training 10x-100x, much more than increasing subdivisions= or decreasing width= height= does.
So you don't have a memory overflow issue if you compiled with:
- Without OpenCV: GPU=1 CUDNN=1 OPENCV=0 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
- With OpenCV: GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
but only with 1 GPU?
> @Kyuuki93 So you don't have a memory overflow issue if you compiled with:
> - Without OpenCV: GPU=1 CUDNN=1 OPENCV=0 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
> - With OpenCV: GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
> but only with 1 GPU?
Ok, I set OPENCV=1 to OPENCV=0; the training command was ./darknet detector train ... -gpus 0,1 -map. This is right after training started:

It seems the issue still exists. After 2500 iters:
After 3500 iters:
@Kyuuki93 @WongKinYiu Also check your bad.list file. Clear it before training, and check it after training.
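For example (a sketch, assuming the lists are written to the current working directory):

: > bad.list         # truncate (and create) the lists before training
: > bad_label.list
# ... run training ...
cat bad.list bad_label.list    # inspect them afterwards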
@AlexeyAB
- What command did you use for training the classifier?

I just replaced detector train with classifier train.

- Can you try to change #ifdef OPENCV to #ifdef OPENCV_DISABLED there, recompile with OPENCV=1 and check if there is a memory overflow here?
Yes, I can.
@WongKinYiu
Just as a test, try to train without CUDA_VISIBLE_DEVICES=0 and without -gpus 0.
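i.e. something like this (based on the command posted above, with the environment variable and the -gpus flag removed):

./darknet detector train coco.data coco.cfg coco.conv -dont_show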
@AlexeyAB
Updated log files:
valgrind-out_v3tiny_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_v3tiny_cuda_cudnn_avx_openmp.txt (OPENCV=0).
valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).
40k 10G, 160k 23.5G (OPENCV=1); 70k 4G (OPENCV=0).
@WongKinYiu Thanks!
> 40k 10G, 160k 23.5G (OPENCV=1); 70k 4G (OPENCV=0).

Is this memory usage during training or after training (while darknet waits for a keypress on getchar())?

> measure CPU-memory usage after the training completes (while darknet waits on getchar();)
Do you use the latest version of https://github.com/AlexeyAB/darknet or your modified repo?
Hello,
I'm training darknet to detect my custom objects. The process looks fine, without errors, after loading and during training. However, after a number of iterations (~1000) the process is killed, pretty randomly, sometimes in the middle of an iteration. What could be a possible cause, and how can it be solved? Thank you.
I'm training on a GeForce GTX 1080, with CUDA 9.0 and 27 GB of CPU RAM:
Here is my cfg file: