AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Yolov3 training killed halfway #2494

Closed: zrion closed this issue 3 years ago

zrion commented 5 years ago

Hello,

I'm training darknet to detect my custom objects. The process runs without errors after loading and during training. However, after a number of iterations (~1000) the process is killed, seemingly at random, sometimes in the middle of an iteration. What could be the cause, and how can it be solved? Thank you.

I'm training on a GeForce GTX 1080 with CUDA 9.0 and 27 GB of CPU RAM.

Here is my cfg file:

[net]
# Testing
batch=1
subdivisions=1
# Training
#batch=64
#subdivisions=64
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

# Downsample

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

######################

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 6,7,8
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 3,4,5
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=4

[route]
layers = -1, 11

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
max=200
AlexeyAB commented 5 years ago

@zrion Hi,

zrion commented 5 years ago

@AlexeyAB Hi,

* What command do you use for training?

I used ./darknet detector train ... <pretrained_weights> -map, where the pretrained weights file is darknet53.conv.74.

* What params do you use in the Makefile?

GPU=1, LIBSO=1, others=0

* Can you show your `obj.data` file?

classes = 14
train = data/train_team_feb19.txt
valid = data/train_team_feb19.txt
names = data/obj_team_feb19.names
backup = backup/

* What is in the `bad.list` and `bad_label.list` files?

There is nothing in bad.list, and I couldn't find bad_label.list.

* Can you show screenshot of the error?

(screenshot attached)

AlexeyAB commented 5 years ago

Set CUDNN=1 to build with cuDNN v5-v7 to accelerate training on the GPU; cuDNN should be in /usr/local/cudnn.
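
(A minimal rebuild sketch along those lines; it assumes the standard darknet Makefile, where the flags can either be edited at the top of the file or passed on the make command line:)

# Rebuild darknet with GPU and cuDNN support enabled.
make clean
make GPU=1 CUDNN=1 LIBSO=1 -j"$(nproc)"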

zrion commented 5 years ago

The output of dmesg:

[6823238.306674] Out of memory: Kill process 2074 (darknet) score 949 or sacrifice child

However, I have 32 GB RAM, so that should be sufficient...

I used batch=64 and subdivisions=64 during training. I have cuDNN 7.1 installed, but it's not in /usr/local/cudnn/; does that matter? The conventional way to install cuDNN does not use a separate CUDA directory: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
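
(For reference, two standard Linux commands, not part of the original report, that confirm whether the kernel's OOM killer terminated the process and let you watch RAM/swap usage while training runs:)

# Look for OOM-killer events in the kernel log.
dmesg | grep -i -E "out of memory|killed process"
# Refresh overall RAM and swap usage every 5 seconds.
watch -n 5 free -h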

AlexeyAB commented 5 years ago

@zrion So just compile with GPU=1 CUDNN=1 in the Makefile. Does it help to avoid this error?

zrion commented 5 years ago

@AlexeyAB So I'm now using the cfg file with batch=64 and subdivisions=64, training without -map, and built with CUDNN=1. It looks good so far; I'll let you know if the error happens again. Thanks.

AlexeyAB commented 5 years ago

@zrion I fixed a memory leak that occurred during training with the -map flag and random=1 in the cfg file.

zrion commented 5 years ago

@AlexeyAB Cool! I didn't use the -map flag and it has worked fine. Could you let me know which particular part you fixed? I changed some portions of the code myself, so I want to apply these fixes to my own copy. Thanks!

AlexeyAB commented 5 years ago

@zrion At least this one fix: https://github.com/AlexeyAB/darknet/commit/cad99fd75d944fda1d26f7e57e0cf5fb9d4fdf8f#diff-d77fa1db75cc45114696de9b1c005b26R288

Kyuuki93 commented 4 years ago

@AlexeyAB I'm hitting the same problem with the latest repo. On a machine with 32 GB RAM and 2x 1080 Ti cards, the process gets killed every ~10k iterations; on another machine with 64 GB RAM and 4x 2080 Ti cards, it gets killed roughly every 40k iterations.

The two-card machine was trained with -map; the four-card machine was trained with -dont_show -map. The behavior is the same without -map.

AlexeyAB commented 4 years ago

@Kyuuki93

Kyuuki93 commented 4 years ago

@AlexeyAB

  • Do you use the latest Darknet version?

Yes, it was the latest as of that day: commit 10c40551dcadec6805befa6a1cecc6f69049d0d

  • What params did you use in the Makefile?

GPU=1, CUDNN=1, CUDNN_HALF=0, OPENCV=1, LIBSO=0, ZED_CAMERA=0; the rest were left at their defaults.

AlexeyAB commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

Kyuuki93 commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

Let me try this; I'll report back tomorrow.

Kyuuki93 commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

@AlexeyAB 1 GPU works fine

AlexeyAB commented 4 years ago

@Kyuuki93

Install valgrind

sudo apt update
sudo apt install valgrind

Try to set max_batches=2000 and run training with Valgrind on multi-GPU

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1

After 2000 iterations, attach the resulting valgrind-out.txt here.
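
(As a side note, once such a run finishes, the leak summary can be pulled out of the log with a standard grep; valgrind-out.txt is the log file named in the command above:)

# Show the leak summary section of the Valgrind log.
grep -A 10 "LEAK SUMMARY" valgrind-out.txt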


I'm hitting the same problem with the latest repo. On a machine with 32 GB RAM and 2x 1080 Ti cards, the process gets killed every ~10k iterations.

Also try doubling subdivisions= (2x) and running training with -gpus 0,0 (both numbers the same, i.e. the same GPU twice).

Will it still be killed after 10k iterations?
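
(A hypothetical form of that command; obj.data, yolo-obj.cfg and darknet53.conv.74 are placeholder names rather than the exact files used in this thread:)

# List the same physical GPU twice to exercise the multi-GPU code path on a single card.
./darknet detector train obj.data yolo-obj.cfg darknet53.conv.74 -gpus 0,0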

Kyuuki93 commented 4 years ago

Try to set max_batches=2000 and run training with Valgrind on multi-GPU

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1

After 2000 iterations, attach the resulting valgrind-out.txt here.

The process got killed before training even started:

(screenshot attached)

Here is the Valgrind log file: valgrind-out.txt

Kyuuki93 commented 4 years ago

Also try doubling subdivisions= (2x) and running training with -gpus 0,0 (both numbers the same, i.e. the same GPU twice).

Will it still be killed after 10k iterations?

@AlexeyAB Yes, the process still got killed this way.

AlexeyAB commented 4 years ago

The process got killed before training even started.

Try setting width=320 height=320 and run multi-GPU training with Valgrind again.

Kyuuki93 commented 4 years ago

Try setting width=320 height=320 and run multi-GPU training with Valgrind again.

@AlexeyAB I realized there is a memory limit when running under Valgrind, so I tried doubling subdivisions and got this result; it passed about 10k iterations. Valgrind log: valgrind-out2.zip

AlexeyAB commented 4 years ago

@AlexeyAB I realized there is a memory limit when running under Valgrind, so I tried doubling subdivisions and got this result; it passed about 10k iterations. Valgrind log: valgrind-out2.zip

Thanks! So it was not killed? Did the training end automatically upon reaching max_batches=10000?

The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).

==7914== LEAK SUMMARY:
==7914==    definitely lost: 5,408 bytes in 55 blocks
==7914==    indirectly lost: 512 bytes in 1 blocks
==7914==      possibly lost: 3,764,580 bytes in 28,938 blocks
==7914==    still reachable: 6,022,531,512 bytes in 767,536 blocks
==7914==                       of which reachable via heuristic:
==7914==                         length64           : 5,432 bytes in 77 blocks
==7914==                         newarray           : 1,760 bytes in 30 blocks
==7914==         suppressed: 0 bytes in 0 blocks

The definite memory leak is ~5 KB, with ~3.7 MB possibly lost. So there is no noticeable memory leak.

AlexeyAB commented 4 years ago

@Kyuuki93 I fixed several minor bugs: https://github.com/AlexeyAB/darknet/commit/f1ffb09d8bffa82aa3cf0c2ca405763eaa591f59

So you can try running multi-GPU training with -map under Valgrind again.

Kyuuki93 commented 4 years ago

Ubuntu 18.04, CUDA 10.0, Nvidia driver 410.104, GeForce GTX 1080Ti.

Yes, darknet was compiled with GPU=1 CUDNN=1 OPENCV=1, and it does not get killed when subdivisions is doubled.

How do I check CPU RAM usage? With System Monitor?

AlexeyAB commented 4 years ago

System Monitor?

Yes,

Or run the top command in a separate terminal.
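
(For example, one way to track only the darknet process from a terminal; watch and ps are standard tools, nothing specific to this repo:)

# Print darknet's resident (RSS) and virtual (VSZ) memory every 10 seconds.
watch -n 10 'ps -o pid,rss,vsz,cmd -C darknet'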

Kyuuki93 commented 4 years ago

@AlexeyAB

(screenshot attached)

I used the latest repo and trained overnight; the problem still seems to exist. The process stopped at 12737 iterations, on NVIDIA driver 410.104.

Training command: ./darknet detector train ... -gpus 0,1 -map

After this, I updated the NVIDIA driver to 440.36 and the process got killed again. System Monitor when training had just started:

(screenshot attached)

System Monitor at 9226 iterations:

(screenshot attached)
AlexeyAB commented 4 years ago

@Kyuuki93 Can you try running training with Valgrind again and post the output log?

AlexeyAB commented 4 years ago

@WongKinYiu You trained many models using this repository.

WongKinYiu commented 4 years ago

@AlexeyAB

AlexeyAB commented 4 years ago

@WongKinYiu

4 March 2019

OK, I will try to find the reason.

So for training lightweight models, I added the functions I need to the old repo.

What functions did you add?

I think the problem may have appeared after the group convolution function was added.

Do you mean that models without group-conv don't have this issue? Or do you mean that, together with group-conv, I added another bug somewhere else?

WongKinYiu commented 4 years ago

@AlexeyAB

  1. GIoU
  2. SAM (there were some bugs in the repo, which I also fixed)
  3. Maxpool across depth
  4. Scale_x_y
  5. Stride of Maxpool
  6. anti-aliasing (old version)
  7. assisted excitation (old version)

No, all of the models have the issue, but the methods above are not the cause. So I think it happens in the convolutional layer, after the group convolution function was added.

AlexeyAB commented 4 years ago

@WongKinYiu

So you didn't use CMake to compile Darknet on either Linux or Windows? And do you have the memory overflow issue when training both the Detector and the Classifier?

  1. It is strange that Valgrind can't detect the memory leak.

Yes, I have trained models on 3 servers (3x Ubuntu 16) and 9 PCs (2x Windows 10, 2x Ubuntu 18, 5x Ubuntu 16) to check it. Only 1 of the servers (Ubuntu 16) and 2 of the PCs (both Windows 10) do not have this problem. Although CPU RAM usage doesn't grow on Windows 10, the CPU usage sometimes decreases.

  2. It is even more strange that this happens only on Linux, but not on Windows. So it seems the error is in the 3rd-party libraries: OpenCV, Pthread, STB, cuDNN, CUDA.

  3. And it is completely incomprehensible why this problem occurs on one Ubuntu 16 machine but not on another.


4 March 2019

From 03 March to 13 March 2019, only 3 C files were changed:

(screenshots attached)

With only these changes: (screenshot attached)

There were many changes on 2 March 2019 and on 18 March 2019.

WongKinYiu commented 4 years ago

@AlexeyAB

I use make on Ubuntu16/Ubuntu18, and VS2015/2017/2019 on Windows10.

And after 4 March 2019: I downloaded a fresh copy of the repo on 15 May 2019, and then the problem occurred.

Yes, both the classifier and the detector. Memory usage grows faster when training a classifier: from 3 GB to >128 GB (800k epochs); when training a detector, from 4 GB to ~60 GB (500k epochs).

AlexeyAB commented 4 years ago

@WongKinYiu Do you train using only 1 GPU and without the -map flag, and still get the memory overflow issue?

WongKinYiu commented 4 years ago

@AlexeyAB

Yes. If I use multiple GPUs, 6 hours of training can fill 256 GB of RAM even when training a detector. I use: CUDA_VISIBLE_DEVICES=0 ./darknet detector train coco.data coco.cfg coco.conv -dont_show -gpus 0

AlexeyAB commented 4 years ago

@WongKinYiu

Can you check this issue on the latest Darknet version by running 3 tests on any small model, on a PC where the memory overflow occurs?

  1. Train the Detector for 200 iterations.

  2. The same for training the Classifier (but with OPENCV=0).

  3. Train the Classifier (with OPENCV=0) for a lot of iterations, for example max_batches=100000, so that a significant amount of memory will be occupied.

That will help me localize the problem and fix this issue.
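
(For those tests, a minimal memory-logging sketch, assuming a Linux system; obj.data and yolov3-tiny-test.cfg are placeholder names:)

# Start a test run in the background and append its resident memory (MiB) to mem.log once a minute.
./darknet detector train obj.data yolov3-tiny-test.cfg -dont_show &
PID=$!
while kill -0 "$PID" 2>/dev/null; do
    echo "$(date +%H:%M:%S) $(ps -o rss= -p "$PID" | awk '{printf "%.1f", $1/1024}') MiB" >> mem.log
    sleep 60
done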

WongKinYiu commented 4 years ago

@AlexeyAB

I will try to install Valgrind after I finish my breakfast.

WongKinYiu commented 4 years ago

@AlexeyAB

I have also set this as priority: high. Currently examining:

  1. TBD (OPENCV=1); TBD (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

Kyuuki93 commented 4 years ago

@Kyuuki93 Can you try running training with Valgrind again and post the output log?

valgrind-out.zip

Whether I doubled subdivisions, set the image (h, w) from (416, 416) to (320, 320), or did both, the training was very slow, nearly stagnant.

When training with 1 GPU, CPU memory holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in the multi-GPU syncing?

joelmatt commented 4 years ago

@AlexeyAB Is the cause of the "Killed" error the same as for "Segmentation fault (core dumped)"? I am racking my brain to find a solution for this error and I need your help, please. After a certain number of iterations I encounter this error:

(next mAP calculation at 1000 iterations) 
 6: 1277.488403, 1277.129028 avg loss, 0.000000 rate, 2.127515 seconds, 192 images
Loaded: 0.000046 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.460068, GIOU: 0.460068), Class: 0.394136, Obj: 0.560775, No Obj: 0.561543, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.287364, GIOU: 0.287364), Class: 0.568923, Obj: 0.385361, No Obj: 0.490695, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.314481, GIOU: 0.119691), Class: 0.339161, Obj: 0.574429, No Obj: 0.561120, .5R: 0.000000, .75R: 0.000000, count: 2
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.493843, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559551, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.490230, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559821, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.312913, GIOU: 0.195579), Class: 0.433343, Obj: 0.407917, No Obj: 0.491107, .5R: 0.000000, .75R: 0.000000, count: 2
Segmentation fault (core dumped)

I tried changing the width and height from 960 to 800, setting random from 1 to 0, and changing batch and subdivisions from 64 to 32, but in vain. Please, can you help me out here?

WongKinYiu commented 4 years ago

@joelmatt

Do you train with cutmix=1 or other data augmentation methods? If yes, it will hit a segmentation fault with the latest commit.

By the way, mosaic=1 works fine.

AlexeyAB commented 4 years ago

@Kyuuki93 Thanks!

When training with 1 GPU, CPU memory holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in the multi-GPU syncing?

So you don't have the memory overflow issue with 1 GPU?


Whether I doubled subdivisions, set the image (h, w) from (416, 416) to (320, 320), or did both, the training was very slow, nearly stagnant.

Is the training slow? Or is the loss decreasing slowly? Or is memory usage increasing slowly?


valgrind-out.zip

No memory leaks were found by Valgrind:

==31553== LEAK SUMMARY:
==31553==    definitely lost: 0 bytes in 0 blocks
==31553==    indirectly lost: 0 bytes in 0 blocks
==31553==      possibly lost: 3,922,392 bytes in 30,358 blocks
==31553==    still reachable: 6,017,118,411 bytes in 713,328 blocks
==31553==                       of which reachable via heuristic:
==31553==                         length64           : 5,560 bytes in 79 blocks
==31553==                         newarray           : 3,632 bytes in 31 blocks
==31553==         suppressed: 0 bytes in 0 blocks


Kyuuki93 commented 4 years ago

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

I have also set this as priority: high. Currently examining:

  1. TBD (OPENCV=1); TBD (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

It seems I fixed the definitely-lost & indirectly-lost leaks: https://github.com/AlexeyAB/darknet/commit/2207acd9c432669e3f4251107791a1df6c519d99

==185817== LEAK SUMMARY:
==185817==    definitely lost: 24 bytes in 1 blocks
==185817==    indirectly lost: 3,408 bytes in 20 blocks
==185817==      possibly lost: 775,332 bytes in 5,985 blocks
==185817==    still reachable: 3,394,423,999 bytes in 2,993,790 blocks
==185817==         suppressed: 0 bytes in 0 blocks


  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

  • So OPENCV=1 has the issue.
AlexeyAB commented 4 years ago

@Kyuuki93

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

I know that DEBUG=1 can slow training down 10x-100x, rather than the increased subdivisions= or decreased width=/height= being the cause.

So you don't have an issue with memory overflow if you compiled with:

Kyuuki93 commented 4 years ago

@Kyuuki93

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

I know that DEBUG=1 can slow training down 10x-100x, rather than the increased subdivisions= or decreased width=/height= being the cause.

So you don't have an issue with memory overflow if you compiled with:

  • Without OpenCV: GPU=1 CUDNN=1 OPENCV=0 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
  • With OpenCV GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0 , but only with 1 GPU?

OK, I changed OPENCV=1 to OPENCV=0; the training command was ./darknet detector train ... -gpus 0,1 -map. This is when training had just started:

(screenshot attached)

It seems the problem still exists. After 2500 iterations:

(screenshot attached)

After 3500 iterations:

(screenshot attached)
AlexeyAB commented 4 years ago

@Kyuuki93 @WongKinYiu Also check your bad.list file. Clear it before training, and check it after training.
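
(A trivial way to do that from the shell, assuming bad.list is written to the directory darknet runs from:)

# Empty bad.list before training...
> bad.list
# ...then count and inspect any entries after training.
wc -l bad.list && cat bad.list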

WongKinYiu commented 4 years ago

@AlexeyAB

  • What command did you use for training the classifier?

I just replace detector train with classifier train.

  • Can you try changing #ifdef OPENCV to #ifdef OPENCV_DISABLED there, recompiling with OPENCV=1, and checking whether the memory overflow still occurs?

Yes, I can.

AlexeyAB commented 4 years ago

@WongKinYiu Just as a test, try training without CUDA_VISIBLE_DEVICES=0 and without -gpus 0.

WongKinYiu commented 4 years ago

@AlexeyAB

Updated log files:

  1. valgrind-out_v3tiny_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_v3tiny_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

Is this memory usage during training or after training (while darknet waits for a keypress at getchar())?

Measure the CPU memory usage after the training completes (when darknet is waiting on getchar();).


Do you use the latest version of https://github.com/AlexeyAB/darknet or your modified repo?