AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Yolov3 training killed halfway #2494

Closed: zrion closed this issue 3 years ago

zrion commented 5 years ago

Hello,

I'm training darknet to detect my custom objects. The process runs without errors after loading and during training. However, after a number of iterations (~1000) the process is killed, seemingly at random, sometimes in the middle of an iteration. What could be the cause, and how can it be solved? Thank you.

I'm training on a GeForce GTX 1080 with CUDA 9.0 and 27 GB of CPU RAM.

Here is my cfg file:

[net]
# Testing
batch=1
subdivisions=1
# Training
#batch=64
#subdivisions=64
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

# Downsample

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=128
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=256
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=512
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

# Downsample

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

######################

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 6,7,8
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=512
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 3,4,5
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=4

[route]
layers = -1, 11

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=256
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=57
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=14
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
max=200
AlexeyAB commented 5 years ago

@zrion Hi,

zrion commented 5 years ago

@AlexeyAB Hi,

* What command do you use for training?

I used ./darknet detector train ... <pretrained_weights> -map, where the pretrained weights file is darknet53.conv.74.

* What params do you use in the Makefile?

GPU=1, LIBSO=1, others=0

* Can you show your `obj.data` file?

classes = 14
train = data/train_team_feb19.txt
valid = data/train_team_feb19.txt
names = data/obj_team_feb19.names
backup = backup/

* What is in the `bad.list` and `bad_label.list` files?

There is nothing in bad.list, and I couldn't find bad_label.list.

* Can you show screenshot of the error?

(screenshot attached)

AlexeyAB commented 5 years ago

Set CUDNN=1 to build with cuDNN v5-v7 to accelerate training on the GPU; cuDNN should be in /usr/local/cudnn.
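
(A minimal rebuild sketch along those lines; it assumes the standard darknet Makefile, where the flags can either be edited at the top of the file or passed on the make command line:)

# Rebuild darknet with GPU and cuDNN support enabled.
make clean
make GPU=1 CUDNN=1 LIBSO=1 -j"$(nproc)"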

zrion commented 5 years ago

The output of dmesg:

[6823238.306674] Out of memory: Kill process 2074 (darknet) score 949 or sacrifice child

However, I have 32 GB RAM, so that should be sufficient...

I used batch=64 and subdivisions=64 during training. I have cuDNN 7.1 installed, but it's not in /usr/local/cudnn/; does that matter? The conventional way to install cuDNN does not use a separate CUDA directory: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
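
(For reference, two standard Linux commands, not part of the original report, that confirm whether the kernel's OOM killer terminated the process and let you watch RAM/swap usage while training runs:)

# Look for OOM-killer events in the kernel log.
dmesg | grep -i -E "out of memory|killed process"
# Refresh overall RAM and swap usage every 5 seconds.
watch -n 5 free -h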

AlexeyAB commented 5 years ago

@zrion So just compile with GPU=1 CUDNN=1 in the Makefile. Does it help to avoid this error?

zrion commented 5 years ago

@AlexeyAB So I'm now using the cfg file with batch=64 and subdivisions=64, training without -map, and built with CUDNN=1. It looks good so far; I'll let you know if the error happens again. Thanks.

AlexeyAB commented 5 years ago

@zrion I fixed a memory leak that occurred during training with the -map flag and random=1 in the cfg file.

zrion commented 5 years ago

@AlexeyAB Cool! I didn't use the -map flag and it has worked fine. Could you let me know which particular part you fixed? I changed some portions of the code myself, so I want to apply these fixes to my own copy. Thanks!

AlexeyAB commented 5 years ago

@zrion At least this one fix: https://github.com/AlexeyAB/darknet/commit/cad99fd75d944fda1d26f7e57e0cf5fb9d4fdf8f#diff-d77fa1db75cc45114696de9b1c005b26R288

Kyuuki93 commented 4 years ago

@AlexeyAB I'm hitting the same problem with the latest repo. On a machine with 32 GB RAM and 2x 1080 Ti cards, the process gets killed every ~10k iterations; on another machine with 64 GB RAM and 4x 2080 Ti cards, it gets killed roughly every 40k iterations.

The two-card machine was trained with -map; the four-card machine was trained with -dont_show -map. The behavior is the same without -map.

AlexeyAB commented 4 years ago

@Kyuuki93

Kyuuki93 commented 4 years ago

@AlexeyAB

  • Do you use the latest Darknet version?

Yes, it was the latest as of that day: commit 10c40551dcadec6805befa6a1cecc6f69049d0d

  • What params did you use in the Makefile?

GPU=1, CUDNN=1, CUDNN_HALF=0, OPENCV=1, LIBSO=0, ZED_CAMERA=0; the rest were left at their defaults.

AlexeyAB commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

Kyuuki93 commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

Let me try this; I'll report back tomorrow.

Kyuuki93 commented 4 years ago

@Kyuuki93 If you train with only 1 GPU, does the training still get killed?

@AlexeyAB 1 GPU works fine

AlexeyAB commented 4 years ago

@Kyuuki93

Install valgrind

sudo apt update
sudo apt install valgrind

Try to set max_batches=2000 and run training with Valgrind on multi-GPU

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1

After 2000 iterations, attach the resulting valgrind-out.txt here.
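
(As a side note, once such a run finishes, the leak summary can be pulled out of the log with a standard grep; valgrind-out.txt is the log file named in the command above:)

# Show the leak summary section of the Valgrind log.
grep -A 10 "LEAK SUMMARY" valgrind-out.txt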


I'm hitting the same problem with the latest repo. On a machine with 32 GB RAM and 2x 1080 Ti cards, the process gets killed every ~10k iterations.

Also try doubling subdivisions= (2x) and running training with -gpus 0,0 (both numbers the same, i.e. the same GPU twice).

Will it still be killed after 10k iterations?
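
(A hypothetical form of that command; obj.data, yolo-obj.cfg and darknet53.conv.74 are placeholder names rather than the exact files used in this thread:)

# List the same physical GPU twice to exercise the multi-GPU code path on a single card.
./darknet detector train obj.data yolo-obj.cfg darknet53.conv.74 -gpus 0,0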

Kyuuki93 commented 4 years ago

Try to set max_batches=2000 and run training with Valgrind on multi-GPU

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=valgrind-out.txt ./darknet classifier train .... -gpus 0,1

After 2000 iterations, attach the resulting valgrind-out.txt here.

The process got killed before training even started:

(screenshot attached)

Here is the Valgrind log file: valgrind-out.txt

Kyuuki93 commented 4 years ago

Also try doubling subdivisions= (2x) and running training with -gpus 0,0 (both numbers the same, i.e. the same GPU twice).

Will it still be killed after 10k iterations?

@AlexeyAB Yes, the process still got killed this way.

AlexeyAB commented 4 years ago

The process got killed before training even started.

Try setting width=320 height=320 and run multi-GPU training with Valgrind again.

Kyuuki93 commented 4 years ago

Try setting width=320 height=320 and run multi-GPU training with Valgrind again.

@AlexeyAB I realized there is a memory limit when running under Valgrind, so I tried doubling subdivisions and got this result; it passed about 10k iterations. Valgrind log: valgrind-out2.zip

AlexeyAB commented 4 years ago

@AlexeyAB I realized there is a memory limit when running under Valgrind, so I tried doubling subdivisions and got this result; it passed about 10k iterations. Valgrind log: valgrind-out2.zip

Thanks! So it was not killed? Did the training end automatically upon reaching max_batches=10000?

The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).

==7914== LEAK SUMMARY:
==7914==    definitely lost: 5,408 bytes in 55 blocks
==7914==    indirectly lost: 512 bytes in 1 blocks
==7914==      possibly lost: 3,764,580 bytes in 28,938 blocks
==7914==    still reachable: 6,022,531,512 bytes in 767,536 blocks
==7914==                       of which reachable via heuristic:
==7914==                         length64           : 5,432 bytes in 77 blocks
==7914==                         newarray           : 1,760 bytes in 30 blocks
==7914==         suppressed: 0 bytes in 0 blocks

The definite memory leak is ~5 KB, with ~3.7 MB possibly lost. So there is no noticeable memory leak.

AlexeyAB commented 4 years ago

@Kyuuki93 I fixed several minor bugs: https://github.com/AlexeyAB/darknet/commit/f1ffb09d8bffa82aa3cf0c2ca405763eaa591f59

So you can try running multi-GPU training with -map under Valgrind again.

Kyuuki93 commented 4 years ago

Ubuntu 18.04, CUDA 10.0, Nvidia driver 410.104, GeForce GTX 1080Ti.

Yes, darknet was compiled with GPU=1 CUDNN=1 OPENCV=1, and it does not get killed when subdivisions is doubled.

How do I check CPU RAM usage? With System Monitor?

AlexeyAB commented 4 years ago

System Monitor?

Yes,

Or run the top command in a separate terminal.
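
(For example, one way to track only the darknet process from a terminal; watch and ps are standard tools, nothing specific to this repo:)

# Print darknet's resident (RSS) and virtual (VSZ) memory every 10 seconds.
watch -n 10 'ps -o pid,rss,vsz,cmd -C darknet'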

Kyuuki93 commented 4 years ago

@AlexeyAB

(screenshot attached)

I used the latest repo and trained overnight; the problem still seems to exist. The process stopped at 12737 iterations, on NVIDIA driver 410.104.

Training command: ./darknet detector train ... -gpus 0,1 -map

After this, I updated the NVIDIA driver to 440.36 and the process got killed again. System Monitor when training had just started:

(screenshot attached)

System Monitor at 9226 iterations:

(screenshot attached)
AlexeyAB commented 4 years ago

@Kyuuki93 Can you try running training with Valgrind again and post the output log?

AlexeyAB commented 4 years ago

@WongKinYiu You trained many models using this repository.

WongKinYiu commented 4 years ago

@AlexeyAB

AlexeyAB commented 4 years ago

@WongKinYiu

4 March 2019

OK, I will try to find the reason.

So for training lightweight models, I added the functions I need to the old repo.

What functions did you add?

I think the problem may have appeared after the group convolution function was added.

Do you mean that models without group-conv don't have this issue? Or do you mean that, together with group-conv, I added another bug somewhere else?

WongKinYiu commented 4 years ago

@AlexeyAB

  1. GIoU
  2. SAM (there were some bugs in the repo, which I also fixed)
  3. Maxpool across depth
  4. Scale_x_y
  5. Stride of Maxpool
  6. anti-aliasing (old version)
  7. assisted excitation (old version)

No, all of the models have the issue, but the methods above are not the cause. So I think it happens in the convolutional layer, after the group convolution function was added.

AlexeyAB commented 4 years ago

@WongKinYiu

So you didn't use CMake to compile Darknet on either Linux or Windows? And do you have the memory overflow issue when training both the Detector and the Classifier?

  1. It is strange that Valgrind can't detect the memory leak.

Yes, I have trained models on 3 servers (3x Ubuntu 16) and 9 PCs (2x Windows 10, 2x Ubuntu 18, 5x Ubuntu 16) to check it. Only 1 of the servers (Ubuntu 16) and 2 of the PCs (both Windows 10) do not have this problem. Although CPU RAM usage doesn't grow on Windows 10, the CPU usage sometimes decreases.

  2. It is even more strange that this happens only on Linux, but not on Windows. So it seems the error is in the 3rd-party libraries: OpenCV, Pthread, STB, cuDNN, CUDA.

  3. And it is completely incomprehensible why this problem occurs on one Ubuntu 16 machine but not on another.


4 March 2019

From 03 March to 13 March 2019, only 3 C files were changed:

(screenshots attached)

With only these changes: (screenshot attached)

There were many changes on 2 March 2019 and on 18 March 2019.

WongKinYiu commented 4 years ago

@AlexeyAB

I use make on Ubuntu16/Ubuntu18, and VS2015/2017/2019 on Windows10.

And after 4 March 2019: I downloaded a fresh copy of the repo on 15 May 2019, and then the problem occurred.

Yes, both the classifier and the detector. Memory usage grows faster when training a classifier: from 3 GB to >128 GB (800k epochs); when training a detector, from 4 GB to ~60 GB (500k epochs).

AlexeyAB commented 4 years ago

@WongKinYiu Do you train using only 1 GPU and without the -map flag, and still get the memory overflow issue?

WongKinYiu commented 4 years ago

@AlexeyAB

Yes. If I use multiple GPUs, 6 hours of training can fill 256 GB of RAM even when training a detector. I use: CUDA_VISIBLE_DEVICES=0 ./darknet detector train coco.data coco.cfg coco.conv -dont_show -gpus 0

AlexeyAB commented 4 years ago

@WongKinYiu

Can you check this issue on the latest Darknet version by running 3 tests on any small model, on a PC where the memory overflow occurs?

  1. Train the Detector for 200 iterations.

  2. The same for training the Classifier (but with OPENCV=0).

  3. Train the Classifier (with OPENCV=0) for a lot of iterations, for example max_batches=100000, so that a significant amount of memory will be occupied.

That will help me localize the problem and fix this issue.
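
(For those tests, a minimal memory-logging sketch, assuming a Linux system; obj.data and yolov3-tiny-test.cfg are placeholder names:)

# Start a test run in the background and append its resident memory (MiB) to mem.log once a minute.
./darknet detector train obj.data yolov3-tiny-test.cfg -dont_show &
PID=$!
while kill -0 "$PID" 2>/dev/null; do
    echo "$(date +%H:%M:%S) $(ps -o rss= -p "$PID" | awk '{printf "%.1f", $1/1024}') MiB" >> mem.log
    sleep 60
done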

WongKinYiu commented 4 years ago

@AlexeyAB

I will try to install Valgrind after I finish my breakfast.

WongKinYiu commented 4 years ago

@AlexeyAB

I have also set this as priority: high. Currently examining:

  1. TBD (OPENCV=1); TBD (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

Kyuuki93 commented 4 years ago

@Kyuuki93 Can you try running training with Valgrind again and post the output log?

valgrind-out.zip

Whether I doubled subdivisions, set the image (h, w) from (416, 416) to (320, 320), or did both, the training was very slow, nearly stagnant.

When training with 1 GPU, CPU memory holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in the multi-GPU syncing?

joelmatt commented 4 years ago

@AlexeyAB Is the cause of the "Killed" error the same as for "Segmentation fault (core dumped)"? I am racking my brain to find a solution for this error and I need your help, please. After a certain number of iterations I encounter this error:

(next mAP calculation at 1000 iterations) 
 6: 1277.488403, 1277.129028 avg loss, 0.000000 rate, 2.127515 seconds, 192 images
Loaded: 0.000046 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.460068, GIOU: 0.460068), Class: 0.394136, Obj: 0.560775, No Obj: 0.561543, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.287364, GIOU: 0.287364), Class: 0.568923, Obj: 0.385361, No Obj: 0.490695, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: 0.314481, GIOU: 0.119691), Class: 0.339161, Obj: 0.574429, No Obj: 0.561120, .5R: 0.000000, .75R: 0.000000, count: 2
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.493843, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559551, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.490230, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 16 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.559821, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 23 Avg (IOU: 0.312913, GIOU: 0.195579), Class: 0.433343, Obj: 0.407917, No Obj: 0.491107, .5R: 0.000000, .75R: 0.000000, count: 2
Segmentation fault (core dumped)

I tried changing the width and height from 960 to 800, setting random from 1 to 0, and changing batch and subdivisions from 64 to 32, but in vain. Please, can you help me out here?

WongKinYiu commented 4 years ago

@joelmatt

Do you train with cutmix=1 or other data augmentation methods? If yes, it will hit a segmentation fault with the latest commit.

By the way, mosaic=1 works fine.

AlexeyAB commented 4 years ago

@Kyuuki93 Thanks!

When training with 1 GPU, CPU memory holds at 6.6 GiB (21%) of 31.3 GiB. Could there be a problem in the multi-GPU syncing?

So you don't have the memory overflow issue with 1 GPU?


Whether I doubled subdivisions, set the image (h, w) from (416, 416) to (320, 320), or did both, the training was very slow, nearly stagnant.

Is the training slow? Or is the loss decreasing slowly? Or is memory usage increasing slowly?


valgrind-out.zip

No memory leaks were found by Valgrind:

==31553== LEAK SUMMARY:
==31553==    definitely lost: 0 bytes in 0 blocks
==31553==    indirectly lost: 0 bytes in 0 blocks
==31553==      possibly lost: 3,922,392 bytes in 30,358 blocks
==31553==    still reachable: 6,017,118,411 bytes in 713,328 blocks
==31553==                       of which reachable via heuristic:
==31553==                         length64           : 5,560 bytes in 79 blocks
==31553==                         newarray           : 3,632 bytes in 31 blocks
==31553==         suppressed: 0 bytes in 0 blocks


Kyuuki93 commented 4 years ago

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

I have also set this as priority: high. Currently examining:

  1. TBD (OPENCV=1); TBD (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

It seems I fixed the definitely-lost & indirectly-lost leaks: https://github.com/AlexeyAB/darknet/commit/2207acd9c432669e3f4251107791a1df6c519d99

==185817== LEAK SUMMARY:
==185817==    definitely lost: 24 bytes in 1 blocks
==185817==    indirectly lost: 3,408 bytes in 20 blocks
==185817==      possibly lost: 775,332 bytes in 5,985 blocks
==185817==    still reachable: 3,394,423,999 bytes in 2,993,790 blocks
==185817==         suppressed: 0 bytes in 0 blocks


  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

  • So OPENCV=1 has the issue.
AlexeyAB commented 4 years ago

@Kyuuki93

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

I know that DEBUG=1 can slow training down 10x-100x, rather than the increased subdivisions= or decreased width=/height= being the cause.

So you don't have an issue with memory overflow if you compiled with:

Kyuuki93 commented 4 years ago

@Kyuuki93

@AlexeyAB 1 iteration (one batch) needs more than 1 hour; this log file only records 10 iterations over 10 hours. That's how slow it is.

I know that DEBUG=1 can slow training down 10x-100x, rather than the increased subdivisions= or decreased width=/height= being the cause.

So you don't have an issue with memory overflow if you compiled with:

  • Without OpenCV: GPU=1 CUDNN=1 OPENCV=0 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0
  • With OpenCV GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0 AVX=0 OPENMP=0 LIBSO=0 DEBUG=0 , but only with 1 GPU?

OK, I changed OPENCV=1 to OPENCV=0; the training command was ./darknet detector train ... -gpus 0,1 -map. This is when training had just started:

(screenshot attached)

It seems the problem still exists. After 2500 iterations:

(screenshot attached)

After 3500 iterations:

(screenshot attached)
AlexeyAB commented 4 years ago

@Kyuuki93 @WongKinYiu Also check your bad.list file. Clear it before training, and check it after training.
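
(A trivial way to do that from the shell, assuming bad.list is written to the directory darknet runs from:)

# Empty bad.list before training...
> bad.list
# ...then count and inspect any entries after training.
wc -l bad.list && cat bad.list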

WongKinYiu commented 4 years ago

@AlexeyAB

  • What command did you use for training the classifier?

I just replace detector train with classifier train.

  • Can you try changing #ifdef OPENCV to #ifdef OPENCV_DISABLED there, recompiling with OPENCV=1, and checking whether the memory overflow still occurs?

Yes, I can.

AlexeyAB commented 4 years ago

@WongKinYiu Just as a test, try training without CUDA_VISIBLE_DEVICES=0 and without -gpus 0.

WongKinYiu commented 4 years ago

@AlexeyAB

Updated log files:

  1. valgrind-out_v3tiny_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_v3tiny_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  2. valgrind-out_darknet_cuda_cudnn_avx_openmp_opencv.txt (OPENCV=1); valgrind-out_darknet_cuda_cudnn_avx_openmp.txt (OPENCV=0).

  3. 40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

40k iters: 10 GB, 160k iters: 23.5 GB (OPENCV=1); 70k iters: 4 GB (OPENCV=0).

Is this memory usage during training or after training (while darknet waits for a keypress at getchar())?

Measure the CPU memory usage after the training completes (when darknet is waiting on getchar();).


Do you use the latest version of https://github.com/AlexeyAB/darknet or your modified repo?