AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386

Open AlexeyAB opened 4 years ago

AlexeyAB commented 4 years ago

A higher mini_batch size gives higher accuracy (mAP / Top1 / Top5).

Training on the GPU while using CPU-RAM allows you to increase the mini-batch size significantly, by 4x-16x and more.

You can train with a 16x larger mini_batch, but at about 5x lower speed; on Yolov3-spp it should give you roughly +2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
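
With this cfg, the mini-batch size is batch / subdivisions = 64 / 2 = 32, which is typically far more than fits in GPU-VRAM alone at 416x416; optimized_memory=3 keeps the large training buffers in pinned CPU-RAM instead.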

Tested:

Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8 GB GPU-VRAM + 32 GB CPU-RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map


Not well tested yet:



Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

Training charts (not reproduced here): mini_batch=32 gives about +5 mAP@0.5 compared to mini_batch=8.
AlexeyAB commented 4 years ago

@erikguo

BTW, did you run the CPU-MEM training with 4 GPUs together?

No, because 4x more CPU-RAM would be required for the same mini_batch_size. It would be ~4x faster (if your CPU has 64-128 PCIe lanes, like an AMD Epyc), but it would require 4x more CPU-RAM.
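
For example, if single-GPU training uses roughly 32 GB of pinned CPU-RAM for a given mini_batch_size (as in the 32 GB CPU-RAM test above), the same configuration on 4 GPUs would need roughly 4 x 32 = 128 GB of CPU-RAM.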

kossolax commented 4 years ago

Isn't there a GPU memory leak? After calling free_network there is still memory in use according to nvidia-smi. Running this in a loop fills up the GPU and then crashes:

/* loop adapted from detector.c; cfgfile, weightfile, datacfg, args,
   load_thread, buffer, train and train_images_num come from the
   surrounding training code */
for (int p = 0; p < 1000; p++) {

    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }

    *subnet.seen = 0;

    /* train for one pass over the training set */
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);

        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }

    /* switch to batch=1 to compute mAP, then restore the batch size */
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);

    /* GPU memory reported by nvidia-smi keeps growing after this call */
    free_network(subnet);
}
AlexeyAB commented 4 years ago

@kossolax Is it related to optimized_memory=3 and GPU-processing on CPU-RAM? Or just related to free_network()?

kossolax commented 4 years ago

I'm using optimized_memory=0, so it's just related to free_network. Since you changed the memory handling quite a lot, I guess this could be related; should I start a new issue?

AlexeyAB commented 4 years ago

@kossolax Yes, start a new issue and I will investigate it.

WongKinYiu commented 4 years ago

@AlexeyAB Hello,

I think cross-iteration batch normalization can achieve a similar result with higher training speed. https://github.com/Howal/Cross-iterationBatchNorm

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1. So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN, but batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.

I.e. using batch=64 subdivisions=8 with BN: avg mini_batch_size = 64/8 = 8

I.e. using batch=64 subdivisions=8 with CBN: avg mini_batch_size = (8+16+24+32+40+48+56+64)/8 = 36

You can try it on Classifier csresnext50


So inside 1 batch it will average the values of Mean and Variance. I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.

  • For the 1st mini_batch it will use Mean[1] & Variance[1]
  • For the 2nd mini_batch it will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
  • For the 3rd mini_batch it will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3]) ....
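
A minimal sketch of this within-batch averaging (illustrative only, not the actual darknet code; the names cbn_state / cbn_update are made up):

/* Per-channel Mean/Variance are averaged cumulatively over the mini-batches
   of the current batch; the running average restarts when a new batch begins. */
typedef struct {
    float *rolling_mean;       /* avg(Mean[1..k]) per channel */
    float *rolling_variance;   /* avg(Variance[1..k]) per channel */
    int minibatch_index;       /* k = mini-batches seen in the current batch */
    int channels;
} cbn_state;

/* call once per mini-batch with that mini-batch's statistics */
void cbn_update(cbn_state *s, const float *mean, const float *variance)
{
    s->minibatch_index++;
    float k = (float)s->minibatch_index;
    for (int c = 0; c < s->channels; ++c) {
        /* cumulative average: avg_k = avg_{k-1} + (x_k - avg_{k-1}) / k */
        s->rolling_mean[c]     += (mean[c]     - s->rolling_mean[c])     / k;
        s->rolling_variance[c] += (variance[c] - s->rolling_variance[c]) / k;
    }
}

/* call after the whole batch is processed and the weights are updated */
void cbn_reset(cbn_state *s) { s->minibatch_index = 0; }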

For using:

[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky

Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside 1 batch (without cross-iteration) does not cause problems with statistics obsolescence.

Paper: https://arxiv.org/abs/2002.05712v2



I used the formulas from the paper for the averaged Mean and Variance (equation screenshots omitted).

WongKinYiu commented 4 years ago

@AlexeyAB

Thank you a lot, I'll give you feedback after I finish training.

AlexeyAB commented 4 years ago

@WongKinYiu

I also added dynamic mini batch size when you train with random=1: https://github.com/AlexeyAB/darknet/commit/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1

Just add dynamic_minibatch=1 in the [net] section:

[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416

...
[yolo]
random=1

So even if the CBN part does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.

0.8 is just a coefficient to avoid out-of-memory errors for some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9: https://github.com/AlexeyAB/darknet/blob/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1/src/detector.c#L191
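
As a hedged sketch of the idea (illustrative only; the real logic is in detector.c at the line linked above, and the function name here is made up): when random=1 picks a smaller input resolution, more images fit in memory, so the mini-batch can be scaled up roughly by the ratio of pixel counts, damped by that safety coefficient.

int dynamic_mini_batch(int cfg_mini_batch, int cfg_w, int cfg_h,
                       int cur_w, int cur_h, float safety /* e.g. 0.8f */)
{
    /* scale the mini-batch inversely with the number of input pixels */
    float pixel_ratio = (float)(cfg_w * cfg_h) / (float)(cur_w * cur_h);
    int mb = (int)(cfg_mini_batch * pixel_ratio * safety);
    if (mb < cfg_mini_batch) mb = cfg_mini_batch;   /* assumption: never drop below the cfg value */
    return mb;
}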


Also, you can adjust the mini-batch size to your GPU-RAM amount (batch and subdivisions do not necessarily have to be multiples of 2), since mini_batch_size = batch / subdivisions:

64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16

WongKinYiu commented 4 years ago

@AlexeyAB OK,

Thank you, SpineNet-49-omega will finish training in half an hour. I will report the result soon.

Answergeng commented 4 years ago

I tried yolov3-spp.cfg with the following settings: optimized_memory=3, workspace_size_limit_MB=1000. My CPU-RAM is 64 GB; after loading it uses 20.9 GB, but it always gets stuck here:

net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file

could you tell me why?

Answergeng commented 4 years ago


Following up on the above: now I get this error:

CUDA Error: invalid device pointer: No error Assertion failed: 0, file ....\src\utils.c, line 325

LucasSloan commented 4 years ago

I just tried to run with this configuration:

batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000

and got this error:

CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02 

 CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable

I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and CPU memory usage peaked at about 40 GB out of 64 GB.

arnaud-nt2i commented 4 years ago

@WongKinYiu @AlexeyAB @cenit @LukeAi

Hi everyone! Some simple questions I could not find answers to anywhere else... even on Google Scholar for the second one...

1) ~Is it possible to use dynamic_minibatch=1 while using custom resize of the network, e.g. "random=1.34"?~ |--> Yes
2) ~Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up?~ |--> Yes
3) ~How is it possible that the mini_batch parameter has an influence on mAP with a consistent batch size?~ |--> Because batch normalization is done on the mini-batch size and not on the batch size.

As far as my understanding goes, the batch size is the number of samples processed before the weights update, while mini_batch is just a computational trick to avoid loading and processing the whole batch at once, so it should not have an impact...
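
To illustrate the answer to question 3 above: batch-norm statistics are computed per mini-batch (one forward pass), so a smaller mini_batch means noisier Mean/Variance estimates even though the weight update still happens only once per full batch. A minimal sketch (not darknet's actual code; the function name and NCHW layout are assumptions):

/* per-channel BN statistics over one mini-batch (NCHW layout) */
void bn_minibatch_stats(const float *x, int mini_batch, int channels, int spatial,
                        float *mean, float *variance)
{
    for (int c = 0; c < channels; ++c) {
        int n = mini_batch * spatial;   /* samples contributing to this channel */
        float m = 0.f, v = 0.f;
        for (int b = 0; b < mini_batch; ++b)
            for (int s = 0; s < spatial; ++s)
                m += x[(b * channels + c) * spatial + s];
        m /= n;
        for (int b = 0; b < mini_batch; ++b)
            for (int s = 0; s < spatial; ++s) {
                float d = x[(b * channels + c) * spatial + s] - m;
                v += d * d;
            }
        mean[c] = m;
        variance[c] = v / n;
    }
}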

I would be very happy with an answer to those questions and I'm sure I am not alone not understanding.

igoriok1994 commented 3 years ago

What parameters can I use with an nVidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 with CPU_RAM = 64 GB?


###
# Training
batch=64
subdivisions=8

###
width=608
height=608

###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16

I tried these, but training would take 100h+, which is too long.


On another PC with a GTX 970 (4 GB) and an i5 with 16 GB RAM, using these parameters:

###
# Training
batch=64
subdivisions=16

###
width=608
height=608

I get roughly 16-20h of training time.

Classes=5, max iterations= 10000.

igoriok1994 commented 3 years ago

On the laptop with these settings:

###
# Training
batch=64
subdivisions=32

###
width=608
height=608

### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16

I get the result shown in the attached screenshot.

Btw this is Tiny YoloV4

pullmyleg commented 3 years ago

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

igoriok1994 commented 3 years ago

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

I want to speed up training without mAP loss :)

pullmyleg commented 3 years ago

@igoriok1994 CPU memory is very slow; in my experience it is 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.

nanhui69 commented 3 years ago

(quoting @AlexeyAB's CBN explanation from the comment above, in full)

Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? There are 73 convolutional sections. @AlexeyAB

pullmyleg commented 3 years ago

@AlexeyAB have you seen this implementation for decreasing memory usage allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization