AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386

Open AlexeyAB opened 4 years ago

AlexeyAB commented 4 years ago

A higher mini_batch size gives higher accuracy (mAP / Top1 / Top5).

Training on the GPU while using CPU-RAM allows you to increase the mini-batch size significantly, by 4x-16x and more.

You can train with a 16x larger mini_batch, but at about 5x lower speed; on Yolov3-spp it should give you roughly +2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000
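
With this cfg, the mini-batch size is batch / subdivisions = 64 / 2 = 32, which is typically far more than fits in GPU-VRAM alone at 416x416; optimized_memory=3 keeps the large training buffers in pinned CPU-RAM instead.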

Tested:

Tested on the model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8 GB GPU-VRAM + 32 GB CPU-RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map


Not well tested yet:



Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

Training charts (not reproduced here): mini_batch=32 gives about +5 mAP@0.5 compared to mini_batch=8.
AlexeyAB commented 4 years ago

@erikguo

BTW, did you run the CPU-MEM training with 4 GPUs together?

No, because 4x more CPU-RAM would be required for the same mini_batch_size. It would be ~4x faster (if your CPU has 64-128 PCIe lanes, like an AMD Epyc), but it would require 4x more CPU-RAM.
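
For example, if single-GPU training uses roughly 32 GB of pinned CPU-RAM for a given mini_batch_size (as in the 32 GB CPU-RAM test above), the same configuration on 4 GPUs would need roughly 4 x 32 = 128 GB of CPU-RAM.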

kossolax commented 4 years ago

Isn't there a GPU memory leak? After calling free_network there is still memory in use according to nvidia-smi. Running this in a loop fills up the GPU and then crashes:

/* loop adapted from detector.c; cfgfile, weightfile, datacfg, args,
   load_thread, buffer, train and train_images_num come from the
   surrounding training code */
for (int p = 0; p < 1000; p++) {

    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }

    *subnet.seen = 0;

    /* train for one pass over the training set */
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);

        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }

    /* switch to batch=1 to compute mAP, then restore the batch size */
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);

    /* GPU memory reported by nvidia-smi keeps growing after this call */
    free_network(subnet);
}
AlexeyAB commented 4 years ago

@kossolax Is it related to optimized_memory=3 and GPU-processing on CPU-RAM? Or just related to free_network()?

kossolax commented 4 years ago

I'm using optimized_memory=0, so it's just related to free_network. Since you changed the memory handling quite a lot, I guess this could be related; should I start a new issue?

AlexeyAB commented 4 years ago

@kossolax Yes, start a new issue and I will investigate it.

WongKinYiu commented 4 years ago

@AlexeyAB Hello,

I think cross-iteration batch normalization can achieve a similar result with higher training speed. https://github.com/Howal/Cross-iterationBatchNorm

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1. So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN, but batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.

I.e. using batch=64 subdivisions=8 with BN: avg mini_batch_size = 64/8 = 8

I.e. using batch=64 subdivisions=8 with CBN: avg mini_batch_size = (8+16+24+32+40+48+56+64)/8 = 36

You can try it on Classifier csresnext50


So inside 1 batch it will average the values of Mean and Variance. I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.

  • For the 1st mini_batch it will use Mean[1] & Variance[1]
  • For the 2nd mini_batch it will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
  • For the 3rd mini_batch it will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3]) ....
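
A minimal sketch of this within-batch averaging (illustrative only, not the actual darknet code; the names cbn_state / cbn_update are made up):

/* Per-channel Mean/Variance are averaged cumulatively over the mini-batches
   of the current batch; the running average restarts when a new batch begins. */
typedef struct {
    float *rolling_mean;       /* avg(Mean[1..k]) per channel */
    float *rolling_variance;   /* avg(Variance[1..k]) per channel */
    int minibatch_index;       /* k = mini-batches seen in the current batch */
    int channels;
} cbn_state;

/* call once per mini-batch with that mini-batch's statistics */
void cbn_update(cbn_state *s, const float *mean, const float *variance)
{
    s->minibatch_index++;
    float k = (float)s->minibatch_index;
    for (int c = 0; c < s->channels; ++c) {
        /* cumulative average: avg_k = avg_{k-1} + (x_k - avg_{k-1}) / k */
        s->rolling_mean[c]     += (mean[c]     - s->rolling_mean[c])     / k;
        s->rolling_variance[c] += (variance[c] - s->rolling_variance[c]) / k;
    }
}

/* call after the whole batch is processed and the weights are updated */
void cbn_reset(cbn_state *s) { s->minibatch_index = 0; }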

For using:

[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky

or

[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky

Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside 1 batch (without cross-iteration) does not cause problems with statistics obsolescence.

Paper: https://arxiv.org/abs/2002.05712v2



I used the formulas from the paper for the averaged Mean and Variance (equation screenshots omitted).

WongKinYiu commented 4 years ago

@AlexeyAB

Thank you a lot, I'll give you feedback after I finish training.

AlexeyAB commented 4 years ago

@WongKinYiu

I also added dynamic mini batch size when you train with random=1: https://github.com/AlexeyAB/darknet/commit/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1

Just add dynamic_minibatch=1 in the [net] section:

[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416

...
[yolo]
random=1

So even if the CBN part does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.

0.8 is just a coefficient to avoid out-of-memory errors for some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9: https://github.com/AlexeyAB/darknet/blob/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1/src/detector.c#L191
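
As a hedged sketch of the idea (illustrative only; the real logic is in detector.c at the line linked above, and the function name here is made up): when random=1 picks a smaller input resolution, more images fit in memory, so the mini-batch can be scaled up roughly by the ratio of pixel counts, damped by that safety coefficient.

int dynamic_mini_batch(int cfg_mini_batch, int cfg_w, int cfg_h,
                       int cur_w, int cur_h, float safety /* e.g. 0.8f */)
{
    /* scale the mini-batch inversely with the number of input pixels */
    float pixel_ratio = (float)(cfg_w * cfg_h) / (float)(cur_w * cur_h);
    int mb = (int)(cfg_mini_batch * pixel_ratio * safety);
    if (mb < cfg_mini_batch) mb = cfg_mini_batch;   /* assumption: never drop below the cfg value */
    return mb;
}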


Also, you can adjust the mini-batch size to your GPU-RAM amount (batch and subdivisions do not necessarily have to be multiples of 2), since mini_batch_size = batch / subdivisions:

64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16

WongKinYiu commented 4 years ago

@AlexeyAB OK,

Thank you, SpineNet-49-omega will finish training in half an hour. I will report the result soon.

Answergeng commented 4 years ago

I tried yolov3-spp.cfg with the following settings: optimized_memory=3, workspace_size_limit_MB=1000. My CPU-RAM is 64 GB; after loading it uses 20.9 GB, but it always gets stuck here:

net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file

could you tell me why?

Answergeng commented 4 years ago


Following up on the above: now I get this error:

CUDA Error: invalid device pointer: No error Assertion failed: 0, file ....\src\utils.c, line 325

LucasSloan commented 4 years ago

I just tried to run with this configuration:

batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000

and got this error:

CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02 

 CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable

I've tried several different values for workspace_size_limit_MB and subdivisions, and all fail with the same message. I was running with a single GPU, and CPU memory usage peaked at about 40 GB out of 64 GB.

arnaud-nt2i commented 4 years ago

@WongKinYiu @AlexeyAB @cenit @LukeAi

Hi everyone! Some simple questions I could not find answers to anywhere else... even on Google Scholar for the second one...

1) ~Is it possible to use dynamic_minibatch=1 while using custom resize of the network, e.g. "random=1.34"?~ |--> Yes
2) ~Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up?~ |--> Yes
3) ~How is it possible that the mini_batch parameter has an influence on mAP with a consistent batch size?~ |--> Because batch normalization is done on the mini-batch size and not on the batch size.

As far as my understanding goes, the batch size is the number of samples processed before the weights update, while mini_batch is just a computational trick to avoid loading and processing the whole batch at once, so it should not have an impact...
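
To illustrate the answer to question 3 above: batch-norm statistics are computed per mini-batch (one forward pass), so a smaller mini_batch means noisier Mean/Variance estimates even though the weight update still happens only once per full batch. A minimal sketch (not darknet's actual code; the function name and NCHW layout are assumptions):

/* per-channel BN statistics over one mini-batch (NCHW layout) */
void bn_minibatch_stats(const float *x, int mini_batch, int channels, int spatial,
                        float *mean, float *variance)
{
    for (int c = 0; c < channels; ++c) {
        int n = mini_batch * spatial;   /* samples contributing to this channel */
        float m = 0.f, v = 0.f;
        for (int b = 0; b < mini_batch; ++b)
            for (int s = 0; s < spatial; ++s)
                m += x[(b * channels + c) * spatial + s];
        m /= n;
        for (int b = 0; b < mini_batch; ++b)
            for (int s = 0; s < spatial; ++s) {
                float d = x[(b * channels + c) * spatial + s] - m;
                v += d * d;
            }
        mean[c] = m;
        variance[c] = v / n;
    }
}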

I would be very happy with an answer to those questions and I'm sure I am not alone not understanding.

igoriok1994 commented 3 years ago

What parameters can I use with an nVidia Quadro M1000M (GPU_RAM = 2 GB) and an i7 with CPU_RAM = 64 GB?


###
# Training
batch=64
subdivisions=8

###
width=608
height=608

###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16

I tried these, but training would take 100h+, which is too long.


On another PC with a GTX 970 (4 GB) and an i5 with 16 GB RAM, using these parameters:

###
# Training
batch=64
subdivisions=16

###
width=608
height=608

I get roughly 16-20h of training time.

Classes=5, max iterations= 10000.

igoriok1994 commented 3 years ago

On the laptop with these settings:

###
# Training
batch=64
subdivisions=32

###
width=608
height=608

### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16

I get the result shown in the attached screenshot.

Btw this is Tiny YoloV4

pullmyleg commented 3 years ago

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

igoriok1994 commented 3 years ago

@igoriok1994 what are you trying to achieve? What is your end goal or output? It will help with recommending settings.

I want to speed up training without mAP loss :)

pullmyleg commented 3 years ago

@igoriok1994 CPU memory is very slow; in my experience it is 5x+ slower than regular GPU training. The benefit of CPU-memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.

nanhui69 commented 3 years ago

(quoting @AlexeyAB's CBN explanation from the comment above, in full)

Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? There are 73 convolutional sections. @AlexeyAB

pullmyleg commented 3 years ago

@AlexeyAB have you seen this implementation for decreasing memory usage allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization