Open AlexeyAB opened 4 years ago
@erikguo
BTW, did you run the CPU-MEM training with 4 GPUs together?
No, because that would require 4x more CPU-RAM for the same mini_batch_size. It would be 4x faster (if the CPU has 64 - 128 PCIe lanes, like an AMD Epyc CPU), but it would require 4x more CPU-RAM.
Isn't there a GPU memory leak? After calling "free_network", nvidia-smi still shows memory in use. Running it in a loop fills the GPU memory and then crashes.
for (int p = 0; p < 1000; p++) {
    network subnet = parse_network_cfg(cfgfile);
    if (weightfile) {
        load_weights(&subnet, weightfile);
    }
    *subnet.seen = 0;
    while (*subnet.seen < train_images_num) {
        pthread_join(load_thread, 0);
        train = buffer;
        load_thread = load_data(args);
        float loss = train_network_waitkey(subnet, train, 0);
        free_data(train);
    }
    int tmp = subnet.batch;
    set_batch_network(&subnet, 1);
    float map = validate_detector_map(datacfg, cfgfile, weightfile, 0.25, 0.5, 0, subnet.letter_box, &subnet);
    printf("%f", map);
    set_batch_network(&subnet, tmp);
    free_network(subnet);
}
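One way to confirm a leak like this from outside darknet is to record the memory that nvidia-smi reports after each pass of the loop. The following is a minimal Python sketch, assuming nvidia-smi is on the PATH; the helper names and the 50 MiB threshold are hypothetical and not part of darknet.

```python
import subprocess

# nvidia-smi query that prints used memory in MiB, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]

def parse_used_mib(text):
    """Parse the query output: one integer (MiB used) per non-empty line."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]

def read_used_mib():
    """Run nvidia-smi and return the used memory per GPU in MiB."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_used_mib(out)

def looks_like_leak(samples, min_growth_mib=50):
    """True if usage never drops between samples and grows by min_growth_mib overall.

    The threshold is an arbitrary assumption, chosen only for illustration.
    """
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth_mib
```

Calling read_used_mib() once per iteration of the outer loop and passing the collected values for one GPU to looks_like_leak() gives a quick yes/no signal before opening a separate issue.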
@kossolax Is it related to optimized_memory=3 and GPU-processing on CPU-RAM? Or just related to free_network()?
I'm using optimized_memory=0, so it's just related to free_network. Since you changed a lot of the memory handling, I guess this could be related. Should I start a new issue?
@kossolax Yes, start a new issue, I will investigate it.
@AlexeyAB Hello,
I think cross-iteration batch normalization can achieve a similar result with higher training speed. https://github.com/Howal/Cross-iterationBatchNorm
@WongKinYiu Hi,
I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN.
But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.
I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8: 64/8 = 8
I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36: (8+16+24+32+40+48+56+64)/8 = 36
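The arithmetic above can be reproduced with a short Python sketch. The function name is hypothetical; it only restates the averaging described in this comment and is not darknet code.

```python
def avg_effective_minibatch(batch, subdivisions, cbn):
    """Average number of samples contributing to the statistics per mini-batch.

    With plain BN every mini-batch uses only its own samples.
    With the in-batch averaging described above, the k-th mini-batch
    benefits from the statistics of the first k mini-batches.
    """
    mini = batch // subdivisions
    if not cbn:
        return float(mini)
    sizes = [mini * k for k in range(1, subdivisions + 1)]
    return sum(sizes) / subdivisions

print(avg_effective_minibatch(64, 8, cbn=False))  # 8.0
print(avg_effective_minibatch(64, 8, cbn=True))   # 36.0
```

The same function gives 75.0 for batch=120 subdivisions=4 with CBN, compared with 30.0 with BN and 120.0 for subdivisions=1 with BN, which matches the ordering stated above.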
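You can try it on the classifier csresnext50.
So inside 1 batch it will average the values of Mean and Variance. I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.
- For the 1st mini_batch will use Mean[1] & Variance[1]
- For the 2nd mini_batch will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
- For the 3rd mini_batch will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3]) ....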
For using:
[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky
Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside 1 batch (without Cross-iteration) does not suffer from the problem of obsolete statistics.
Paper: https://arxiv.org/abs/2002.05712v2
I used these formulas:
@AlexeyAB
Thanks a lot, I'll give you feedback after the training finishes.
@WongKinYiu
I also added dynamic mini batch size when you train with random=1: https://github.com/AlexeyAB/darknet/commit/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1
Just add dynamic_minibatch=1 in the [net] section:
[net]
batch=64
subdivisions=8
dynamic_minibatch=1
width=416
height=416
...
[yolo]
random=1
So even if part of CBN does not work properly, you can still use dynamic_minibatch=1 to increase the mini_batch size.
0.8 is just a coefficient to avoid running out of memory at some network resolutions (sometimes cuDNN requires much more memory for a lower resolution than for a higher one), but you can try setting it to 0.9: https://github.com/AlexeyAB/darknet/blob/c814d56ec11ed3b22264d8efb2dd4ed27329f5d1/src/detector.c#L191
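To illustrate what such a coefficient does, here is a rough Python sketch. It is not a copy of detector.c: it assumes that memory use is roughly proportional to mini_batch x width x height, that the configured mini_batch fits at the largest resolution, and that the mini_batch is never reduced below the configured value. The function and parameter names are hypothetical.

```python
def dynamic_minibatch(init_b, max_w, max_h, w, h, coef=0.8):
    """Estimate a mini_batch for resolution w x h.

    init_b is the mini_batch assumed to fit at the largest resolution
    max_w x max_h. The pixel-area ratio gives the ideal scaling, and coef
    (0.8 in the comment above) leaves headroom for resolutions where
    memory use is higher than the area ratio suggests.
    """
    scaled = (init_b * max_w * max_h) // (w * h)   # ideal scaling by area
    reduced = int(scaled * coef)                   # apply the safety margin
    return reduced if reduced > init_b else init_b

print(dynamic_minibatch(8, 608, 608, 320, 320))  # 22
print(dynamic_minibatch(8, 608, 608, 608, 608))  # 8
```

Raising coef to 0.9 in this sketch gives 25 instead of 22 for the 320x320 case, which shows why a larger coefficient trades memory headroom for a bigger mini_batch.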
You can also adjust the mini batch size to the amount of GPU-RAM you have (batch and subdivisions do not have to be multiples of 2):
batch / subdivisions = mini_batch_size
64/8 = 8
63/7 = 9
70/7 = 10
66/6 = 11
60/5 = 12
65/5 = 13
70/5 = 14
60/4 = 15
64/4 = 16
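The table above can be generated for any target mini_batch_size. This is a small Python sketch; the function name and the 60 - 70 batch range are assumptions taken from the values listed here.

```python
def configs_for_minibatch(target, min_batch=60, max_batch=70):
    """Return (batch, subdivisions) pairs with batch / subdivisions == target.

    Only batch values inside [min_batch, max_batch] are kept, so the total
    batch stays close to the usual batch=64.
    """
    pairs = []
    for subdivisions in range(1, max_batch // target + 1):
        batch = target * subdivisions
        if min_batch <= batch <= max_batch:
            pairs.append((batch, subdivisions))
    return pairs

for target in range(8, 17):
    print(target, configs_for_minibatch(target))
```

For a target of 9 this returns [(63, 7)], and for 10 it returns both (60, 6) and (70, 7), so some targets have more than one valid setting.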
@AlexeyAB OK,
Thank you, SpineNet-49-omega will finish training in half an hour. I will report the result soon.
I tried yolov3-spp.cfg with the following settings: optimized_memory=3 workspace_size_limit_MB=1000. My CPU-RAM is 64 GB; after loading, 20.9 GB is in use, but it always gets stuck here:
net.optimized_memory = 3
batch = 1, time_steps = 1, train = 0
yolov3-spp
net.optimized_memory = 3
pre_allocate... pinned_ptr = 0000000000000000
pre_allocate: size = 8192 MB, num_of_blocks = 8, block_size = 1024 MB
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
Allocated 1073741824 pinned block
batch = 8, time_steps = 1, train = 1
Pinned block_id = 0, filled = 88.134911 %
Pinned block_id = 1, filled = 96.948578 %
Pinned block_id = 2, filled = 96.949005 %
Pinned block_id = 3, filled = 99.152946 %
Pinned block_id = 4, filled = 99.153809 %
Pinned block_id = 5, filled = 98.830368 %
Pinned block_id = 6, filled = 99.875595 %
Done! Loaded 85 layers from weights-file
Could you tell me why?
Now I get this error:
CUDA Error: invalid device pointer: No error
Assertion failed: 0, file ....\src\utils.c, line 325
Just tried to run with this on:
batch=64
subdivisions=4
dynamic_minibatch=1
width=960
height=576
optimized_memory=3
workspace_size_limit_MB=8000
and got this error:
CUDA status Error: file: /home/lucas/Development/darknet/src/dark_cuda.c : () : line: 454 : build time: May 18 2020 - 15:30:02
CUDA Error: invalid device pointer
CUDA Error: invalid device pointer: Resource temporarily unavailable
I've tried several different values for workspace_size_limit_MB and subdivisions and all fail with the same message. I was running with a single gpu, and I peaked at about 40 GB / 64 GB memory usage on the cpu.
@WongKinYiu @AlexeyAB @cenit @LukeAi
Hi everyone! Two simple questions I could not find answers to anywhere else... Even on Google Scholar for the second one...
1) Is it possible to use dynamic_minibatch=1 while using a custom resize of the network, e.g. "random=1.34"? --> Yes
2) Is it possible to use dynamic_minibatch=1 and batch_normalize=2 at the same time without messing everything up? --> Yes
3) How is it possible that the mini_batch parameter has an influence on mAP with a consistent batch size? --> Because batch normalization is done on the mini-batch size and not on the batch size.
As far as my understanding goes, the batch size is the number of samples processed before the weights update, but mini_batch is just a computational trick to avoid loading and processing the batch all at once and should not have an impact...
I would be very happy with an answer to those questions and I'm sure I am not alone not understanding.
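The answer to question 3 can be seen in a toy Python sketch: the same 8 values are normalized with different statistics depending on how the batch is split. The data and the function name are made up for illustration; this is not how darknet stores activations.

```python
from statistics import fmean, pvariance

def minibatch_stats(values, subdivisions):
    """Split one batch into equal mini-batches and return (mean, variance) of each."""
    size = len(values) // subdivisions
    chunks = [values[i * size:(i + 1) * size] for i in range(subdivisions)]
    return [(fmean(chunk), pvariance(chunk)) for chunk in chunks]

batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# subdivisions=1: one set of statistics for the whole batch
print(minibatch_stats(batch, 1))  # [(4.5, 5.25)]

# subdivisions=4: four different sets of statistics, each from 2 samples
print(minibatch_stats(batch, 4))  # [(1.5, 0.25), (3.5, 0.25), (5.5, 0.25), (7.5, 0.25)]
```

The weight update still happens once per batch in both cases, but the normalization inside each forward pass uses the per-mini-batch mean and variance, so smaller mini-batches give noisier statistics.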
What parameters can I use with an nVidia Quadro M1000M (GPU_RAM = 2GB) and an I7 + CPU_RAM = 64GB?
###
# Training
batch=64
subdivisions=8
###
width=608
height=608
###
optimized_memory=3
workspace_size_limit_MB=2000
mini_batch=16
Tried to use these, but 100h+ for training - too long.
On another PC with a GTX970 4GB and an I5 16GB, with parameters
###
# Training
batch=64
subdivisions=16
###
width=608
height=608
I've got ~16-20h of training. Classes=5, max iterations=10000.
On laptop with settings:
###
# Training
batch=64
subdivisions=32
###
width=608
height=608
### NOT USED ###
# optimized_memory=3
# workspace_size_limit_MB=2000
# mini_batch=16
getting this:
Btw this is Tiny YoloV4
@igoriok1994 What are you trying to achieve? What is your end goal or output? It will help with recommending settings.
I want to speed up training without mAP loss :)
@igoriok1994 CPU memory is very slow, in my experience 5x or more slower than regular GPU training. The benefit of CPU memory training is to increase precision (mAP) by increasing the batch size beyond the memory available on your GPU.
@WongKinYiu Hi,
I implemented part of CBN - averaging statistics inside one batch. So you can increase accuracy just by increasing batch= in the cfg-file and setting cbn=1 instead of batch_normalize=1
So batch=120 subdivisions=4 with CBN should work better than batch=120 subdivisions=4 with BN. But batch=120 subdivisions=4 with CBN will work worse than batch=120 subdivisions=1 with BN.
I.e. using batch=64 subdivisions=8 with BN, avg mini_batch_size = 8: 64/8 = 8
I.e. using batch=64 subdivisions=8 with CBN, avg mini_batch_size = 36: (8+16+24+32+40+48+56+64)/8 = 36
You can try it on the classifier csresnext50.
So inside 1 batch it will average the values of Mean and Variance. I.e. if you train with batch=64 subdivisions=16, then there will be 16 mini_batches of size 4.
- For the 1st mini_batch will use Mean[1] & Variance[1]
- For the 2nd mini_batch will use avg(Mean[1], Mean[2]) & avg(Variance[1], Variance[2])
- For the 3rd mini_batch will use avg(Mean[1], Mean[2], Mean[3]) & avg(Variance[1], Variance[2], Variance[3]) ....
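The three bullets above describe a cumulative average over the mini-batches of one batch. A minimal Python sketch of that rule follows; the function name and the sample numbers are made up, and the real implementation works per channel on the GPU.

```python
def running_averages(per_minibatch):
    """Given (mean, variance) of each mini-batch in order, return the
    (mean, variance) actually used for each mini-batch.

    The k-th mini-batch uses the plain average of the first k means and
    the first k variances, as described in the list above.
    """
    used = []
    sum_mean = 0.0
    sum_var = 0.0
    for k, (mean, var) in enumerate(per_minibatch, start=1):
        sum_mean += mean
        sum_var += var
        used.append((sum_mean / k, sum_var / k))
    return used

stats = [(1.0, 4.0), (3.0, 2.0), (5.0, 3.0)]
print(running_averages(stats))  # [(1.0, 4.0), (2.0, 3.0), (3.0, 3.0)]
```

The accumulators would be reset at the start of every batch, because the weights change after each batch and older statistics would no longer match them.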
For using:
[convolutional]
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=1
cbn=1
filters=16
size=3
stride=1
pad=1
activation=leaky
or
[convolutional]
batch_normalize=2
filters=16
size=3
stride=1
pad=1
activation=leaky
Since we change the weights (conv-weights, biases, scales) only after the whole batch has been processed, averaging inside 1 batch (without Cross-iteration) does not suffer from the problem of obsolete statistics.
Paper: https://arxiv.org/abs/2002.05712v2
I used these formulas:
Do we need to change the batch_normalize setting in every [convolutional] section of the cfg file? There are 73 convolutional sections. @AlexeyAB
@AlexeyAB have you seen this implementation for decreasing memory usage allowing larger batches with the same GPU memory? https://github.com/MegEngine/MegEngine/wiki/Reduce-GPU-memory-usage-by-Dynamic-Tensor-Rematerialization
Higher mini_batch -> higher accuracy mAP/Top1/Top5.
Training on the GPU while using CPU-RAM allows you to increase the size of the mini batch significantly, by 4x-16x or more.
You can train with a 16x higher mini_batch, but at 5x lower speed; on Yolov3-spp it should give you ~+2-4 mAP.
Use in your cfg-file:
random=1 is not supported.
Tested:
Tested on model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with width=416 height=416 on 8GB_GPU_VRAM + 32GB_CPU_RAM
./darknet detector train data/obj.data yolov3-spp.cfg -map
- default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec
- optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec
- optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec
- optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)
Not well tested yet:
- optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)
- optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)
mini_batch=24 - 24 GB VRAM RTX Titan - $2500: https://www.amazon.com/NVIDIA-Titan-RTX-Graphics-Card/dp/B07L8YGDL5
mini_batch=48 - 48 GB VRAM Quadro RTX 8000 - $5500: https://www.amazon.com/PNY-VCQRTX8000-PB-NVIDIA-Quadro-Graphic/dp/B07NH3HKG9/
mini_batch=128 - 128 GB RAM - $1700 = RTX 2080 Ti 11 GB - $1100 + $600 CPU-RAM 128 GB = 4x32 + with this software solution
mini_batch=512 - 512 GB RAM - $9200 = 48 GB VRAM Quadro RTX 8000 - $5500 + 512GB=2 x (8 x 32GB), $2600 + $1100 - CPU AMD EPYC 7401P - 32 cores, 16 memory slots up to 2 TB RAM and 128 PCIe® 3.0 lanes + with this software solution
mini_batch=512 - 512 GB VRAM (16 x 32GB Tesla V100) DGX2 - $400 000 https://www.nvidia.com/en-us/data-center/dgx-2/ + with synchronized batch normalization technique solution like: https://arxiv.org/abs/1711.07240v4
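The CPU-RAM figures reported above for optimized_memory=3 (31 GB at mini_batch=32, 62 GB at 64, 124 GB at 128) scale linearly, so they can be used for a rough estimate of the largest mini_batch a machine can hold. The Python sketch below assumes yolov3-spp at 416x416 as in the test above, ignores RAM needed by the OS and data loading, and multiplies the requirement by the number of GPUs as stated earlier in the thread. The names are hypothetical.

```python
# mini_batch -> CPU-RAM usage in GB, copied from the figures listed above
REPORTED = {32: 31, 64: 62, 128: 124}

def gb_per_sample(reported=REPORTED):
    """Average CPU-RAM in GB consumed per sample of the mini-batch."""
    return sum(gb / mb for mb, gb in reported.items()) / len(reported)

def max_minibatch(cpu_ram_gb, gpus=1, reported=REPORTED):
    """Largest mini_batch that fits, by linear extrapolation.

    Each additional GPU is assumed to need its own copy of the buffers,
    so the per-sample cost is multiplied by the number of GPUs.
    """
    return int(cpu_ram_gb / (gb_per_sample(reported) * gpus))

print(gb_per_sample())            # 0.96875
print(max_minibatch(64))          # 66
print(max_minibatch(64, gpus=4))  # 16
```

Other models and resolutions will have a different per-sample cost, so the REPORTED table would need to be replaced with measurements from your own run.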
Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt
+5 mAP@0.5