Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128

AlexeyAB commented 4 years ago

Higher mini_batch -> higher accuracy mAP/Top1/Top5.

Training on GPU by using CPU-RAM allows significantly increase the size of the mini batch 4x-16x times and more.

You can train with 16x higher mini_batch, but with 5x lower speed on Yolov3-spp, it should give you ~+2-4 mAP.

Use in your cfg-file:

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000

multi-GPU is not tested
random=1 is not supported

Tested:

GeForce RTX 2070 - 8 GB VRAM
CPU Core i7 6700K - 32 GB RAM

Tested on model https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg with wifth=416 height=416 on 8GB_GPU_VRAM + 32GB_CPU_RAM

./darknet detector train data/obj.data yolov3-spp.cfg -map

default: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=6.5 GB, iteration = 3 sec
optimized_memory=1: mini_batch=8 = batch_64 / subdivisions_8, GPU-RAM-usage=5.8 GB, iteration = 3 sec
optimized_memory=2 workspace_size_limit_MB=1000: mini_batch=20 = batch_60 / subdivisions_3, GPU-RAM-usage=5.4 GB, iteration = 15 sec
optimized_memory=3 workspace_size_limit_MB=1000: mini_batch=32 = batch_64 / subdivisions_2, GPU-RAM-usage=4.0 GB, iteration = 15 sec (CPU-RAM-usage = 31 GB)

Not well tested yet:

optimized_memory=3 workspace_size_limit_MB=2000: mini_batch=64 = batch_128 / subdivisions_2, GPU-RAM-usage=7.5 GB, iteration = 15 sec (CPU-RAM-usage = 62 GB)
optimized_memory=3 workspace_size_limit_MB=2000 or 4000: mini_batch=128 = batch_256 / subdivisions_2, GPU-RAM-usage=13.5 GB, iteration = 15 sec (CPU-RAM-usage = 124 GB)

mini_batch=24 - 24 GB VRAM RTX Titan - $2500: https://www.amazon.com/NVIDIA-Titan-RTX-Graphics-Card/dp/B07L8YGDL5
mini_batch=48 - 48 GB VRAM Quadro RTX 8000 - $5500: https://www.amazon.com/PNY-VCQRTX8000-PB-NVIDIA-Quadro-Graphic/dp/B07NH3HKG9/
mini_batch=128 - 128 GB RAM - $1700 = RTX 2080 Ti 11 GB - $1100 + $600 CPU-RAM 128 GB = 4x32 + with this software solution
mini_batch=512 - 512 GB RAM - $9200 = 48 GB VRAM Quadro RTX 8000 - $5500 + 512GB=2 x (8 x 32GB), $2600 + $1100 - CPU AMD EPYC 7401P - 32 cores, 16 memory slots up to 2 TB RAM and 128 PCIe® 3.0 lanes + with this software solution
mini_batch=512 - 512 GB VRAM (16 x 32GB Tesla V100) DGX2 - $400 000 https://www.nvidia.com/en-us/data-center/dgx-2/ + with synchronized batch normalization technique solution like: https://arxiv.org/abs/1711.07240v4

Example of trained model: yolov3-tiny_pan_gaus_giou_scale.cfg.txt

mini_batch=32 `+5 mAP@0.5`	mini_batch=8

---	---

HagegeR commented 4 years ago

do you think switching to this higher mini batch after having already train the usual way will give added value as well?

AlexeyAB commented 4 years ago

@HagegeR I didn't test it well. So just try.

In general - yes.

You can try to train the first several % of iterations with large mini_batch, then continue training with small mini_batch for fast training, and then continue training the last few percent of iterations with high mini_batch.

LukeAI commented 4 years ago

Please could you explain in more detail the meaning of the options or how to work out a good configuration? I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far. What does this mean? optimized_memory=3 workspace_size_limit_MB=1000

AlexeyAB commented 4 years ago

@LukeAI

Param optimized_memory= is related to GPU-memory optimization:

optimized_memory=0 - there is no additional memory optimization (by default)
optimized_memory=1 - there is optimized delta_gpu, instead of many arrays - it allocates 2 global_delta_gpu & state_delta_gpu arrays which will be used for the most of layers. It doesn't slowdown training, but can work incorrectly on a new models which will be made later.
optimized_memory=2 - also will be used CPU-RAM instead of GPU-VRAM for array output_gpu (output of layer), activation_input_gpu (input of activation) and x_gpu (input of batch-normalization) in each of layer
optimized_memory=3 - also it will use CPU-RAM instead of GPU-VRAM for arrays global_delta_gpu & state_delta_gpu
workspace_size_limit_MB=1000 - will be used 1000 MB for cuDNN-workspace.
- If GPU memory is not enough (CUDA out of memory), then try to reduce this value.
- If Darknet is halted or falls with strange errors - try to increase this value.
- (Try to use 1000 if you have 32 GB CPU-RAM and 2000 if 64 CPU-RAM)
- if GPU is lost - try to reboot your PC

For Yolov3-spp 416x416 model on 8GB-GPU and 32GB-CPU-RAM try to use: https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3-spp.cfg

[net]
batch=64
subdivisions=2
width=416
height=416
optimized_memory=3
workspace_size_limit_MB=1000

I'm trying to get this feature going with my custom gaussian cfg but I'm not having success so far.

What problem did you encounter?

What GPU do you use? How many CPU-RAM do you have? Rename your cfg-file to txt file and attach.

AlexeyAB commented 4 years ago

Such accuracy:

MobileNetv3 - Top1 75.37%
MixNet-S - Top1 75.68%
EfficientNetB0 - Top1 76.3%

can be achieved only if you train with very large mini_batch size (~1024):

either you use TPU-cluster ~1M$ or DGX-2 400K$ with synchronized batch-normalization (which slows down training) https://arxiv.org/abs/1711.07240v4
or you use CPU-RAM instead of GPU-RAM which 100x time cheaper, but slows down training more (except IBM Power8-CPU with nVlink between CPU & GPUs) https://github.com/AlexeyAB/darknet/issues/4386

With small mini_batch size (~32) instead of Top1 76.3% we get: https://github.com/AlexeyAB/darknet/issues/3380#issuecomment-501263052

Our EfficientNet B0 (224x224) 0.9 BFLOPS - 0.45 B_FMA (16ms / RTX 2070), 4.9M params - 71.3% Top1
Official EfficientNetB0 (224x224) 0.78 BFLOPS - 0.39 FMA, 5.3M params - 70.0% Top1

erikguo commented 4 years ago

@AlexeyAB

I tried mixnet_m_gpu.cfg with following setting :

optimized_memory=2
workspace_size_limit_MB=1000

I always get the following error:

 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
CUDA status Error: file: ./src/dark_cuda.c : () : line: 423 : build time: Dec  3 2019 - 23:02:36 
CUDA Error: invalid argument
CUDA Error: invalid argument: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

Could you help to find out the cause?

AlexeyAB commented 4 years ago

@erikguo I fixed it: https://github.com/AlexeyAB/darknet/commit/5d0352f961f4dc3db8ccad0570481c69305c0143

Just tried mixnet_m_gpu.cfg with

[net]
# Training
batch=120
subdivisions=2
optimized_memory=3
workspace_size_limit_MB=1000

erikguo commented 4 years ago

Thank you very much!

I will try now.

erikguo commented 4 years ago

By the way, I found the 'Decay' value (0.00005) is different from the other cfg(decay=0.0005) in mixnet_m_g.cfg as following:

momentum=0.9
decay=0.00005

It's a special setting for mixnet_m_gpu.cfg ? or just a type error?

@AlexeyAB

erikguo commented 4 years ago

@AlexeyAB

Still get error as following:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234                        ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 0.0005
304734
Loaded: 0.933879 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB commented 4 years ago

@erikguo Do you get this error if you disable memory optimization? Comment these lines:

#optimized_memory=3
#workspace_size_limit_MB=1000

By the way, I found the 'Decay' value (0.00005) is different from the other cfg(decay=0.0005)

Since Mixnet is a continuation of the EfficientNet that is a continuation of the (MobileNet ...), in the EfficientNet is used decay=0.00001 https://arxiv.org/pdf/1905.11946v2.pdf

weight decay 1e-5;

erikguo commented 4 years ago

After comment these lines, the training is running very well. If using these lines, it can run well occasionally. But It will crach in most of cases.

@AlexeyAB

AlexeyAB commented 4 years ago

@erikguo

How many iterations before crashing?
What is the error message?
How many CPU RAM do you have?
What GPU do you use?
Do you use GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=1 ?

erikguo commented 4 years ago

@AlexeyAB

It will crash at the first iteration.

Crash message is as the following:

Pinned block_id = 3, filled = 99.917603 % 
 241 route  240 238 236 234                        ->    9 x   3 x1200 
 242 avg                             9 x   3 x1200 ->   1200
 243 conv    600       1 x 1/ 1      1 x   1 x1200 ->    1 x   1 x 600 0.001 BF
 244 conv   1200       1 x 1/ 1      1 x   1 x 600 ->    1 x   1 x1200 0.001 BF
 245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51

 Pinned block_id = 4, filled = 98.600769 % 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 18.58 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.104122 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.
已放弃 (核心已转储)

My server has 128G memory, 4 x 1080ti 11G GPU.

Darknet is compiled with GPU=1 CUDNN=1 OPENCV=1 CUDNN_HALF=0

AlexeyAB commented 4 years ago

@erikguo

Do you use 4 x GPU for training?
What command do you use for training?
What batch and subdivisions did you set?

I just trained 2600 iterations successfully on RTX 2070 and CPU Core i7 32 GB CPU-RAM by using this command: darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.weights -topk

and this cfg-file: mixnet_m_gpu.cfg.txt

erikguo commented 4 years ago

I use only one gpu for training.

Command as following:

darknet classifier train dengdi.data mixnet_m_gpu.cfg backup/mixnet_m_gpu_last.cfg -dont_show

batch and subdivsion as following:

batch=128
subdivisions=2

mixnet_m_gpu_mem.cfg.txt

@AlexeyAB

AlexeyAB commented 4 years ago

@erikguo

Why do you use height=96 width=288 ?
I successfully run training with your cfg-file mixnet_m_gpu_mem.cfg.txt on RTX 2070 8 GB-VRAM + 32 GB CPU_RAM darknet.exe classifier train cfg/imagenet1k_c.data cfg/mixnet_m_gpu_mem.cfg backup/mixnet_m_gpu_last.weights -topk

erikguo commented 4 years ago

@AlexeyAB

I have tried the following combination:

batch=128
subdivisions=2
running very well now

batch=256
subdivisions=2
running very well now

batch=256
subdivisions=1
running crashed in the first iteration

batch=512
subdivisions=2
running crashed in the first iteration

erikguo commented 4 years ago

because my image's aspect is about 1:3 (h:w). So I set the network size with rectangle.

AlexeyAB commented 4 years ago

@erikguo Check this combination: batch=128 subdivisions=1

batch=256 subdivisions=1 running crashed in the first iteration

Show screenshot of CPU_RAM usage
Show screenshot of GPU_RAM usage
Show screenshot of the error message

erikguo commented 4 years ago

My OS is Ubuntu 16.04

this combination is crashed two times and run well one time now. The execution is not stable: batch=128 subdivisions=1

this combination is bad, always crashed: batch=256 subdivisions=1

AlexeyAB commented 4 years ago

@erikguo Try to use workspace_size_limit_MB=8000

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

erikguo commented 4 years ago

Error messages are different: One is: CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576 The other is : CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33

erikguo commented 4 years ago

The following setting crashed too. Same error as above. batch=256 subdivisions=1 optimized_memory=3 workspace_size_limit_MB=8000

Error message:

245 scale Layer: 241
 246 conv    200/   2  1 x 1/ 1      9 x   3 x1200 ->    9 x   3 x 200 0.006 BF
 247 Shortcut Layer: 231
 248 conv   1536       1 x 1/ 1      9 x   3 x 200 ->    9 x   3 x1536 0.017 BF
 249 avg                             9 x   3 x1536 ->   1536
 250 dropout       p = 0.25                  1536  ->   1536
 251 conv     51       1 x 1/ 1      1 x   1 x1536 ->    1 x   1 x  51 0.000 BF
 252 softmax                                          51
Try to allocate new pinned memory, size = 972 MB 

 Pinned block_id = 14, filled = 96.900558 % 
Try to allocate new pinned BLOCK, size = 81 MB 

 Pinned block_id = 15, filled = 95.586395 % 
Try to allocate new pinned BLOCK, size = 50 MB 

 Pinned block_id = 16, filled = 99.300003 % 
Try to allocate new pinned BLOCK, size = 12 MB 

 Pinned block_id = 17, filled = 99.920654 % 
Try to allocate new pinned BLOCK, size = 7 MB 
Total BFLOPS 0.592 
 Allocate additional workspace_size = 160.59 MB 
Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 253 layers from weights-file 
Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.654202 seconds
CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  3 2019 - 23:02:38 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB commented 4 years ago

@erikguo

Just to localize the problem, try to comment these 2 lines temporary and recompile:

Then try

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

erikguo commented 4 years ago

After recompiling, the error changed as following:

Learning Rate: 0.064, Momentum: 0.9, Decay: 5e-05
304734
Loaded: 1.968368 seconds
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  4 2019 - 23:47:36 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@AlexeyAB

AlexeyAB commented 4 years ago

@erikguo Also try to comment this line and recompile: https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/src/network_kernels.cu#L119

erikguo commented 4 years ago

@AlexeyAB

After recompiling, run two times with same command and same cfg.

The first error:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 576 : build time: Dec  5 2019 - 00:02:32 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

The second error:

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 668 : build time: Dec  5 2019 - 00:02:32 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

AlexeyAB commented 4 years ago

@erikguo Ok, Thanks I will try to find a bug. Just to be sure, and you also comment these both lines?

erikguo commented 4 years ago

@AlexeyAB ,

After comment and recompile, the error change to ：

CUDA status Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec  5 2019 - 00:14:54 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists

Now I commented three files: darknet/src/blas_kernels.cu darknet/src/network_kernels.cu darknet/src/dropout_layer_kernels.cu

AlexeyAB commented 4 years ago

@erikguo Thanks. Can you compile with DEBUG=1 in the Makefile and run training again? https://github.com/AlexeyAB/darknet/blob/efc5478a23a3a3c66d6feefc6d6b485f13503bde/Makefile#L14

erikguo commented 4 years ago

@AlexeyAB ,

Errors:

 cuDNN status = cudaDeviceSynchronize() Error in: file: ./src/convolutional_kernels.cu : () : line: 823 : build time: Dec  5 2019 - 00:41:31 
cuDNN Error: CUDNN_UNKNOWN_STATUS
cuDNN Error: CUDNN_UNKNOWN_STATUS: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB commented 4 years ago

@erikguo Thanks.

Also do you get this issue if you remove [dropout] layer from the end of your cfg-file?

erikguo commented 4 years ago

@AlexeyAB

I have comment [dropout] layer from the end of cfg.

Not stable now. Crashed at first times and third time. Run well at second time. Got same error:

CUDA status = cudaDeviceSynchronize() Error: file: ./src/blas_kernels.cu : () : line: 564 : build time: Dec  5 2019 - 00:41:31 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

It seems that it maybe crashed in the mid of first iteration, because I should wait about 15s then it crashed.

AlexeyAB commented 4 years ago

@erikguo

Run well at second time.

When it starts up well, will it crash later? Or will it work well until the end?

erikguo commented 4 years ago

@AlexeyAB ,

When I said running well, means it can run more than 10 iterations very well and not crashed. I just Ctrl-C to interrupt it to run another time.

AlexeyAB commented 4 years ago

@erikguo

When I said running well, means it can run more than 10 iterations very well and not crashed.

So undo all these changes

Compile with DEBUG=0

Set

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

And try to run several times, when it starts up well, let it work further, will it crash later?

erikguo commented 4 years ago

OK, I will try it. Report back later

erikguo commented 4 years ago

@AlexeyAB

I undo all comment done last night. and recompiled.

If I leave [dropout] layer uncommented, it always crash immediately after loading cfg and weight file. So I commented the [dropout] layer in cfg.

I run several times. It crashed randomly in the first iteration. However, once it can finished the first iteration, it will run well, never crash. But the loss are going to be nan after several iteration, even I lower the learning rate.

The following is the running logs:

Loading weights from backup_all/mixnet_m_gpu_last.weights...
 seen 64 
Done! Loaded 252 layers from weights-file 
Learning Rate: 0.016, Momentum: 0.9, Decay: 5e-05
382473
Loaded: 1.856072 seconds
71457, 47.828: 0.002255, 0.002255 avg, 0.011005 rate, 41.015572 seconds, 18292992 images
Loaded: 0.000049 seconds
71458, 47.829: 0.006014, 0.002631 avg, 0.011005 rate, 42.169849 seconds, 18293248 images
Loaded: 0.000042 seconds
71459, 47.830: 4.707831, 0.473151 avg, 0.011005 rate, 40.191479 seconds, 18293504 images
Loaded: 0.000060 seconds
71460, 47.830: nan, nan avg, 0.011005 rate, 39.108238 seconds, 18293760 images
Loaded: 0.000100 seconds
71461, 47.831: nan, nan avg, 0.011005 rate, 39.716503 seconds, 18294016 images
Loaded: 0.000058 seconds
71462, 47.832: nan, nan avg, 0.011004 rate, 39.298252 seconds, 18294272 images
Loaded: 0.000081 seconds
71463, 47.832: nan, nan avg, 0.011004 rate, 39.715801 seconds, 18294528 images
Loaded: 0.000074 seconds
71464, 47.833: nan, nan avg, 0.011004 rate, 39.716663 seconds, 18294784 images
Loaded: 0.000061 seconds
71465, 47.834: nan, nan avg, 0.011004 rate, 39.147743 seconds, 18295040 images
Loaded: 0.000084 seconds
71466, 47.834: nan, nan avg, 0.011004 rate, 39.735199 seconds, 18295296 images
Loaded: 0.000074 seconds
71467, 47.835: nan, nan avg, 0.011004 rate, 40.027672 seconds, 18295552 images
Loaded: 0.000072 seconds
71468, 47.836: nan, nan avg, 0.011004 rate, 39.932713 seconds, 18295808 images
Loaded: 0.000073 seconds
71469, 47.836: nan, nan avg, 0.011004 rate, 39.481960 seconds, 18296064 images
Loaded: 0.000114 seconds
71470, 47.837: nan, nan avg, 0.011004 rate, 40.012989 seconds, 18296320 images
Loaded: 0.000082 seconds
71471, 47.838: nan, nan avg, 0.011004 rate, 39.614643 seconds, 18296576 images
Loaded: 0.000069 seconds
71472, 47.838: nan, nan avg, 0.011004 rate, 39.501343 seconds, 18296832 images
Loaded: 0.000077 seconds
71473, 47.839: nan, nan avg, 0.011004 rate, 39.760441 seconds, 18297088 images
Loaded: 0.000063 seconds
71474, 47.840: nan, nan avg, 0.011004 rate, 39.416786 seconds, 18297344 images
Loaded: 0.000070 seconds
71475, 47.840: nan, nan avg, 0.011004 rate, 39.673023 seconds, 18297600 images
Loaded: 0.000075 seconds
71476, 47.841: nan, nan avg, 0.011004 rate, 39.329891 seconds, 18297856 images
Loaded: 0.000077 seconds
71477, 47.842: nan, nan avg, 0.011004 rate, 40.461945 seconds, 18298112 images
Loaded: 0.000072 seconds
71478, 47.842: nan, nan avg, 0.011003 rate, 39.966011 seconds, 18298368 images
Loaded: 0.000063 seconds
71479, 47.843: nan, nan avg, 0.011003 rate, 39.231728 seconds, 18298624 images
Loaded: 0.000070 seconds
71480, 47.844: nan, nan avg, 0.011003 rate, 39.738995 seconds, 18298880 images
Loaded: 0.000096 seconds
71481, 47.844: nan, nan avg, 0.011003 rate, 40.647068 seconds, 18299136 images
Loaded: 0.000089 seconds
71482, 47.845: nan, nan avg, 0.011003 rate, 41.785786 seconds, 18299392 images
Loaded: 0.000087 seconds
71483, 47.846: nan, nan avg, 0.011003 rate, 40.824448 seconds, 18299648 images
Loaded: 0.000105 seconds
71484, 47.846: nan, nan avg, 0.011003 rate, 40.963627 seconds, 18299904 images
Loaded: 0.000076 seconds
71485, 47.847: nan, nan avg, 0.011003 rate, 40.498711 seconds, 18300160 images
Loaded: 0.000076 seconds
71486, 47.848: nan, nan avg, 0.011003 rate, 39.802647 seconds, 18300416 images
Loaded: 0.000075 seconds
71487, 47.848: nan, nan avg, 0.011003 rate, 40.423454 seconds, 18300672 images
Loaded: 0.000061 seconds
71488, 47.849: nan, nan avg, 0.011003 rate, 39.450256 seconds, 18300928 images
Loaded: 0.000083 seconds
71489, 47.850: nan, nan avg, 0.011003 rate, 40.406216 seconds, 18301184 images
Loaded: 0.000068 seconds
71490, 47.850: nan, nan avg, 0.011003 rate, 39.633228 seconds, 18301440 images
Loaded: 0.000076 seconds
71491, 47.851: nan, nan avg, 0.011003 rate, 39.777164 seconds, 18301696 images
Loaded: 0.000073 seconds

This is the errors once it crashed:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

And I also test enet-b0-nog.cfg (I remove all groups in [convolution] layers) with the following:

batch=256
subdivisions=1
optimized_memory=3
workspace_size_limit_MB=8000

Even I commented [dropout], it always crashed with error:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

AlexeyAB commented 4 years ago

Even I commented [dropout], it always crashed with error:

CUDA status Error: file: ./src/dropout_layer_kernels.cu : () : line: 33 : build time: Dec  5 2019 - 21:46:20 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

This is very strange, how can it crashes in DropOut layer if you commented all DropOut layers. Just to know, there are many DropOut layers in the EfficientNet

erikguo commented 4 years ago

@AlexeyAB ,

you are right that I forgot to comment all [dropout]. After commented all, the errors message:

# running nvidia-smi command found new messages when it crashed:
GPU 00000000:03:00.0: Detected Critical Xid Error
GPU 00000000:03:00.0: Detected Critical Xid Error

# crashed errors:
CUDA status Error: file: ./src/dark_cuda.c : () : line: 446 : build time: Dec  5 2019 - 21:46:18 
CUDA Error: an illegal memory access was encountered
CUDA Error: an illegal memory access was encountered: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

erikguo commented 4 years ago

For enet-b0-nog.cfg, I should set batch=96, subdivision=1, then it will run stably with CPU-RAM optimized_memory=3.

erikguo commented 4 years ago

I found a case: once it crashed, it's very hard to rerun well. You should wait some time, then try again. Maybe it can run well this time. It look like some memory not be released ? or some remainer in the memory, and wait for some time, OS system clean the remainer automatically

@AlexeyAB

AlexeyAB commented 4 years ago

@erikguo I noticed, that if it is crashed, especially with Out-of-memory (CPU/GPU-memory), then GPU-hardware device can be lost, so you should wait or reboot PC.

For enet-b0-nog.cfg, I should set batch=96, subdivision=1, then it will run stably with CPU-RAM optimized_memory=3.

What CPU-RAM and GPU-VRAM usage do you get?

erikguo commented 4 years ago

There's a lot of free CPU memory and GPU memory. After the crashed, I can run the other training very well immediately, but cannot run the CPU-MEM training.

In Windows, when crashed, GPU card will be lost. But in Ubuntu, it won't be lost. I used windows and ubuntu both before.

@AlexeyAB

erikguo commented 4 years ago

According to this phenomenon, I guess some memory allocation is 'random', so when the allocation is right, then no crash. Otherwise, It crash.

AlexeyAB commented 4 years ago

@erikguo

For the Pinned CPU-RAM should be allocated sequential physical block with 1GB size, so if you have 128 GB CPU-RAM, and you ran 128 applications each of which consumes 1 byte in each of 128 GB, then the Pinned memory can not be allocated at all. F.e. if you run 64 applications each of which consumes 1 byte in each of 64 GB, then can be allocated only 64 GB Pinned memory.

Maybe this is the reason for this behavior:

According to this phenomenon, I guess some memory allocation is 'random', so when the allocation is right, then no crash. Otherwise, It crash.

So strongly recommended reboot the system before runing Darknet with GPU-processing + CPU-RAM using, and don't load any other applications.

HagegeR commented 4 years ago

why does it need to be sequential?

AlexeyAB commented 4 years ago

@HagegeR

Oh yes, the Pinned CPU-memory blocks (GPU-Direct 1.0) do not have to be completely sequential. I confused this with GPU-Direct 3.0 (RDMA) when the GPU uses the CPU-memory of the remote computer through the Infiniband - in this case, the mapped memory should be a physically sequential block: GPU -> PCIe -> Computer_1(PCIeController) -> Infiniband -> Computer_2(PCIeController) -> CPU_RAM

left scheme RDMA_P2P_bars

erikguo commented 4 years ago

@AlexeyAB ,

I see. I will stop other applications on the server and try again at weekend.

BTW, did you run the CPU-MEM training with 4 GPUs together?

AlexeyAB / darknet

Beta: Using CPU-RAM instead of GPU-VRAM for large Mini_batch=32 - 128 #4386