deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a modular and tiny C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0
13.55k stars 3.83k forks

VGG16 in zoo using 10GB of memory on batch 16 #6889

Open crockpotveggies opened 5 years ago

crockpotveggies commented 5 years ago

Running an updated example using gradient sharing and VGG16 from the zoo. With a batch size of 16, the zoo model uses 10GB of memory with default settings. This happens even when isolated from ParallelWrapper. Have also tried setting WorkspaceMode and CacheMode.
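For reference, a minimal sketch of the kind of configuration being tried here, assuming the beta-era zoo API (the builder option names such as `workspaceMode`/`cacheMode`, and `numClasses(200)` for TinyImageNet, are illustrative and may differ across versions):

```java
import org.deeplearning4j.nn.conf.CacheMode;
import org.deeplearning4j.nn.conf.WorkspaceMode;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.zoo.PretrainedType;
import org.deeplearning4j.zoo.ZooModel;
import org.deeplearning4j.zoo.model.VGG16;

public class Vgg16MemoryRepro {
    public static void main(String[] args) throws Exception {
        // Build VGG16 from the model zoo; workspaceMode/cacheMode are the
        // settings mentioned above (names as in the beta-era zoo builders)
        ZooModel zooModel = VGG16.builder()
                .numClasses(200)                      // e.g. TinyImageNet
                .workspaceMode(WorkspaceMode.ENABLED) // also tried NONE
                .cacheMode(CacheMode.NONE)            // also tried DEVICE/HOST
                .build();

        // Load pretrained ImageNet weights, then train with batch size 16
        ComputationGraph vgg16 =
                (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
        System.out.println(vgg16.summary());
    }
}
```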

Try running this commit: https://github.com/deeplearning4j/dl4j-examples/blob/83a84f90ee0c9fd107c662bea74a5d578ce9322a/dl4j-cuda-specific-examples/src/main/java/org/deeplearning4j/examples/multigpu/vgg16/MultiGpuVGG16TinyImageNetExample.java

Aha! Link: https://skymindai.aha.io/features/DL4J-36

crockpotveggies commented 5 years ago

Machine specs:

AlexDBlack commented 5 years ago

OK, so this is also present on 1.0.0-beta3 (i.e., also seeing around 10GB of memory used). Here's a memory report (without cuDNN), which provides some insight: https://gist.github.com/AlexDBlack/2ffc2a9de0fd5fc5727af05f531bb937

WS_LAYER_WORKING_MEM at 3.71GB seems excessive (this is the working memory required by the layer with the largest working memory). WS_ALL_LAYERS_ACT at 2.01GB is about 2x the estimated "Total Activations Memory" (929.98 MB), which also seems off; it should be about the same.
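Figures like "Total Activations Memory" come from DL4J's memory-report estimate. A sketch of producing one for comparison against actual runtime usage, assuming a zoo-built `ComputationGraph` named `vgg16` and the beta-era `getMemoryReport`/`InputType` API:

```java
import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
import org.deeplearning4j.nn.conf.inputs.InputType;
import org.deeplearning4j.nn.conf.memory.MemoryReport;

// Estimate per-layer and total memory for 224x224 RGB input, then compare
// against what is actually allocated at runtime (the ~10GB observed here)
ComputationGraphConfiguration conf = vgg16.getConfiguration();
MemoryReport report = conf.getMemoryReport(InputType.convolutional(224, 224, 3));
System.out.println(report.toString());
```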

AlexDBlack commented 5 years ago

As for with cuDNN enabled: we're looking at a peak of a bit under 7GB total (1GB of that is used by the OS, so more like 6GB for the process). WS_LAYER_WORKING_MEM is obviously very low. No difference for WS_ALL_LAYERS_ACT. https://gist.github.com/AlexDBlack/e02b99d259f610de61163640d9602a05

"Use CuDNN" is an obvious first step/workaround. Still looking into workspaces size.