apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

MxNet R, CNN: VRAM consumption explodes dramatically with the number of filters #10721

Closed thomasmooon closed 6 years ago

thomasmooon commented 6 years ago

Description

I have a toy dataset of 360 samples with 4096 data points each, giving a tensor of shape (4096, 1, 360); each observation is therefore only ~4 kB. The CNN is very simple: Conv -> flatten -> fully connected -> fully connected -> softmax (see the attached network diagram).

VRAM consumption explodes with the number of filters: please see the table and the related plot below. Kernel size and batch size have very little influence; I tested several combinations, but omit those details for now. The table shows a setting using 2 GPUs of my environment (described below). As expected, the VRAM demand of each card increases linearly with the number of convolution filters. But as soon as it exceeds 10, the GPUs run out of their 8 GB VRAM. What the hell...?

It is also remarkable that a setting with 1 GPU and 8 filters is not possible: it exhausts the 8 GB VRAM of the single card. But using 2 GPUs with everything else unchanged, each GPU consumes only 0.477 GB, i.e. 2 x 0.477 = 0.95 GB in total. This is far below what is consumed when using only 1 card. How can this be??

Other things tested, without any effect: the workspace argument of the mx.symbol.Convolution() function. I tried several values: 1, 64, 128, 512 MB. This had absolutely no effect, regardless of the number of filters. Here is the definition of workspace:

long (non-negative), optional, default=1024 Maximum temporary workspace allowed for convolution (MB)

VRAM consumption per card as a function of the number of filters, using 2 GPUs:

n_filter    VRAM / card (MB)
1           313
2           339
4           385
8           477
10          523
11          out of memory

(plot: VRAM per card vs. number of filters)
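The relationship is easy to re-plot from the table; a minimal base-R sketch:

# re-plot VRAM per card vs. number of filters from the table above
vram <- data.frame(
  n_filter = c(1, 2, 4, 8, 10),
  mb_per_card = c(313, 339, 385, 477, 523)
)
plot(vram$n_filter, vram$mb_per_card, type = "b",
     xlab = "number of convolution filters",
     ylab = "VRAM per card (MB)")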

In addition, I measured RAM consumption with the CPU as device, i.e. without any GPU, for 10, 11 and 20 filters. RAM consumption increases linearly, in particular from 10 to 11 filters, rather than exploding as it does on the GPUs. This is confusing. Moreover, RAM consumption with 10 filters is 9 GB, consistent with the observation that the 8 GB VRAM of one GPU is insufficient, but again in contradiction to the 0.95 GB when 2 GPUs are used.

(plot: CPU RAM consumption for 10, 11 and 20 filters)

For R users, please provide R sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] bindrcpp_0.2 mxnet_0.10.1

loaded via a namespace (and not attached): [1] Rcpp_0.12.12 compiler_3.4.3 RColorBrewer_1.1-2 influenceR_0.1.0
[5] plyr_1.8.4 bindr_0.1 viridis_0.4.0 tools_3.4.3
[9] digest_0.6.12 jsonlite_1.5 tibble_1.3.3 gtable_0.2.0
[13] viridisLite_0.2.0 rgexf_0.15.3 pkgconfig_2.0.1 rlang_0.1.1
[17] igraph_1.1.2 rstudioapi_0.6 yaml_2.1.14 gridExtra_2.2.1
[21] DiagrammeR_0.9.0 dplyr_0.7.2 stringr_1.2.0 htmlwidgets_0.9
[25] grid_3.4.3 glue_1.1.1 R6_2.2.2 Rook_1.1-1
[29] XML_3.98-1.9 ggplot2_2.2.1 magrittr_1.5 codetools_0.2-15
[33] scales_0.4.1 htmltools_0.3.6 assertthat_0.2.0 colorspace_1.3-2
[37] brew_1.0-6 stringi_1.1.5 visNetwork_2.0.0 lazyeval_0.2.0
[41] munsell_0.4.3

Hardware

8 x GTX 1080 Ti, 60 GB RAM, 12 cores

CUDA version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

Minimum reproducible example


require(mxnet)

# create toy data
#-------------------------------------------------------------------------------

nSamp <- 360
nObs <- 64*64 # = 4096

# create labels
nLabel <- 40
set.seed(1) # seed for label sampling
label <- sample(1:nLabel, nSamp, replace = TRUE)

# create training data set: one Poisson sample of length nObs per label
train <- sapply(label, function(x) rpois(nObs, x)) # dim = 4096 x 360 = nObs x nSamp
dim(train) <- c(4096, 1, 1, 360) # reshape to a 4-D array for the iterator

trainIter <-
  mx.io.arrayiter(
    data = train,
    label = label,
    batch.size = 128,
    shuffle = TRUE
  )
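As a quick sanity check, one batch can be pulled from the iterator by hand; a minimal sketch, assuming the standard R iterator methods reset(), iter.next() and value():

# pull a single batch and inspect its dimensions
trainIter$reset()
trainIter$iter.next()
batch <- trainIter$value()
dim(batch$data) # expected: 4096 1 1 128 (last dimension = batch.size)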

# measure influence to VRAM / RAM demand
#-------------------------------------------------------------------------------

# Note: the results in this example differ slightly from the numbers above,
# because the real experiment used real data, whereas here the data are random
# numbers for reproducibility. In this example the VRAM explosion starts when
# exceeding 10 filters.

kernel <- c(64*12,1) 

# example with 1 GPU ####
  # nGPU <- 1
  # array.batch.size <- 1
  # workspace <- 1024
  # num_filter <- 1 # 3.8 GB VRAM 
  # num_filter <- 2  # 6.4 GB VRAM
  # num_filter <- 3 # out of memory

# example with 2 GPU ###
  nGPU <- 2
  array.batch.size <- 1
  workspace <- 1024
  # num_filter <- 1 # 0.313 GB VRAM / Card
  # num_filter <- 2  # 0.339 GB VRAM / Card
  # num_filter <- 4  # 0.385
  # num_filter <- 8  # 0.477
  num_filter <- 10  # 0.523
  # num_filter <- 11  # out of memory
  # num_filter <- 16  # out of memory

# device setup
#-------------------------------------------------------------------------------
devices <- lapply(seq(nGPU) - 1, mx.gpu) # one context per GPU: mx.gpu(0), mx.gpu(1), ...
# devices <- mx.cpu() # uncomment (and comment the line above) to run on the CPU

# Set up the symbolic model
#-------------------------------------------------------------------------------
data <- mx.symbol.Variable('data')
# convolution
conv_1 <- mx.symbol.Convolution(data = data, kernel = kernel, num_filter = num_filter, workspace = workspace) 
tanh_1 <- mx.symbol.Activation(data = conv_1, act_type = "tanh")
# 1st fully connected layer
bn_2 <- mx.symbol.BatchNorm(data = tanh_1)
flatten <- mx.symbol.Flatten(data = bn_2)
fc_1 <- mx.symbol.FullyConnected(data = flatten, num_hidden = 500)
tanh_3 <- mx.symbol.Activation(data = fc_1, act_type = "tanh")
# 2nd fully connected layer
bn_3 <- mx.symbol.BatchNorm(data = tanh_3)
fc_2 <- mx.symbol.FullyConnected(data = bn_3, num_hidden = 40)
# Output. Softmax output since we'd like to get some probabilities.
NN_model <- mx.symbol.SoftmaxOutput(data = fc_2)

# graph.viz(NN_model)
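The source of the memory growth can be traced with shape inference on the symbol: the flatten output, and hence the fc_1 weight matrix, scales linearly with num_filter. A minimal sketch, assuming mx.symbol.infer.shape with a batch size of 1 (argument names are the auto-generated ones and may differ):

# infer all argument shapes for a single-sample batch; the *_weight entry of
# the first fully connected layer grows linearly with num_filter
shp <- mx.symbol.infer.shape(NN_model, data = c(4096, 1, 1, 1))
str(shp$arg.shapes)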

# Pre-training set up
#-------------------------------------------------------------------------------

# Set seed for reproducibility
mx.set.seed(100)

# Training
#-------------------------------------------------------------------------------

# Train the model
model <- mx.model.FeedForward.create(
  NN_model,
  kvstore = "local",
  X = trainIter,
  ctx = devices,
  num.round = 150,
  learning.rate = 0.01,
  momentum = 0.9,
  eval.metric = mx.metric.accuracy,
  epoch.end.callback = mx.callback.log.train.metric(array.batch.size)
)
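Since VRAM usually reaches its plateau within the first round, a one-round run makes for a quick memory check before committing to the full 150 rounds; a minimal sketch with otherwise identical settings:

# one-round smoke test: watch nvidia-smi while this runs
smoke <- mx.model.FeedForward.create(
  NN_model,
  kvstore = "local",
  X = trainIter,
  ctx = devices,
  num.round = 1,
  learning.rate = 0.01,
  momentum = 0.9,
  eval.metric = mx.metric.accuracy
)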

Steps to reproduce

Comment / uncomment the lines in the section

# measure influence to VRAM / RAM demand
#-------------------------------------------------------------------------------

and use nvidia-smi -l 3 to monitor memory consumption. I recommend running the script from the shell rather than inside an interactive R session, for convenience (the R session will crash once VRAM is exhausted).

To measure RAM consumption on the CPU instead, switch the device in the following section (use mx.cpu() instead of the GPU list) and monitor it e.g. with htop:

# device setup
#-------------------------------------------------------------------------------

What have you tried to solve it?

Varied these parameters: number of filters, kernel size, batch size, workspace, and the number of GPUs (see above).

jeremiedb commented 6 years ago

@thomasmooon maybe you can test whether #11374 effectively solves this RAM consumption issue?

jeremiedb commented 6 years ago

@thomasmooon I just ran your example with num_filter = 32 and no workspace parameter, and the model ran properly on a single 1060, staying stable at around 2.7 GB RAM on the GPU.

@nswamy Can you close this issue?

nswamy commented 6 years ago

thanks @jeremiedb

thomasmooon commented 6 years ago

@jeremiedb I was on vacation and just read your posts. Thanks for your suggestion. But in the meantime, a few weeks after I opened the issue, I switched to another DL framework for several reasons.

jeremiedb commented 6 years ago

@thomasmooon Sure, I understand, as the support for the R package hasn't been great. May I ask whether there were other specific features you found lacking? Thanks!

thomasmooon commented 6 years ago

@jeremiedb Well, in general my experience is that better documentation is desirable, especially minimal reproducible, runnable R examples for each layer / method. If I were to restart with MXNet, I'd first learn Python and then use MXNet's Python API. This doesn't answer your "specific feature" question: there were / are a lot of small things in my use cases that required hacking around a lot in MXNet, whereas in my framework of current choice this is not the case. Special hallmarks of MXNet, like its relatively high speed, are valuable in general, of course, but not that critical in my case.