apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

R version CUDA failure on Amazon ec2 (ubuntu14.04) #5755

Closed ceolium closed 7 years ago

ceolium commented 7 years ago

Environment info

Operating System: Amazon ec2 ubuntu 14.04LTS

Compiler: gcc

Package used (Python/R/Scala/Julia): R

MXNet version:

Or if installed from source:

MXNet commit hash (git rev-parse HEAD):

If you are using python package, please provide

Python version and distribution:

If you are using R package, please provide

R sessionInfo(): 3.3.3 RC

I am trying to install the MXNet R version on Amazon Web Services EC2 (Ubuntu 14.04 LTS) by following the instructions at http://mxnet.io/get_started/ubuntu_setup.html.

First, I downloaded the CUDA 8 toolkit from NVIDIA.

sudo dpkg -i cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
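
A quick sanity check after the toolkit install (a suggested step, not from the original report) is to confirm which CUDA version the installed compiler reports:

# Confirm the CUDA toolkit version that nvcc reports (path assumes the default toolkit location).
/usr/local/cuda/bin/nvcc --version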

Then I downloaded the latest cuDNN file (cudnn-8.0-linux-x64-v6.0.tgz) and transferred it to the EC2 instance by scp.
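
The transfer step itself might look roughly like the following; the key file and host name are placeholders, not the actual values used:

# Copy the cuDNN archive from the local machine to the EC2 instance (hypothetical key and host).
scp -i ~/.ssh/my-ec2-key.pem cudnn-8.0-linux-x64-v6.0.tgz ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/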

In the EC2 console (accessed via SSH), I typed

tar xvzf cudnn-8.0-linux-x64-v5.1-ga.tgz
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig 
(Though I originally transferred the CUDA install file to /usr/local/, so those two cp lines copy the files into my local directory.)
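
After the copy and ldconfig, one way to confirm that the dynamic linker can actually see the cuDNN library (a suggested check, not part of the original steps):

# List the libraries in the linker cache and filter for cuDNN.
ldconfig -p | grep libcudnn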

Then I installed the MXNet source from git, created the config.mk file, and modified config.mk to set USE_CUDA=1 and so on (for GPU usage). I then moved to the setup-utils directory and ran the Ubuntu R installation shell script.

git clone https://github.com/dmlc/mxnet.git ~/mxnet --recursive

cd ~/mxnet
cp make/config.mk .
# If building with GPU, add configurations to config.mk file:
echo "USE_CUDA=1" >>config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
echo "USE_CUDNN=1" >>config.mk

cd ~/mxnet/setup-utils
bash install-mxnet-ubuntu-r.sh
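
If the GPU build succeeded, the resulting shared library should link against the CUDA runtime. A quick check, assuming the build places the library at ~/mxnet/lib/libmxnet.so (an assumption about the usual make output location, not stated in the original report):

# Verify the freshly built libmxnet.so is linked against the CUDA and cuDNN libraries.
ldd ~/mxnet/lib/libmxnet.so | grep -iE "cuda|cudnn"
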
Of course, I added the environment variables with the following commands:

export CUDA_HOME=/usr/local/cuda-8.0
export CUDA_ROOT=/usr/local/cuda-8.0/bin
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda
PATH=${CUDA_HOME}/bin:${PATH}
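
These exports only apply to the current shell session; to have them survive a new SSH login (and batch jobs started from a login shell), one option is to append them to ~/.bashrc. A minimal sketch, assuming bash is the login shell:

# Persist the CUDA environment variables for future sessions (assumes bash is the login shell).
cat >> ~/.bashrc <<'EOF'
export CUDA_HOME=/usr/local/cuda-8.0
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
export PATH=${CUDA_HOME}/bin:${PATH}
EOF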

FYI, I checked that the NVIDIA driver is installed properly with the 'nvidia-smi' command.

I launched R and typed

library(mxnet)

Then the output was

Rcpp Init>

I ran some test code for mxnet and it worked fine.

So I proceeded to run code that uses the GPU (LeNet):

require(mxnet)
train <- read.csv('train.csv', header=TRUE)
test <- read.csv('test.csv', header=TRUE)
train <- data.matrix(train)
test <- data.matrix(test)

train.x <- train[,-1]
train.y <- train[,1]

train.x <- t(train.x/255)
test <- t(test/255)

# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
                      kernel=c(2,2), stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
                      kernel=c(2,2), stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)

train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))
n.gpu <- 4
device.gpu <- lapply(0:(n.gpu-1), function(i) {
  mx.gpu(i)
})
mx.set.seed(0)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
                                     ctx=device.gpu, num.round=5, array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9, wd=0.00001,
                                     eval.metric=mx.metric.accuracy,
                                     epoch.end.callback=mx.callback.log.train.metric(100))

This is basic tutorial code from the MXNet page.

But I got the following error messages:

Auto-select kvstore type = local_update_cpu
Start training with 4 devices
[07:05:37] /root/mxnet/dmlc-core/include/dmlc/logging.h:300: [07:05:37] src/storage/storage.cc:77: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f296b8659cc]
[bt] (1) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed1be3) [0x7f296c51cbe3]
[bt] (2) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed43c3) [0x7f296c51f3c3]
[bt] (3) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x3f) [0x7f296c51f77f]
[bt] (4) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(MXNDArrayCreate+0x63d) [0x7f296c0e83bd]
[bt] (5) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN5mxnet1R7NDArray5EmptyERKN4Rcpp9DimensionERKNS2_6VectorILi19ENS2_15PreserveStorageEEE+0xdd) [0x7f295ac7ebbd]
[bt] (6) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN4Rcpp12CppFunction2INS_4XPtrIN5mxnet1R6NDBlobENS_15PreserveStorageEXadL_ZNS_25standard_delete_finalizerIS4_EEvPT_EELb0EEERKNS_9DimensionERKNS_6VectorILi19ES5_EEEclEPP7SEXPREC+0xd2) [0x7f295ac8b552]
[bt] (7) /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so(_Z23InternalFunction_invokeP7SEXPREC+0xd1) [0x7f2971c69cd1]
[bt] (8) /usr/lib/R/lib/libR.so(+0xce3c1) [0x7f29762a83c1]
[bt] (9) /usr/lib/R/lib/libR.so(Rf_eval+0x6fb) [0x7f29762ed5ab]

Error in mx.nd.internal.empty.array(shape, ctx) :
  [07:05:37] src/storage/storage.cc:77: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f296b8659cc]
[bt] (1) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed1be3) [0x7f296c51cbe3]
[bt] (2) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed43c3) [0x7f296c51f3c3]
[bt] (3) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x3f) [0x7f296c51f77f]
[bt] (4) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(MXNDArrayCreate+0x63d) [0x7f296c0e83bd]
[bt] (5) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN5mxnet1R7NDArray5EmptyERKN4Rcpp9DimensionERKNS2_6VectorILi19ENS2_15PreserveStorageEEE+0xdd) [0x7f295ac7ebbd]
[bt] (6) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN4Rcpp12CppFunction2INS_4XPtrIN5mxnet1R6NDBlobENS_15PreserveStorageEXadL_ZNS_25standard_delete_finalizerIS4_EEvPT_EEL

I want to make sure of the following:

I modified the config.mk file before I actually compiled with 'bash install-mxnet-ubuntu-r.sh'.

I changed the environment variables in as many ways as possible.

I repeated the above steps at least 7 times.

My final goal is to run a script that uses the MXNet LeNet model as a batch job (R CMD BATCH ~.R).

I would really appreciate it if someone could actually solve my problem.
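
(For reference: the "Compile with USE_CUDA=1" message means the libmxnet.so that R loaded is a CPU-only build. One way to check the installed copy directly, using the path from the stack trace above; this is a suggested diagnostic, not something from the original report:)

# If this prints no CUDA/cuDNN libraries, the installed R package was built without GPU support.
ldd /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so | grep -iE "cuda|cudnn"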

thirdwing commented 7 years ago

Can you show your log when compiling mxnet?

ceolium commented 7 years ago

ps -Wno-unused-variable -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -I/usr/local/cuda-8.0/include -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv   -fopenmp -DMSHADOW_USE_CUDNN=1  -I/root/mxnet/cub -DMXNET_USE_NVRTC=0 -MMD -c src/operator/convolution_v1.cc -o build/src/operator/convolution_v1.o
g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -O3 -I/root/mxnet/mshadow/ -I/root/mxnet/dmlc-core/include -fPIC -I/root/mxnet/nnvm/include -Iinclude -funroll-loops -Wno-unused-variable -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -I/usr/local/cuda-8.0/include -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv   -fopenmp -DMSHADOW_USE_CUDNN=1  -I/root/mxnet/cub -DMXNET_USE_NVRTC=0 -MMD -c src/operator/correlation.cc -o build/src/operator/correlation.o
g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -O3 -I/root/mxnet/mshadow/ -I/root/mxnet/dmlc-core/include -fPIC -I/root/mxnet/nnvm/include -Iinclude -funroll-loops -Wno-unused-variable -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -I/usr/local/cuda-8.0/include -DMSHADOW_USE_CBL
ceolium commented 7 years ago

::op::ActivationParam) [with DType=mshadow::half::half_t]"
src/operator/activation.cu(27): here

src/operator/./cudnn_activation-inl.h(137): warning: variable "beta" was declared but never referenced
          detected during:
            instantiation of "void mxnet::op::CuDNNActivationOp<DType>::Backward(const mxnet::OpContext &, const std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob>> &, const std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob>> &, const std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob>> &, const std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType>> &, const std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob>> &, const std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob>> &) [with DType=mshadow::half::half_t]"
(44): here
ceolium commented 7 years ago

src/operator/./cudnn_convolution-inl.h(286): error: too few arguments in function call

1 error detected in the compilation of "/tmp/tmpxft_00007035_00000000-5_convolution_v1.cpp4.ii".
make: *** [build/src/operator/convolution_v1_gpu.o] Error 2
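
(A "too few arguments in function call" error inside cudnn_convolution-inl.h usually means the installed cuDNN headers are newer than the ones this MXNet revision was written against; cuDNN 6 changed several function signatures. A quick way to see which cuDNN version the headers declare, assuming the default install path:)

# Print the cuDNN version macros from the installed header.
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
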
thirdwing commented 7 years ago

Can you tell me which AWS instance you used?

ceolium commented 7 years ago

It's ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20160627 (ami-2d39803a) [type g2.8xlarge]
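
(For reference, a g2.8xlarge exposes four NVIDIA GRID K520 GPUs, which matches the n.gpu <- 4 setting in the script above. What the driver actually sees can be confirmed with:)

# List the GPUs visible to the NVIDIA driver.
nvidia-smi -L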

ceolium commented 7 years ago

It seems like the code is using GPU resources, because when I run 'nvidia-smi' on the server it shows the GPUs in use. But running it on another, properly working Ubuntu server (also 14.04 LTS) shows messages like the ones below:

           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  1459016  78.0    2637877 140.9  1459016  78.0
Vcells 14991815 114.4   22282032 170.0 14991815 114.4
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[11:31:31] /home/mining/mxnet/dmlc-core/include/dmlc/logging.h:235: [11:31:31] src/operator/./convolution-inl.h:370: Check failed: ksizey <= dshape[2] + 2 * param.pad[0] && ksizex <= dshape[3] + 2 * param.pad[1] kernel size exceed input

thirdwing commented 7 years ago

I am closing this since it has been inactive for quite a while. Feel free to reopen if necessary.