dmlc / MXNet.jl

MXNet Julia Package - flexible and efficient deep learning in Julia

Segfault running MNIST example lenet-stn.jl #369

Open rickhg12hs opened 6 years ago

rickhg12hs commented 6 years ago

lenet.jl example seems to run OK, but lenet-stn.jl segfaults.

$ julia -e 'versioninfo(); include("lenet-stn.jl")'
Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 13 MB allocated on CPU0
INFO: Start training...

signal (11): Segmentation fault
while loading /home/rick/tmp/mnist/MXNet/lenet-stn.jl, in expression starting on line 64
Segmentation fault (core dumped)
phinzphinz commented 6 years ago

I have exactly the same problem with this versioninfo():

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

It always segfaults when I start training; it is not a problem with the lenet-stn.jl file itself. THIS IS REALLY ANNOYING. I have tried everything for the whole weekend: installing MXNet.jl both ways, reinstalling Julia, compiling Julia from source, recompiling incubator-mxnet MANY times with different configurations, and even reinstalling my Debian twice (a fresh install wiping the whole SSD), but the problem is still there.

My Debian version is 9.3. I installed CUDA with the runfile cuda_9.0.176_384.81_linux.run and cuDNN with both methods (1: using the three .deb files from NVIDIA, and 2: copying the relevant files from cudnn-9.0-linux-x64-v7.tgz as described here). The CUDA samples work without any problems, and the cuDNN samples work too, so I think I have set up CUDA and cuDNN correctly. I have a ZOTAC 1080 Ti GPU, but I have not been able to use it yet because of this problem :( . I have tried so many things that I cannot say this for sure, but I think the problem was not there when I disabled CUDA in the incubator-mxnet/make/config.mk file! So I think it has something to do with CUDA support: if libmxnet.so is built with CUDA support, it segfaults. My last try was this config.mk for incubator-mxnet:


#-------------------------------------------------------------------------------
#  Template configuration for compiling mxnet
#
#  If you want to change the configuration, please use the following
#  steps. Assume you are on the root directory of mxnet. First copy this
#  file so that any local changes will be ignored by git
#
#  $ cp make/config.mk .
#
#  Next modify the according entries, and then compile by
#
#  $ make
#
#  or build in parallel with 8 threads
#
#  $ make -j8
#-------------------------------------------------------------------------------

#---------------------
# choice of compiler
#--------------------

export CC = gcc
export CXX = g++
export NVCC = nvcc

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 1

# whether compile with profiler
USE_PROFILER =

# whether to turn on signal handler (e.g. segfault logger)
USE_SIGNAL_HANDLER = 1

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 1

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = /usr/local/cuda-9.0/

# whether use CuDNN R3 library
USE_CUDNN = 0

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it; however, you will not be able to use
# the imbin iterator
USE_OPENCV = 0

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# MKL ML Library for Intel CPU/Xeon Phi
# Please refer to MKL_README.md for details

# MKL ML Library folder, need to be root for /usr/local
# Change to User Home directory for standard user
# For USE_BLAS!=mkl only
MKLML_ROOT=/usr/local

# whether use MKL2017 library
USE_MKL2017 = 0

# whether use MKL2017 experimental feature for high performance
# Prerequisite USE_MKL2017=1
USE_MKL2017_EXPERIMENTAL = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 0

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# by default, disable lapack when using MKL
# switch on when there is a full installation of MKL available (not just MKL2017/MKL_ML)
ifeq ($(USE_BLAS), mkl)
USE_LAPACK = 0
endif

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_MKL2017), 0)
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
endif
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
    USE_SSE=0
else
    USE_SSE=1
endif

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 0

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 0

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# whether to use torch integration. This requires installing torch.
# You also need to add TORCH_PATH/install/lib to your LD_LIBRARY_PATH
# TORCH_PATH = $(HOME)/torch
# MXNET_PLUGINS += plugin/torch/torch.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

And it gives a bit more info about the segfault. Running julia lenet-stn.jl returns:

--2017-12-10 17:29:38--  http://data.mxnet.io/mxnet/data/mnist.zip
Resolving data.mxnet.io (data.mxnet.io)... 54.208.175.7
Connecting to data.mxnet.io (data.mxnet.io)|54.208.175.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]
Saving to: 'mnist.zip'

mnist.zip                                            100%[=====================================================================================================================>]  11.06M   314KB/s    in 36s     

2017-12-10 17:30:14 (318 KB/s) - 'mnist.zip' saved [11595270/11595270]

Archive:  mnist.zip
  inflating: t10k-images-idx3-ubyte  
  inflating: t10k-labels-idx1-ubyte  
  inflating: train-images-idx3-ubyte  
  inflating: train-labels-idx1-ubyte  
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 13 MB allocated on CPU0
INFO: Start training...

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet15segfault_loggerEi+0x44) [0x7fe091483e1e]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x33030) [0x7fe0f2b59030]
[bt] (2) /opt/incubator-mxnet/lib/libmxnet.so(_ZN7mshadow24BilinearSamplingBackwardIfEEvRKNS_6TensorINS_3cpuELi4ET_EERKNS1_IS2_Li3ES3_EES6_S6_+0x683) [0x7fe091358084]
[bt] (3) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op20SpatialTransformerOpIN7mshadow3cpuEfE8BackwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EESD_SD_RKS8_INS_9OpReqTypeESaISE_EESD_SD_+0x403) [0x7fe0913404bf]
[bt] (4) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op13OperatorState8BackwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS6_EERKS5_INS_9OpReqTypeESaISB_EESA_+0x473) [0x7fe090d6f641]
[bt] (5) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op16LegacyOpBackwardERKNS_10OpStatePtrERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS8_EERKS7_INS_9OpReqTypeESaISD_EESC_+0x4b) [0x7fe090d695dc]
[bt] (6) /opt/incubator-mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvRKN5mxnet10OpStatePtrERKNS0_9OpContextERKSt6vectorINS0_5TBlobESaIS8_EERKS7_INS0_9OpReqTypeESaISD_EESC_EPSI_E9_M_invokeERKSt9_Any_dataS3_S6_SC_SH_SC_+0x91) [0x7fe090d74ea3]
[bt] (7) /opt/incubator-mxnet/lib/libmxnet.so(_ZNKSt8functionIFvRKN5mxnet10OpStatePtrERKNS0_9OpContextERKSt6vectorINS0_5TBlobESaIS8_EERKS7_INS0_9OpReqTypeESaISD_EESC_EEclES3_S6_SC_SH_SC_+0xa6) [0x7fe090dbc372]
[bt] (8) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet4exec23StatefulComputeExecutor3RunENS_10RunContextEb+0x91) [0x7fe09140dbdf]
[bt] (9) /opt/incubator-mxnet/lib/libmxnet.so(+0x3ced91d) [0x7fe0913f691d]

I hope that this helps to solve the problem. I am busy this week, so I can only do more tests next weekend.

phinzphinz commented 6 years ago

Furthermore, I think it is an MXNet.jl or Julia related issue, because one time (I don't remember the config.mk configuration anymore, but it was with a manual ENV["MXNET_HOME"]=... setting) it worked to some extent: I could train on the GPU, but only until I loaded using Plots. After using Plots, training behaved erratically with the SGD optimizer (the MSE() exploded after a few steps and then all weights were NA), while the Adagrad optimizer still worked normally. When I put using Plots at the beginning, Julia already complained about train_provider = mx.ArrayDataProvider(:data=>trainx, :linreg_label=>trainy, batch_size=100000, shuffle=true); I think the error was something about read-only memory. After a fresh Debian installation, it did not even work without using Plots anymore, but always segfaulted when training.
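For reference, the manual MXNET_HOME route mentioned above looks roughly like this on the Julia 0.6 / MXNet.jl setup used in this thread (a sketch; the path is only an example and should point at your own libmxnet build):

ENV["MXNET_HOME"] = "/opt/incubator-mxnet"  # directory whose lib/ contains libmxnet.so
Pkg.build("MXNet")                          # rebuild the Julia bindings against that library
# restart Julia afterwards so `using MXNet` picks up the custom build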

iblislin commented 6 years ago

Here is my gdb trace:

Thread 37 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff35a15700 (LWP 13819)]
0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
120                   *(g_input + data_index + 1) += *(grad + grad_index) * top_left_y_w
(gdb) bt
#0  0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
#1  0x00007fff83e5f18c in mxnet::op::SpatialTransformerOp<mshadow::cpu, float>::Backward (this=0x38bcd30, ctx=..., 
    out_grad=std::vector of length 1, capacity 1 = {...}, in_data=std::vector of length 2, capacity 2 = {...}, 
    out_data=std::vector of length 3, capacity 3 = {...}, req=std::vector of length 2, capacity 2 = {...}, 
    in_grad=std::vector of length 2, capacity 2 = {...}, aux_args=std::vector of length 0, capacity 0)
    at src/operator/./spatial_transformer-inl.h:136

I guess something is wrong with the shape.
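One way to sanity-check that guess from the Julia side is infer_shape on the example's lenet symbol (a sketch of mine, not run in this thread; it assumes a batch size of 100):

# static shape inference for every argument, output and auxiliary state
arg_shapes, out_shapes, aux_shapes = mx.infer_shape(lenet, data=(28, 28, 1, 100))

If the static shapes come back consistent, the problem is more likely in the runtime index computation, which the gdb session below points at.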

iblislin commented 6 years ago

(gdb) p grad
$1 = (const float *) 0x7fff251e6f90
(gdb) p top_left_y_w
$2 = 0.376614928
(gdb) p grad_index
$3 = 0
(gdb) p *(grad + grad_index)                                                                                              
$4 = 0.00177509966
(gdb) p g_input + data_index + 1
$5 = (float *) 0x80032442cf50
(gdb) p g_input
$6 = (float *) 0x7fff2442cf50
(gdb) p data_index
$7 = 4294967295

oh.. data_index is weird....
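That value is telling: 4294967295 is typemax(UInt32), i.e. a -1 that wrapped around in what is apparently an unsigned 32-bit index. A quick check of mine (outside the thread) shows the bad pointer gdb printed is exactly what such a wraparound would produce, since g_input is a float* and each element is 4 bytes:

julia> reinterpret(UInt32, Int32(-1))        # -1 stored in an unsigned 32-bit index
0xffffffff

julia> 0x00007fff2442cf50 + 4 * (UInt64(4294967295) + 1)   # g_input + (data_index + 1) floats
0x000080032442cf50

That matches the out-of-range pointer above, so the write lands about 16 GB past the buffer.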

iblislin commented 6 years ago

I can reproduce the segfault with the Python train_mnist.py script by changing the optimizer to adam, which is the same optimizer our example uses.

% ./train_mnist.py --network lenet --add_stn --optimizer adam
INFO:root:start with arguments Namespace(add_stn=True, batch_size=64, disp_batches=100, dtype='float32', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='adam', test_io=0, top_k=0, wd=0.0001)

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2559619) [0x7f642acdd619]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f645935b4b0]
[bt] (2) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2527f9d) [0x7f642acabf9d]
[bt] (3) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x252a9f6) [0x7f642acae9f6]
[bt] (4) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2203f87) [0x7f642a987f87]
[bt] (5) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fea13b) [0x7f642a76e13b]
[bt] (6) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fee562) [0x7f642a772562]
[bt] (7) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd0cbd) [0x7f642a754cbd]
[bt] (8) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd48c1) [0x7f642a7588c1]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f6454474c80]

so... simply switching to SGD makes it work:

diff --git a/examples/mnist/lenet-stn.jl b/examples/mnist/lenet-stn.jl
index 23ca9de..60f2def 100644
--- a/examples/mnist/lenet-stn.jl
+++ b/examples/mnist/lenet-stn.jl
@@ -57,6 +57,6 @@
 model = mx.FeedForward(lenet, context=mx.cpu())

 # optimizer
-optimizer = mx.ADAM(lr=0.01, weight_decay=0.00001)
+optimizer = mx.SGD(lr=0.1, momentum=.9)

 # fit parameters
rickhg12hs commented 6 years ago

So, does this mean there is something wrong in libmxnet.so?

iblislin commented 6 years ago

@rickhg12hs it seems ADAM makes some values go negative, and then libmxnet.so blows up.

iblislin commented 6 years ago

> So, does this mean there is something wrong in libmxnet.so?

well, not exactly, IMO. Maybe libmxnet should protect itself from accepting negative input, or... maybe ADAM is too aggressive in this case.
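If the "too aggressive" hypothesis is right, a gentler ADAM configuration might also avoid the blow-up (untested speculation on my part; the keywords are the ones the example already uses):

optimizer = mx.ADAM(lr=0.001, weight_decay=0.00001)  # lower learning rate than the example's 0.01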

rickhg12hs commented 6 years ago

lenet-stn.jl runs without segfaulting after the pull request edits. The accuracy after several epochs is horrible, but that is a separate issue (maybe).

iblislin commented 6 years ago

I changed momentum to 0.1 and set n_epoch=15 (early stopping as a kind of regularization), and then it works fine.

optimizer = mx.SGD(lr=0.1, momentum=.1)

iblislin commented 6 years ago

🤔 ignore my post, I'm tuning other configs.

iblislin commented 6 years ago

try this? https://github.com/dmlc/MXNet.jl/pull/371/commits/8e99fa9e22fe25cdf2b16722537371489268df1e

iblislin commented 6 years ago

got this on my machine

INFO: == Epoch 020/020 ==========
INFO: ## Training summary
INFO:           accuracy = 0.9965
INFO:               time = 5.2912 seconds
INFO: ## Validation summary
INFO:           accuracy = 0.9917
INFO: Finish training on MXNet.mx.Context[GPU0]
rickhg12hs commented 6 years ago

Using the edits in 8e99fa9e22fe25cdf2b16722537371489268df1e, I get a segfault.

$ /usr/local/src/julia/julia/julia ./lenet-stn.jl 
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 14 MB allocated on CPU0
INFO: Start training...

signal (11): Segmentation fault
while loading /home/rick/tmp/mnist/MXNet/lenet-stn.jl, in expression starting on line 70
Segmentation fault (core dumped)
iblislin commented 6 years ago

hmm, I believe it's a bug in libmxnet now. My GPU build invokes cuDNN, and it works without segfaults in all cases.
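For anyone whose CUDA/cuDNN build does work, the implication is that running the example on the GPU sidesteps the buggy CPU kernel because the cuDNN implementation is used instead. A minimal sketch of that change against the example's lenet symbol (not verified here):

model = mx.FeedForward(lenet, context=mx.gpu())  # mx.gpu(0) selects the first GPU; mx.cpu() hits the broken path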

iblislin commented 6 years ago

I reported this issue to upstream: https://github.com/apache/incubator-mxnet/issues/9050

adrianloy commented 6 years ago

I think it is a bug in the STN layer. I also had some issues with it: I train a model using the simple_bind API, and sometimes I get segfaults, sometimes not. It seems to depend on the random parameter initialization. The gdb stack trace told me it was in the BilinearSamplingBackward method, same as mentioned here before.

iblislin commented 6 years ago

@adrianloy do you have a GPU and can you try out cuDNN?