apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

train_cifar10.py hangs on first epoch in debug mode (4 P100 GPUs) #10123

Closed: pakmarkthub closed this issue 5 years ago

pakmarkthub commented 6 years ago

Description

train_cifar10.py hangs before printing the validation accuracy for the first epoch when using four P100 GPUs, either when MXNet is compiled with DEBUG=1 or when it is run with MXNET_ENGINE_TYPE=NaiveEngine (recompiled with DEBUG=0 in the latter case).

Environment info (Required)

----------Python Info----------
('Version :', '2.7.5')
('Compiler :', 'GCC 4.8.5 20150623 (Red Hat 4.8.5-16)')
('Build :', ('default', 'Aug 4 2017 00:39:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '9.0.1')
('Directory :', '/usr/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version :', '1.1.0')
('Directory :', '/home/cc/Projects/uvm-reduce/experiments/mxnet/original/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform :', 'Linux-3.10.0-514.6.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core')
('system :', 'Linux')
('node :', 'ftg-p100.novalocal')
('release :', '3.10.0-514.6.1.el7.x86_64')
('version :', '#1 SMP Wed Jan 18 13:06:36 UTC 2017')
----------Hardware Info----------
('machine :', 'x86_64')
('processor :', 'x86_64')
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping: 1
CPU MHz: 2499.921
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 3999.88
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0129 sec, LOAD: 0.7237 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0127 sec, LOAD: 0.2568 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0640 sec, LOAD: 0.2466 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0161 sec, LOAD: 0.0744 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0646 sec, LOAD: 0.2254 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0478 sec, LOAD: 0.4038 sec.
-----------CUDA----------------
CUDA Driver Version / Runtime Version: 9.1 / 9.1
Driver version: 390.30
GPUs: four Tesla P100-SXM2-16GB GPUs
Peer access: Yes (all GPU pairs)
-----------OpenCv--------------
Version: 3.4.1
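(For reference, the block above is the output of MXNet's diagnosis script. A minimal sketch of regenerating it from a source checkout, assuming the script is still shipped as tools/diagnose.py:)

import subprocess

# Sketch only: run MXNet's diagnosis script from the repository root.
# Assumes the checkout still contains tools/diagnose.py.
subprocess.check_call(["python", "tools/diagnose.py"])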

Package used (Python/R/Scala/Julia): Python 2.7.5

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash: 07a83a0325a3d782513a04f47d711710972cb144 (tag 1.1.0) and f7fa49e3068ab30a7379bbce1b288bf285fe5cde (master)

Build config:

#---------------------
# choice of compiler
#--------------------

export CC = gcc
export CXX = g++
export NVCC = nvcc

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 1

# whether compile with profiler
USE_PROFILER =

# whether to turn on segfault signal handler to log the stack trace
USE_SIGNAL_HANDLER =

# the additional link flags you want to add
ADD_LDFLAGS = -lopencv_core -lopencv_imgcodecs -lopencv_imgproc

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 1

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = NONE

# whether to enable CUDA runtime compilation
ENABLE_CUDA_RTC = 1

# whether use CuDNN R3 library
USE_CUDNN = 1

#whether to use NCCL library
USE_NCCL = 1
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 1

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# whether use MKL-DNN library
USE_MKLDNN = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 1

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
USE_SSE=0
else
USE_SSE=1
endif

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 1

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 1

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

Error Message:

(No error message; the program just hangs. The output up to the hang is below.)

INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_train_idx='', data_val='data/cifar10_val.rec', data_val_idx='', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0,1,2,3', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet', num_classes=10, num_epochs=300, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[16:30:36] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_train.rec, use 4 threads for decoding..
[16:30:42] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar10_val.rec, use 4 threads for decoding..
[16:30:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 629.17 samples/sec accuracy=0.120536
INFO:root:Epoch[0] Batch [40] Speed: 630.63 samples/sec accuracy=0.172656
INFO:root:Epoch[0] Batch [60] Speed: 645.72 samples/sec accuracy=0.205859
INFO:root:Epoch[0] Batch [80] Speed: 642.09 samples/sec accuracy=0.224219
INFO:root:Epoch[0] Batch [100] Speed: 634.74 samples/sec accuracy=0.250781
INFO:root:Epoch[0] Batch [120] Speed: 628.02 samples/sec accuracy=0.264453
INFO:root:Epoch[0] Batch [140] Speed: 629.10 samples/sec accuracy=0.288281
INFO:root:Epoch[0] Batch [160] Speed: 632.72 samples/sec accuracy=0.299609
INFO:root:Epoch[0] Batch [180] Speed: 636.01 samples/sec accuracy=0.320312
INFO:root:Epoch[0] Batch [200] Speed: 640.49 samples/sec accuracy=0.337891
INFO:root:Epoch[0] Batch [220] Speed: 640.68 samples/sec accuracy=0.353906
INFO:root:Epoch[0] Batch [240] Speed: 639.87 samples/sec accuracy=0.362109
INFO:root:Epoch[0] Batch [260] Speed: 637.37 samples/sec accuracy=0.370312
INFO:root:Epoch[0] Batch [280] Speed: 638.09 samples/sec accuracy=0.376953
INFO:root:Epoch[0] Batch [300] Speed: 635.81 samples/sec accuracy=0.395313
INFO:root:Epoch[0] Batch [320] Speed: 638.06 samples/sec accuracy=0.405078
INFO:root:Epoch[0] Batch [340] Speed: 638.94 samples/sec accuracy=0.417969
INFO:root:Epoch[0] Batch [360] Speed: 639.11 samples/sec accuracy=0.430469
INFO:root:Epoch[0] Batch [380] Speed: 635.67 samples/sec accuracy=0.446484
INFO:root:Epoch[0] Train-accuracy=0.450781
INFO:root:Epoch[0] Time cost=80.754
----------- Hang up here -------------------------
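(Aside, not something tried in the report: the cuDNN message in the log above notes that the autotuning pass can be disabled via MXNET_CUDNN_AUTOTUNE_DEFAULT. A small sketch of doing that from Python, mainly to rule the autotune search out as a source of stalls; setting the variable before importing mxnet is the safe order:)

import os

# Sketch only: disable the cuDNN convolution autotune search, as the log line suggests.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

import mxnet as mx  # imported after the variable is set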

(Stack trace from gdb after the hang)

#0  0x00007ffff77fe945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fffe8a29a6c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007fff137c4435 in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::__lambda21>(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::__lambda21) (this=0x7ffebc001198, lock=..., __p=...)
    at /usr/include/c++/4.8.2/condition_variable:93
#3  0x00007fff137c3fa2 in mxnet::engine::ThreadedEngine::WaitForVar (this=0x7ffebc001150, var=0xc9d9e4c8) at src/engine/threaded_engine.cc:384
#4  0x00007fff13034e5e in mxnet::NDArray::WaitToWrite (this=0xc9bc74f0) at include/mxnet/./ndarray.h:361
#5  0x00007fff131b5f50 in mxnet::NDArray::SyncCopyToCPU (this=0xc9bc74f0, data=0xc88bde50, size=32) at src/ndarray/ndarray.cc:1290
#6  0x00007fff13809ac9 in MXNDArraySyncCopyToCPU (handle=0xc9bc74f0, data=0xc88bde50, size=32) at src/c_api/c_api.cc:261
#7  0x00007fffeee8fdcc in ffi_call_unix64 () from /lib64/libffi.so.6
#8  0x00007fffeee8f6f5 in ffi_call () from /lib64/libffi.so.6
#9  0x00007fffef0a2c8b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00007fffef09ca85 in PyCFuncPtr_call () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#11 0x00007ffff7a5a9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#12 0x00007ffff7aef0f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#13 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#14 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#15 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#16 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#17 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#18 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#19 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#20 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#21 0x00007ffff7af357d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00007ffff7af357d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#23 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#24 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#25 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#26 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#27 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#28 0x00007ffff7af33fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#29 0x00007ffff7af5efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#30 0x00007ffff7af6002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#31 0x00007ffff7b0f43f in run_mod () from /lib64/libpython2.7.so.1.0
#32 0x00007ffff7b105fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#33 0x00007ffff7b11889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#34 0x00007ffff7b22a3f in Py_Main () from /lib64/libpython2.7.so.1.0
#35 0x00007ffff6d48c05 in __libc_start_main () from /lib64/libc.so.6
#36 0x000000000040071e in _start ()
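(The frames above show the main Python thread blocked in ThreadedEngine::WaitForVar, reached from MXNDArraySyncCopyToCPU via NDArray::WaitToWrite, i.e. a host-side read of an NDArray whose pending engine operation never completes. A hypothetical sketch, not taken from the issue, of the kind of call that lands in this code path:)

import mxnet as mx

# Hypothetical illustration: reading an NDArray back to the host goes through
# MXNDArraySyncCopyToCPU, which waits on the array's engine dependency
# (the WaitToWrite / WaitForVar frames in the backtrace) before copying.
x = mx.nd.ones((128, 10), ctx=mx.gpu(0))
y = (x * 2).copyto(mx.gpu(1))   # scheduled asynchronously on the engine
print(y.asnumpy()[0, :3])       # blocks here until the pending write on y completes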

Minimum reproducible example

/incubator-mxnet/example/image-classification/train_cifar10.py

Steps to reproduce

Case 1:

  1. Compile mxnet with the above config.mk (DEBUG=1).
  2. Run python train_cifar10.py --gpu 0,1,2,3
  3. The program hangs before printing "Validation-accuracy" for epoch 0.

Case 2:

  1. Compile mxnet with the above config.mk (DEBUG=0).
  2. Run MXNET_ENGINE_TYPE=NaiveEngine python train_cifar10.py --gpu 0,1,2,3 (a Python-side equivalent is sketched after this list).
  3. The program hangs before printing "Validation-accuracy" for epoch 0.
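(The sketch below is my own Python-side equivalent of the environment setting in Case 2, not part of the report; the engine is selected when the library initializes, so the variable needs to be in the environment before mxnet is imported:)

import os

# Sketch only: select the fully synchronous NaiveEngine before importing mxnet.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx           # engine type is picked up during initialization
print(mx.nd.ones((2, 2)))    # every operation now runs synchronously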

What have you tried to solve it?

  1. Changed the kv-store to nccl by passing --kv-store=nccl to train_cifar10.py <== Not helpful
  2. Tried checking out tag 1.1.0 and master (see commit hashes above) <== Not helpful
  3. Reinstalled CUDA 9.1 <== Not helpful
  4. Downgraded the NVIDIA driver to version 387.26 <== Hung randomly after a few epochs
  5. Searched through the issues on GitHub. Found some similar problems but no solution (they were expected to be solved by the master branch).

hetong007 commented 6 years ago

Can you try the train_cifar10.py script at: http://gluon-cv.mxnet.io/model_zoo/index.html#image-classification

piyushghai commented 6 years ago

@pakmarkthub Were you able to try @hetong007's suggestions? Do you still see the hanging issue while training the CIFAR-10 model?

kalyc commented 5 years ago

Hi @pakmarkthub, I used MXNet v1.1.0 on my machine and was unable to reproduce the issue: train_cifar10.py ran without any problem, printed the validation accuracy on epoch 0, and did not hang. Could you please verify whether the issue still persists at your end?

kalyc commented 5 years ago

@sandeep-krishnamurthy could you close this issue? @pakmarkthub please feel free to re-open or comment if the issue still persists at your end.

sandeep-krishnamurthy commented 5 years ago

@kalyc - Does this work with latest (v1.3)?

kalyc commented 5 years ago

This might be a related issue (RecordIO input split): https://github.com/apache/incubator-mxnet/issues/12800, and it has been resolved.

kalyc commented 5 years ago

On v1.3 here's what I got:

  File "example/image-classification/train_cifar10.py", line 76, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "/home/ubuntu/incubator-mxnet/example/image-classification/common/fit.py", line 220, in fit
    lr, lr_scheduler = _get_lr_scheduler(args, kv)
  File "/home/ubuntu/incubator-mxnet/example/image-classification/common/fit.py", line 53, in _get_lr_scheduler
    base_lr=args.lr))
TypeError: __init__() got an unexpected keyword argument 'base_lr'
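(For context, my reading rather than something stated in the thread: the TypeError suggests a mismatch between the example scripts and the installed library, with fit.py passing a base_lr keyword that the learning-rate scheduler in that v1.3 build does not accept. A hedged sketch of constructing the scheduler without the keyword, assuming it is mx.lr_scheduler.MultiFactorScheduler being built from lr_step_epochs='200,250':)

import mxnet as mx

# Assumption: fit.py is building a MultiFactorScheduler from --lr-step-epochs.
# In builds whose scheduler has no base_lr parameter, the kwarg must be dropped
# and the base learning rate passed to the optimizer instead.
epoch_size = 50000 // 128                   # hypothetical: num_examples / batch_size
steps = [200 * epoch_size, 250 * epoch_size]
sched = mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=0.1)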