apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

mx.gluon.data.Dataset class does not support multiprocessing.Pool() #18364

Closed: TriLoo closed this issue 4 years ago

TriLoo commented 4 years ago

Description

I built and installed mxnet from source using cmake, following the tutorial: get started-mx. However, the program cannot continue execution when it reaches the following code:

if self._num_workers > 0:
    self._worker_pool = multiprocessing.Pool(
        self._num_workers, initializer=_worker_initializer, initargs=[self._dataset])

The code above is excerpted from here - yolov3
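For context, here is a minimal standalone sketch (not from the issue; it assumes the internal helper mxnet.gluon.data.dataloader._worker_initializer still takes a single dataset argument, as in the snippet above) that mimics this pool-creation step in isolation. If it also hangs, the problem is independent of the gluon-cv training script:

import multiprocessing

import mxnet as mx
from mxnet.gluon.data import dataloader

def main():
    # A trivial dataset stands in for the one the DataLoader would receive.
    dataset = mx.gluon.data.ArrayDataset(mx.nd.arange(100))
    # Same pattern as the snippet above: a pool with a per-worker initializer.
    pool = multiprocessing.Pool(
        4, initializer=dataloader._worker_initializer, initargs=[dataset])
    print('pool created')   # never printed if Pool() deadlocks
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()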

Error Message

There is no error message; the process keeps running but never returns from the Pool() call above.

To Reproduce

Steps to reproduce

  1. build the latest mxnet using cmake, with CUDA, MKLDNN, OpenMP enabled
  2. run the train_yolov3.sh from gluon-cv

What have you tried to solve it?

  1. The mxnet installed via pip install mxnet-cu100mkl steps out of multiprocessing.Pool() and continues execution.
  2. The mxnet built with cmake and installed via pip install --user -e . cannot step out of it.
  3. With the cmake build, the stock DataLoader works with num_workers > 0, but its derived class does not (a quick diagnostic sketch follows below).
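A hypothetical diagnostic, not taken from the issue: since Pool() ships self._dataset to every worker via the initializer, checking that the dataset used in training survives a round trip through pickle can help narrow down whether the hang is related to the dataset object itself (ArrayDataset below is only a stand-in for the real dataset):

import pickle

import mxnet as mx

# Substitute the dataset instance the training script actually passes to the
# DataLoader; ArrayDataset is only a placeholder here.
dataset = mx.gluon.data.ArrayDataset(mx.nd.arange(10))

try:
    restored = pickle.loads(pickle.dumps(dataset))
    print('dataset pickles cleanly:', len(restored), 'samples')
except TypeError as err:
    print('dataset cannot be sent to workers:', err)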

Environment

----------Python Info----------
Version      : 3.6.8
Compiler     : GCC 7.3.0
Build        : ('default', 'Dec 30 2018 01:22:34')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.1.1
Directory    : /search/odin/songminghui/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 2.0.0
Directory    : /search/odin/songminghui/githubs/incubator-mxnet/python/mxnet
Num GPUs     : 8
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-3.10.0-327.el7.x86_64-x86_64-with-centos-7.2.1511-Core
system       : Linux
node         : nmyjs_176_61
release      : 3.10.0-327.el7.x86_64
version      : #1 SMP Thu Nov 19 22:10:57 UTC 2015
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2195.104
BogoMIPS:              4398.47
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47
----------Network Test----------
Setting timeout: 10
Error open MXNet: https://github.com/apache/incubator-mxnet, <urlopen error timed out>, DNS finished in 1.2967002391815186 sec.
Error open GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, <urlopen error timed out>, DNS finished in 4.1484832763671875e-05 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 1.1775 sec, LOAD: 3.5363 sec.
Timing for D2L: http://d2l.ai, DNS: 0.4321 sec, LOAD: 0.8394 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.3948 sec, LOAD: 1.0763 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.4420 sec, LOAD: 2.9380 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.1293 sec, LOAD: 13.5263 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.47435927391052246 sec.
leezu commented 4 years ago

Should this bug be in the gluon-cv issue tracker?

cc @zhreshold

TriLoo commented 4 years ago

@leezu I tried the Makefile-based build (make) and found no problem running the mentioned multiprocessing.Pool(...). However, a build error occurs when USE_INT64_TENSOR_SIZE = 1 is enabled in config.mk; the error is:

In file included from cpp-package/include/mxnet-cpp/optimizer.hpp:37:0,
                 from cpp-package/include/mxnet-cpp/MxNetCpp.h:35,
                 from cpp-package/example/mlp.cpp:26:
cpp-package/include/mxnet-cpp/op.h:3511:22: error: 'begin' has not been declared
                      begin,
                      ^~~~~
cpp-package/include/mxnet-cpp/op.h:3512:22: error: 'end' has not been declared
                      end,
                      ^~~
cpp-package/include/mxnet-cpp/op.h:3513:22: error: 'step' has not been declared
                      step = Shape()) {
                      ^~~~
cpp-package/include/mxnet-cpp/op.h:3513:29: error: could not convert 'mxnet::cpp::Shape()' from 'mxnet::cpp::Shape' to 'int'
                      step = Shape()) {
                             ^~~~~~~
cpp-package/include/mxnet-cpp/op.h: In function 'mxnet::cpp::Symbol mxnet::cpp::slice(const string&, mxnet::cpp::Symbol, int, int, int)':
cpp-package/include/mxnet-cpp/op.h:3515:31: error: 'begin' was not declared in this scope
            .SetParam("begin", begin)
...

I'm going to check the flag differences between config.mk / Makefile and CMakeLists.txt, or instead debug the 'begin', 'end' not declared error.

leezu commented 4 years ago

You can also try deleting 3rdparty/openmp and using the cmake build.

TriLoo commented 4 years ago

I will try this later ~

TriLoo commented 4 years ago

I used the system openmp instead of 3rdparty/openmp, and the problem still exists. When it runs into multiprocessing.Pool(...), CPU usage is at 100% but execution never continues. @leezu

leezu commented 4 years ago

Thanks for confirming this. You mentioned before that the bug does not exist when compiling with Makefile. Are you certain about that? If so, can you provide a minimal reproducible example?

TriLoo commented 4 years ago

Sure, when I compile it using the Makefile, this bug doesn't exist, but USE_INT64_TENSOR_SIZE = 1 raises the 'begin', 'end' not declared error; I found that the types of these variables are indeed missing from the corresponding function declaration.

According to CMakeLists.txt, the system openmp is used by default when MKL is used as the BLAS library, here: openmp-mkl enabled

Steps to reproduce: I downloaded the latest source code from the master branch, compiled it with cmake 3.16.6 and g++ 7.3.1, and installed the latest gluon-cv package; training YOLOv3 with random input shape enabled then triggers this bug. I can share my CMakeLists.txt with you if you need it. Let me know if there is any other info I can provide ~

TriLoo commented 4 years ago

I found that the shared lib libmxnet.so generated by CMakeLists.txt does not depend on libiomp5.so, only on libgomp.so. Meanwhile, the shared lib generated by the Makefile depends on both libiomp and libgomp, but not on libmkl_rt.so, although I set USE_MKL=mkl in config.mk. I am not sure whether this causes the dataloader difference. @leezu

leezu commented 4 years ago

I suspect you're looking at transitive dependencies instead of the actual dependencies of libmxnet. Please use readelf -d libmxnet.so | grep NEEDED to check the direct dependencies.

For cmake, if you'd like to try without mkl, be sure to call cmake with cmake -DUSE_MKL_IF_AVAILABLE=0.

TriLoo commented 4 years ago

The output of readelf -d libmxnet.so | grep NEEDED is as follows. CMake build:

 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libmkl_rt.so]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_highgui.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_videoio.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_imgcodecs.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_imgproc.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_core.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [liblapack.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libcudnn.so.7]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcufft.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcusolver.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcurand.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libnvidia-ml.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libnvToolsExt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgomp.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]

Makefile build:

0x0000000000000001 (NEEDED)              Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcurand.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcusolver.so.10.0]
 0x0000000000000001 (NEEDED)             ==Shared library: [libiomp5.so]==
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_imgcodecs.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_highgui.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_imgproc.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [libopencv_core.so.3.4]
 0x0000000000000001 (NEEDED)             Shared library: [liblapack.so.3]
 0x0000000000000001 (NEEDED)             Shared library: [libcudnn.so.7]
 0x0000000000000001 (NEEDED)             Shared library: [libcufft.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libnvrtc.so.10.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgomp.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]

Actually, I want to use MKL, but it looks like the Makefile build did not pick it up. I am going to enable libiomp5.so to check whether it is the source of the problem.

zhreshold commented 4 years ago

@TriLoo I found that the issue is not reproducible on my side, as I am able to pass through the pool creation code. However, due to the latest weakref change (https://github.com/apache/incubator-mxnet/pull/18328), a different error is raised (see the traceback below).

I built from source using cmake, and from the attached log you will notice that the training already starts without getting stuck:

[23:45:38] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
INFO:root:Namespace(amp=False, batch_size=64, data_shape=416, dataset='voc', epochs=200, gpus='0,1,2,3,4,5,6,7', horovod=False, label_smooth=False, log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,180', lr_decay_period=0, lr_mode='step', mixup=False, momentum=0.9, network='darknet53', no_mixup_epochs=20, no_random_shape=False, no_wd=False, num_samples=16551, num_workers=16, resume='', save_interval=10, save_prefix='yolo3_darknet53_voc', seed=233, start_epoch=0, syncbn=True, val_interval=1, warmup_epochs=4, warmup_lr=0.0, wd=0.0005)
INFO:root:Start training from [Epoch 0]
Traceback (most recent call last):
  File "train_yolo3.py", line 374, in <module>
    train(net, train_data, val_data, eval_metric, ctx, args)
  File "train_yolo3.py", line 270, in train
    for i, batch in enumerate(train_data):
  File "/home/ubuntu/mxnet/python/mxnet/gluon/data/dataloader.py", line 485, in __next__
    batch = pickle.loads(ret.get(self._timeout))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: can't pickle weakref objects
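A small illustration, not part of the original report: Python's pickle refuses weakref objects, so any dataset or transform holding a weak reference fails exactly like the traceback above when the DataLoader forwards it to the worker pool.

import pickle
import weakref

class Holder:
    pass

obj = Holder()
obj.ref = weakref.ref(obj)   # attribute holding a weak reference

try:
    pickle.dumps(obj)
except TypeError as err:
    print(err)   # "can't pickle weakref objects" on Python 3.6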
TriLoo commented 4 years ago

Thanks for your reply @zhreshold. I tried again and the problem on my machine still exists, even though libiomp5 is now linked. The training actually does not start in my case. I will close this issue temporarily and add details if I find out the cause of the problem on my machine ~

TriLoo commented 4 years ago

Looks like some erroneous modifications to FindMKL.cmake and the missing libiomp5 dependency caused this error; it works now.