Closed: TriLoo closed this issue 4 years ago.
Should this bug be in the gluon-cv issue tracker?
cc @zhreshold
@leezu I tried Makefile & make and found no problem running the mentioned multiprocessing.Pool(...). However, a build error happened when USE_INT64_TENSOR_SIZE = 1 is enabled in config.mk; the error info is:
In file included from cpp-package/include/mxnet-cpp/optimizer.hpp:37:0,
from cpp-package/include/mxnet-cpp/MxNetCpp.h:35,
from cpp-package/example/mlp.cpp:26:
cpp-package/include/mxnet-cpp/op.h:3511:22: error: 'begin' has not been declared
begin,
^~~~~
cpp-package/include/mxnet-cpp/op.h:3512:22: error: 'end' has not been declared
end,
^~~
cpp-package/include/mxnet-cpp/op.h:3513:22: error: 'step' has not been declared
step = Shape()) {
^~~~
cpp-package/include/mxnet-cpp/op.h:3513:29: error: could not convert 'mxnet::cpp::Shape()' from 'mxnet::cpp::Shape' to 'int'
step = Shape()) {
^~~~~~~
cpp-package/include/mxnet-cpp/op.h: In function 'mxnet::cpp::Symbol mxnet::cpp::slice(const string&, mxnet::cpp::Symbol, int, int, int)':
cpp-package/include/mxnet-cpp/op.h:3515:31: error: 'begin' was not declared in this scope
.SetParam("begin", begin)
...
I'm going to check the flag difference between config.mk & Makefile and CMakeLists.txt, or instead debug the 'begin', 'end' not declared error.
You can also try deleting 3rdparty/openmp and using the cmake build.
I will try this later ~
I used the system OpenMP instead of 3rdparty/openmp, but the problem still exists. When the program runs into multiprocessing.Pool(...), the CPU usage is at 100% but execution never continues. @leezu
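For illustration, here is a minimal sketch of the situation being described. It is not code from the issue; it assumes the hang is triggered when a multiprocessing.Pool is created after MXNet (built with the problematic OpenMP configuration) has already initialized its engine threads:

```python
# Hypothetical minimal repro sketch: run one MXNet operation first so the
# engine / OpenMP runtime is initialized, then fork a worker pool. This is
# roughly what the Gluon DataLoader does when num_workers > 0.
import multiprocessing
import mxnet as mx

def square(x):
    return x * x

if __name__ == "__main__":
    a = mx.nd.ones((1024, 1024))
    (a * 2).wait_to_read()

    # On an affected build this call reportedly never returns; CPU stays at 100%.
    with multiprocessing.Pool(4) as pool:
        print(pool.map(square, range(8)))
```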
Thanks for confirming this. You mentioned before that the bug does not exist when compiling with Makefile. Are you certain about that? If so, can you provide a minimal reproducible example?
Sure, when I compile it using the Makefile, this bug doesn't exist, but USE_INT64_TENSOR_SIZE = 1 raises the 'begin', 'end' not declared error; I found that the types of these variables are indeed missing from the corresponding function declarations. According to the CMakeLists.txt, the system OpenMP is used by default when MKL is used as the BLAS library, here: openmp-mkl enabled
Steps to reproduce: I downloaded the latest source code from the master branch, compiled it with cmake 3.16.6 and g++ 7.3.1, installed the latest gluon-cv package, and then training YOLOv3 with random input shape enabled triggers this bug. I can share my CMakeLists.txt with you if you need it. Let me know if there is any other info I can provide ~
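To make the failing code path concrete, below is a minimal sketch (assumed, not the actual gluon-cv training code) of what the YOLOv3 script does internally: a Gluon DataLoader with num_workers > 0 forks a pool of worker processes, which is where the hang is observed. Dataset contents and shapes are placeholders:

```python
# Hypothetical DataLoader sketch; the arrays here stand in for the VOC pipeline.
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(mx.nd.random.uniform(shape=(64, 3, 64, 64)),
                       mx.nd.zeros((64, 1)))

# num_workers > 0 makes the DataLoader create a multiprocessing pool of
# worker processes to prefetch batches.
loader = DataLoader(dataset, batch_size=8, num_workers=4)

for data, label in loader:
    print(data.shape, label.shape)
    break
```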
I found that the shared lib libmxnet.so generated by CMakeLists.txt does not depend on libiomp5.so, but only on libgomp.so. Meanwhile, the shared lib generated by the Makefile depends on both libiomp and libgomp, but not on libmkl_rt.so, although I set USE_MKL=mkl in config.mk. I am not sure whether this causes the dataloader difference. @leezu
I suspect you're looking at transitive dependencies instead of the actual dependencies of libmxnet. Please use readelf -d libmxnet.so | grep NEEDED to check the direct dependencies.
For cmake, if you'd like to try without MKL, be sure to call cmake with cmake -DUSE_MKL_IF_AVAILABLE=0
The readelf -d libmxnet.so | grep NEEDED output:
CMake:
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libmkl_rt.so]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libopencv_highgui.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_videoio.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_imgcodecs.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_imgproc.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_core.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [liblapack.so.3]
0x0000000000000001 (NEEDED) Shared library: [libcudnn.so.7]
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcufft.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcublas.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcusolver.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcusparse.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcurand.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1]
0x0000000000000001 (NEEDED) Shared library: [libnvidia-ml.so.1]
0x0000000000000001 (NEEDED) Shared library: [libnvToolsExt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libgomp.so.1]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
Makefile:
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcublas.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcurand.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcusolver.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [librt.so.1]
0x0000000000000001 (NEEDED) Shared library: [libopencv_imgcodecs.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_highgui.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_imgproc.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [libopencv_core.so.3.4]
0x0000000000000001 (NEEDED) Shared library: [liblapack.so.3]
0x0000000000000001 (NEEDED) Shared library: [libcudnn.so.7]
0x0000000000000001 (NEEDED) Shared library: [libcufft.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1]
0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.10.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgomp.so.1]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2]
Actually, I want to use MKL, but it looks like the Makefile build did not pick it up. I am going to enable libiomp5.so to check whether it is the source of the problem.
@TriLoo I found that the issue is not reproducible for me, as I am able to get past the pool creation code. However, due to the latest weakref change (https://github.com/apache/incubator-mxnet/pull/18328), the run fails later with the TypeError shown below.
I built from source using cmake, and from the attached log you will notice that training already starts without getting stuck:
[23:45:38] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
INFO:root:Namespace(amp=False, batch_size=64, data_shape=416, dataset='voc', epochs=200, gpus='0,1,2,3,4,5,6,7', horovod=False, label_smooth=False, log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,180', lr_decay_period=0, lr_mode='step', mixup=False, momentum=0.9, network='darknet53', no_mixup_epochs=20, no_random_shape=False, no_wd=False, num_samples=16551, num_workers=16, resume='', save_interval=10, save_prefix='yolo3_darknet53_voc', seed=233, start_epoch=0, syncbn=True, val_interval=1, warmup_epochs=4, warmup_lr=0.0, wd=0.0005)
INFO:root:Start training from [Epoch 0]
Traceback (most recent call last):
File "train_yolo3.py", line 374, in <module>
train(net, train_data, val_data, eval_metric, ctx, args)
File "train_yolo3.py", line 270, in train
for i, batch in enumerate(train_data):
File "/home/ubuntu/mxnet/python/mxnet/gluon/data/dataloader.py", line 485, in __next__
batch = pickle.loads(ret.get(self._timeout))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle weakref objects
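As a side note on the TypeError above, here is a small self-contained illustration (not code from MXNet; the Dummy class is made up) of why it occurs: objects holding weakrefs cannot be pickled, so they cannot be submitted to a multiprocessing.Pool worker:

```python
# Standalone illustration of the pickling failure in the traceback above.
import pickle
import weakref

class Dummy:
    pass

obj = Dummy()
holder = {"ref": weakref.ref(obj)}  # any object containing a weakref

try:
    pickle.dumps(holder)
except TypeError as err:
    # Python 3.6 reports: "can't pickle weakref objects"
    print(err)
```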
Thanks for your reply @zhreshold. I tried again and the problem on my machine still exists, though libiomp5 is added. The training does not actually start in my case.
I will close this issue temporarily and add details if I find out the cause of the problem on my machine ~
It looks like some erroneous modifications to FindMKL.cmake and the missing libiomp5 caused this error; it works now.
Description
I installed mxnet using cmake, following the tutorial: get started-mx. However, the program cannot continue execution when running the following code (the code is excerpted from here - yolov3).
Error Message
No error message; the process keeps running but cannot step out of the above Pool() function.
To Reproduce
Steps to reproduce:
1. CUDA, MKLDNN, OpenMP enabled
2. train_yolov3.sh from gluon-cv
What have you tried to solve it?
1. pip install mxnet-cu100mkl can step out of multiprocessing.Pool() and continue execution.
2. cmake and pip install --user -e ., then it cannot step out.
3. cmake, the Dataloader can work with num_workers>0, but its derived class cannot.
Environment