apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Custom-Op Bug when using multiple custom-ops #4521

Closed lyttonhao closed 7 years ago

lyttonhao commented 7 years ago

I found that when using multiple custom ops, the program gets stuck. It seems that the engine is suffering from a deadlock. The problem occurs when a custom op contains code like `mx.nd.xx(xx).asnumpy()`; it does not occur when using NaiveEngine.

I have written an example to reproduce this bug. You can put the file under 'example/numpy-ops' and run it from there. If line 15 is included, the program gets stuck; otherwise it works fine. (A sketch of the kind of op involved is included at the end of this comment.)

MXNet version: tested two versions.

  1. the newest master: ceb9f0187a31d528e5566f810d933cf4834d3282
  2. an older master: 01cde15b5611d3add9a103abd7979c3272693625
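A minimal sketch of the kind of custom op involved, written against the mx.operator.CustomOp API. This is an illustrative reconstruction under my own assumptions, not the nnvm_customop_bug.py attached to the issue: the names DebugSoftmax / debug_softmax and the toy FullyConnected symbol are made up, and the .asnumpy() call inside forward() stands in for the "line 15" mentioned above.

```python
import numpy as np
import mxnet as mx

class DebugSoftmax(mx.operator.CustomOp):
    """Softmax-like custom op that mixes NDArray calls with a blocking .asnumpy()."""

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0]
        # Chaining mx.nd ops with .asnumpy() inside the op is what the report
        # says triggers the hang under the threaded engine.
        y = mx.nd.exp(x - mx.nd.max(x, axis=1, keepdims=True))
        y = y / mx.nd.sum(y, axis=1, keepdims=True)
        _ = y.asnumpy()  # blocking synchronization inside the custom op
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        label = in_data[1].asnumpy().ravel().astype(np.int64)
        y = out_data[0].asnumpy()
        y[np.arange(label.shape[0]), label] -= 1.0
        self.assign(in_grad[0], req[0], mx.nd.array(y, ctx=in_grad[0].context))

@mx.operator.register("debug_softmax")
class DebugSoftmaxProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(DebugSoftmaxProp, self).__init__(need_top_grad=False)

    def list_arguments(self):
        return ['data', 'label']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        label_shape = (in_shape[0][0],)
        return [data_shape, label_shape], [data_shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return DebugSoftmax()

# Two instances of the op grouped together, matching the multiple-custom-op
# setup described above.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10)
out1 = mx.sym.Custom(data=fc, op_type='debug_softmax', name='softmax1')
out2 = mx.sym.Custom(data=fc, op_type='debug_softmax', name='softmax2')
net = mx.sym.Group([out1, out2])
```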
sxjscience commented 7 years ago

I've tried it and the script runs well. I'm using the latest dmlc/master and have compiled against the master versions of nnvm + mshadow + dmlc-core. The script runs fine on my Windows build but gets stuck on my Linux build... I've also tried both ThreadedEngine and ThreadedEnginePerDevice (see the engine-selection sketch after the log). Log:

(C:\Anaconda2) D:\HKUST\mxnet\example\numpy-ops>python nnvm_customop_bug.py
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
[23:19:24] D:\HKUST\mxnet\src\engine\engine.cc:36: MXNet start using engine: ThreadedEnginePerDevice
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(100,784)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_0=0.534000
INFO:root:Epoch[0] Batch [50]   Speed: 31645.57 samples/sec     Train-multi-accuracy_1=0.534000
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_0=0.850400
INFO:root:Epoch[0] Batch [100]  Speed: 32051.30 samples/sec     Train-multi-accuracy_1=0.850400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_0=0.887400
INFO:root:Epoch[0] Batch [150]  Speed: 31249.98 samples/sec     Train-multi-accuracy_1=0.887400
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.894000
INFO:root:Epoch[0] Batch [200]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.894000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_0=0.905000
INFO:root:Epoch[0] Batch [250]  Speed: 31055.90 samples/sec     Train-multi-accuracy_1=0.905000
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_0=0.909400
INFO:root:Epoch[0] Batch [300]  Speed: 30674.83 samples/sec     Train-multi-accuracy_1=0.909400
INFO:root:Epoch[0] Batch [350]  Speed: 31446.56 samples/sec     Train-multi-accuracy_0=0.916000
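For reference, the engine used for a run like the one above is selected through the MXNET_ENGINE_TYPE environment variable, which is read when mxnet is first imported. A minimal sketch:

```python
import os

# Must be set before the first `import mxnet`; the engine is fixed at startup.
# Recognized values include "NaiveEngine", "ThreadedEngine",
# and "ThreadedEnginePerDevice" (the default).
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx
print(mx.nd.ones((2, 2)).asnumpy())  # quick sanity check on the chosen engine
```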
piiswrong commented 7 years ago

os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4" Add this to the beginning before importing mxnet

lyttonhao commented 7 years ago

It has been fixed by #4528

coconutyao commented 5 years ago

os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4" Add this to the beginning before importing mxnet

Need some help, Thank you! Deadlock happend while calling MXNDArraySyncCopyToCPU() ?