Closed lyttonhao closed 7 years ago
I've tried it and the script runs fine. I'm using the latest dmlc/master, compiled against the master versions of nnvm + mshadow + dmlc-core.
The script runs fine on my Windows build but gets stuck on my Linux build...
I've also tried both ThreadedEngine and ThreadedEnginePerDevice.
Log
(C:\Anaconda2) D:\HKUST\mxnet\example\numpy-ops>python nnvm_customop_bug.py
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
[23:19:24] D:\HKUST\mxnet\src\engine\engine.cc:36: MXNet start using engine: ThreadedEnginePerDevice
[23:19:24] D:\HKUST\mxnet\src\io\iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(100,784)
WARNING:root:[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [50] Speed: 31645.57 samples/sec Train-multi-accuracy_0=0.534000
INFO:root:Epoch[0] Batch [50] Speed: 31645.57 samples/sec Train-multi-accuracy_1=0.534000
INFO:root:Epoch[0] Batch [100] Speed: 32051.30 samples/sec Train-multi-accuracy_0=0.850400
INFO:root:Epoch[0] Batch [100] Speed: 32051.30 samples/sec Train-multi-accuracy_1=0.850400
INFO:root:Epoch[0] Batch [150] Speed: 31249.98 samples/sec Train-multi-accuracy_0=0.887400
INFO:root:Epoch[0] Batch [150] Speed: 31249.98 samples/sec Train-multi-accuracy_1=0.887400
INFO:root:Epoch[0] Batch [200] Speed: 30674.83 samples/sec Train-multi-accuracy_0=0.894000
INFO:root:Epoch[0] Batch [200] Speed: 30674.83 samples/sec Train-multi-accuracy_1=0.894000
INFO:root:Epoch[0] Batch [250] Speed: 31055.90 samples/sec Train-multi-accuracy_0=0.905000
INFO:root:Epoch[0] Batch [250] Speed: 31055.90 samples/sec Train-multi-accuracy_1=0.905000
INFO:root:Epoch[0] Batch [300] Speed: 30674.83 samples/sec Train-multi-accuracy_0=0.909400
INFO:root:Epoch[0] Batch [300] Speed: 30674.83 samples/sec Train-multi-accuracy_1=0.909400
INFO:root:Epoch[0] Batch [350] Speed: 31446.56 samples/sec Train-multi-accuracy_0=0.916000
Add this at the beginning of the script, before importing mxnet: `os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"`
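For anyone applying this workaround, the ordering is the important part. A minimal sketch follows; the `try`/`except` guard is my own addition so the snippet also runs on a machine without mxnet, and it is not part of the original suggestion:

```python
import os

# Set the CPU worker-thread count first; the advice in this thread is
# that it must be in the environment before mxnet is imported.
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"

try:
    import mxnet as mx  # imported only after the variable is set
except ImportError:
    mx = None  # mxnet is not installed here; only the ordering is illustrated
```

Setting the variable after `import mxnet` is not expected to help, which is why it goes at the very top of the script.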
It has been fixed by #4528
Need some help, thank you! Does the deadlock happen while calling MXNDArraySyncCopyToCPU()?
I found that when using custom ops with multiple outputs, the program gets stuck. It seems that the engine is deadlocking. The problem occurs when the custom op contains code like `mx.nd.xx( xx ).asnumpy()`. It does not occur when using NaiveEngine.
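Since the hang does not show up under NaiveEngine, switching engines is a quick way to check whether a stuck script is hitting this problem. The sketch below assumes the build honors the `MXNET_ENGINE_TYPE` environment variable, and the guarded import is only there so the snippet runs without mxnet installed:

```python
import os

# Select the synchronous engine before mxnet is imported.
# Assumption: the installed build reads MXNET_ENGINE_TYPE at start-up.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

try:
    import mxnet as mx  # the engine choice is picked up on import
except ImportError:
    mx = None  # mxnet is not installed here; only the setup is illustrated
```

If the script completes with NaiveEngine but hangs with ThreadedEngine or ThreadedEnginePerDevice, that points at the engine rather than at the custom op's own logic. NaiveEngine runs operations synchronously, so it is a diagnostic aid rather than a fix.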
I have written an example to reproduce this bug. You can put this file in 'example/numpy-ops' and then run it. If we add line 15, the program gets stuck. Otherwise it works fine.
MXNet version: tested two versions.