Here is my output from train_mnist.py:
2016-01-09 12:48:47,622 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='mlp', num_epochs=10, num_examples=60000)
[12:48:51] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(128,784)
[12:48:52] src/io/iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(128,784)
2016-01-09 12:48:52,053 Node[0] Start training with [cpu(0)]
2016-01-09 12:48:53,105 Node[0] Epoch[0] Batch [50] Speed: 6447.52 samples/sec Train-accuracy=0.686719
2016-01-09 12:48:53,829 Node[0] Epoch[0] Batch [100] Speed: 8836.63 samples/sec Train-accuracy=0.793828
2016-01-09 12:48:54,660 Node[0] Epoch[0] Batch [150] Speed: 7707.90 samples/sec Train-accuracy=0.836302
2016-01-09 12:48:55,366 Node[0] Epoch[0] Batch [200] Speed: 9064.13 samples/sec Train-accuracy=0.858555
2016-01-09 12:48:56,192 Node[0] Epoch[0] Batch [250] Speed: 7749.72 samples/sec Train-accuracy=0.873969
2016-01-09 12:48:57,027 Node[0] Epoch[0] Batch [300] Speed: 7662.28 samples/sec Train-accuracy=0.885052
2016-01-09 12:48:57,808 Node[0] Epoch[0] Batch [350] Speed: 8206.58 samples/sec Train-accuracy=0.893951
2016-01-09 12:48:58,552 Node[0] Epoch[0] Batch [400] Speed: 8606.22 samples/sec Train-accuracy=0.900723
2016-01-09 12:48:59,377 Node[0] Epoch[0] Batch [450] Speed: 7758.36 samples/sec Train-accuracy=0.906563
It looks fine. Did you try pulling the newest changes and running make clean && make?
Here is mine:
C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 20:52:15,706 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[20:52:17] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[20:52:18] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 20:52:18,315 Node[0] Start training with [gpu(0)]
2016-01-09 20:52:20,598 Node[0] Epoch[0] Batch [50] Speed: 4719.76 samples/sec Train-accuracy=0.096719
2016-01-09 20:52:21,969 Node[0] Epoch[0] Batch [100] Speed: 4668.13 samples/sec Train-accuracy=0.098203
2016-01-09 20:52:23,334 Node[0] Epoch[0] Batch [150] Speed: 4688.64 samples/sec Train-accuracy=0.100625
2016-01-09 20:52:24,688 Node[0] Epoch[0] Batch [200] Speed: 4723.25 samples/sec Train-accuracy=0.100039
2016-01-09 20:52:26,042 Node[0] Epoch[0] Batch [250] Speed: 4726.74 samples/sec Train-accuracy=0.098344
2016-01-09 20:52:27,424 Node[0] Epoch[0] Batch [300] Speed: 4634.32 samples/sec Train-accuracy=0.099635
2016-01-09 20:52:28,793 Node[0] Epoch[0] Batch [350] Speed: 4671.53 samples/sec Train-accuracy=0.099955
As you can see, the accuracy stays around 9%, and even after the 10 epochs it remains the same. As for the make part, I downloaded and installed the pre-built GPU package from here: https://github.com/dmlc/mxnet/releases
My output with exactly the same command on Linux:
python train_mnist.py --network lenet --gpus 0
2016-01-09 14:18:41,245 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 14:18:43,402 Node[0] Start training with [gpu(0)]
2016-01-09 14:18:46,866 Node[0] Epoch[0] Batch [50] Speed: 2515.84 samples/sec Train-accuracy=0.810000
2016-01-09 14:18:49,499 Node[0] Epoch[0] Batch [100] Speed: 2431.10 samples/sec Train-accuracy=0.876484
2016-01-09 14:18:52,040 Node[0] Epoch[0] Batch [150] Speed: 2518.40 samples/sec Train-accuracy=0.903073
2016-01-09 14:18:54,563 Node[0] Epoch[0] Batch [200] Speed: 2537.25 samples/sec Train-accuracy=0.918750
2016-01-09 14:18:57,251 Node[0] Epoch[0] Batch [250] Speed: 2380.75 samples/sec Train-accuracy=0.928750
2016-01-09 14:18:59,741 Node[0] Epoch[0] Batch [300] Speed: 2570.31 samples/sec Train-accuracy=0.936120
2016-01-09 14:19:02,343 Node[0] Epoch[0] Batch [350] Speed: 2459.97 samples/sec Train-accuracy=0.941897
2016-01-09 14:19:04,880 Node[0] Epoch[0] Batch [400] Speed: 2523.58 samples/sec Train-accuracy=0.946660
2016-01-09 14:19:07,560 Node[0] Epoch[0] Batch [450] Speed: 2387.78 samples/sec Train-accuracy=0.950122
This seems to be a Windows-specific issue. @hjk41 Could you look into it?
Meanwhile, @jonathanponce, try using the monitor (example in example/python-howto/monitor_weights.py) to check the internal weights and outputs and see if anything is wrong.
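For reference, a minimal, self-contained sketch of the same idea, with the Monitor/FeedForward usage following the monitor_weights.py example; the throwaway MLP and random data below are placeholders standing in for the real LeNet symbol and MNIST iterators that train_mnist.py builds:
import numpy as np
import mxnet as mx

# Throwaway network and random data, only to make the snippet self-contained;
# in practice you would monitor the LeNet model from train_mnist.py.
data = mx.symbol.Variable('data')
fc = mx.symbol.FullyConnected(data=data, num_hidden=10, name='fc')
net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')
X = np.random.uniform(size=(1000, 784)).astype(np.float32)
y = np.random.randint(0, 10, size=(1000,))
train = mx.io.NDArrayIter(X, y, batch_size=128)

# Stat function: norm of each weight/gradient/output blob, normalized by the
# square root of its element count, printed every batch.
def norm_stat(d):
    return mx.nd.norm(d) / np.sqrt(np.prod(d.shape))

mon = mx.mon.Monitor(1, norm_stat)
model = mx.model.FeedForward(ctx=mx.gpu(0), symbol=net, num_epoch=1,
                             learning_rate=0.1)
model.fit(X=train, monitor=mon,
          batch_end_callback=mx.callback.Speedometer(128, 50))
In a healthy run essentially every monitored entry should be non-zero; rows stuck at 0.0 point to operations that never produced output.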
Hey, I used the monitor to check up on things and something is definitely going wrong. When I run the program on my CPU, things look quite normal:
C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet
2016-01-09 22:31:09,315 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:11,933 Node[0] Start training with [cpu(0)]
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution0_output 0.32209
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation0_output 0.263409
2016-01-09 22:31:13,413 Node[0] Batch: 1 pooling0_output 0.264198
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution1_output 0.280998
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation1_output 0.259359
2016-01-09 22:31:13,413 Node[0] Batch: 1 pooling1_output 0.283388
2016-01-09 22:31:13,413 Node[0] Batch: 1 flatten0_output 0.283388
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected0_output 0.246848
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation2_output 0.23317
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected1_output 0.16215
2016-01-09 22:31:13,413 Node[0] Batch: 1 softmax_output 0.101191
2016-01-09 22:31:13,413 Node[0] Batch: 1 softmax_backward_data 0.301412
2016-01-09 22:31:13,413 Node[0] Batch: 1 softmax_backward_label 0.0
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected1_backward_data 0.0376285
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected1_backward_weight 1.13253
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected1_backward_bias 3.8101
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation2_backward_data 0.0356833
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected0_backward_data 0.0252012
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected0_backward_weight 0.163174
2016-01-09 22:31:13,413 Node[0] Batch: 1 fullyconnected0_backward_bias 0.458921
2016-01-09 22:31:13,413 Node[0] Batch: 1 flatten0_backward_data 0.0252012
2016-01-09 22:31:13,413 Node[0] Batch: 1 pooling1_backward_data 0.0126023
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation1_backward_data 0.0116884
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution1_backward_data 0.010943
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution1_backward_weight 0.494861
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution1_backward_bias 1.24864
2016-01-09 22:31:13,413 Node[0] Batch: 1 pooling0_backward_data 0.00705877
2016-01-09 22:31:13,413 Node[0] Batch: 1 activation0_backward_data 0.00671425
2016-01-09 22:31:13,413 Node[0] Batch: 1 convolution0_backward_data 0.0251948
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution0_backward_weight 0.832047
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution0_backward_bias 4.85974
2016-01-09 22:31:13,428 Node[0] Batch: 1 data 0.33463
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution0_weight 0.175653
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution0_bias 0.00379667
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution1_weight 0.0395973
2016-01-09 22:31:13,428 Node[0] Batch: 1 convolution1_bias 0.000975498
2016-01-09 22:31:13,428 Node[0] Batch: 1 fullyconnected0_weight 0.031241
2016-01-09 22:31:13,428 Node[0] Batch: 1 fullyconnected0_bias 0.000358532
2016-01-09 22:31:13,428 Node[0] Batch: 1 fullyconnected1_weight 0.0393582
2016-01-09 22:31:13,428 Node[0] Batch: 1 fullyconnected1_bias 0.00297664
2016-01-09 22:31:13,428 Node[0] Batch: 1 softmax_label 5.14174
But when I use my GPU, most of the values are zero. Maybe they are being rounded off, or something is wrong with the precision?
C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 22:31:49,494 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:51] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:52] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:52,048 Node[0] Start training with [gpu(0)]
2016-01-09 22:31:52,996 Node[0] Batch: 1 convolution0_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 activation0_output 152988.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 pooling0_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 convolution1_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 activation1_output 32342.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 pooling1_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 flatten0_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected0_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 activation2_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected1_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 softmax_output 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 softmax_backward_data 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 softmax_backward_label 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected1_backward_data 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected1_backward_weight 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected1_backward_bias 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 activation2_backward_data 0.0
2016-01-09 22:31:52,996 Node[0] Batch: 1 fullyconnected0_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected0_backward_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected0_backward_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 flatten0_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 pooling1_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 activation1_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution1_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution1_backward_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution1_backward_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 pooling0_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 activation0_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution0_backward_data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution0_backward_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution0_backward_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 data 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution0_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution0_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution1_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 convolution1_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected0_weight 39.2047
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected0_bias 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected1_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch: 1 fullyconnected1_bias 390.408
2016-01-09 22:31:53,013 Node[0] Batch: 1 softmax_label 0.0
Could you try to do some simple arithmetic on the GPU with:
import mxnet as mx
x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
x[:] = 1
x = x*2
print x.asnumpy()
It returns an array of zeros; it seems as if the operations are not taking place or are all returning zero.
>>> print x.asnumpy()
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Could you try to run CUDA's sample code for matrix multiply and see if the results are normal?
I ran the sample code and everything seems to be OK:
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 660M" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4.40 GFlop/s, Time= 29.805 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The results are as expected; it seems to be something to do with mxnet.
I can't reproduce the problem locally, so I can't think of anything right now. You can try git bisect (https://git-scm.com/docs/git-bisect) to see if it's a recently introduced bug.
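For what it's worth, a bisect session would look roughly like this (a sketch only; it assumes you rebuild libmxnet from source at each step, and <last-good> is a placeholder for whatever commit or release tag corresponds to the last build that worked for you):
git clone --recursive https://github.com/dmlc/mxnet && cd mxnet
git bisect start
git bisect bad                # current HEAD reproduces the zero-output bug
git bisect good <last-good>   # e.g. the commit the working 20151228 build came from
# rebuild libmxnet, rerun the mx.nd test above, then tell git the outcome:
git bisect good               # or: git bisect bad
# repeat until git reports the first bad commit, then clean up:
git bisect reset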
I tried the previous Windows build and it worked without a problem, so that means Windows binary build 20160106 has a bug in the GPU computation path. There have been 29 commits since then, so it's possible it has already been fixed.
Even if it is just to back up @jonathanponce: I have exactly the same problem. Running train_mnist.py without --gpus 0 gives an accuracy of about 0.97, but running with --gpus 0 gives an accuracy of about 0.07.
I use Windows 7 64-bit with Python 2.7 and have tried Windows binary builds 20160120 and 20160113. Both have the same problem for me.
@hjk41 It looks like the GPU code is not running, yet not reporting an error, on Windows with their cards. Could you look into it?
@piiswrong I watched the GPU load with GPU-Z when running the mxnet code and it is around 25%, so the code is using my GPU.
This post reports on the same issue: https://www.kaggle.com/c/second-annual-data-science-bowl/forums/t/18079/end-to-end-deep-learning-tutorial-0-0392/105458#post105458
I ran into the same situation as well. Not sure yet if the earlier releases solve the problem.
Same issue here with mxnet and Python. I installed the latest Windows build, 20160202, and while training a network the accuracy wasn't increasing. The computation was taking place on the GPU, because I checked it with GPU-Z. I did the simple arithmetic tests on the GPU mentioned by @piiswrong and it gave me zeroes.
So I switched to the 20151228 build and now it works OK.
So the bug from 20160106 definitely still exists in 20160202. Hope it helps.
@piiswrong @Quares @JohanManders @gpapadop79 Sorry it took me so long to respond; I was fully occupied with an internal conference the last few weeks. I just tried with 20160202 and the simple test seems to work all right for me. I guess it must be something on the system configuration side. I am using Windows Server 2012 Datacenter, Python 2.7.10 x64. I will try to switch to some other platform and see if it works there.
Meanwhile, could you help me narrow down the problem a little bit? Here are some speculations:
Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
OpenCV is unavailable.
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 1., 1., 1.],
[ 1., 1., 1.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
Just tried on another machine with Windows Server 2012 R2, Python 2.7.10 x64; it also works fine. :-( I think I need some help here. It would be great if someone is willing to share a machine that can reproduce the problem.
Looks like it's caused by GPUs with low CUDA compute capability.
Could be. I am running a Titan. Does this also occur for low compute capability GPUs on Linux?
I have a GTX 670 and when I boot into Ubuntu, mxnet works fine. In Windows I cannot get it to work.
I ran some tests on my Windows 7 64-bit machine, using Windows binary build 20160216. An earlier build does the same for me.
C:\Users\XXXXX>where libmxnet.dll
C:\Anaconda\Lib\site-packages\mxnet-0.5.0-py2.7.egg\mxnet\libmxnet.dll
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>matrixMulCUBLAS.exe
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 670" with compute capability 3.0
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 1059.89 GFlop/s, Time= 0.185 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Jan 29 2016, 14:26:21) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import mxnet as mx
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 0., 0., 0.],
[ 0., 0., 0.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
>>>
@jonathanponce So it is not related to compute capability, since both the GTX 670 and the Titan have compute capability 3.0. Could you try to run a C++ program? You can try this one: https://github.com/hjk41/MxNet.cpp.git
Check out the test branch and copy libmxnet.lib/libmxnet.dll to lib/windows/, then build the solution in windows/vs/MxNetTestApp/MxNetTestApp.sln with x64. The program just creates an NDArray on the GPU, populates it with ones and then prints it out. This is pretty much what mx.nd.ones((2,3), mx.gpu(0)) does.
@hjk41 Did you want me to do the test? If so, I cloned the test branch, copied the dll and lib file (the lib file was also needed) and built the solution successfully. I don't know what should happen or how long it should take, but running the program seems to do nothing.
@JohanManders The program should output a series of digits from 0 to 5. If it prints nothing, then there must be something wrong. It means the problem also occurs for C++ programs.
@hjk41 Mmm... Strange... Building CUDA samples like marchingCubes, matrixMulCUBLAS and particles is no problem, and they run perfectly.
I also ran matrixMulCUBLAS and it passes.
My environment is Windows 7 x64, Python 2.7.11 (Anaconda 2.5.0) and a GTX 960 (which has compute capability 5.2).
Thanks guys. I think I will have to reinstall one of my machines to use Windows 7 to reproduce the problem, which will need some time. Meanwhile, if someone can try to debug the problem, it would be great. With the C++ program, it shouldn't be too hard.
So I assume the new (7th) release doesn't solve the issue yet? How is it with you @JohanManders? I haven't had time to work on my desktop to test it yet.
@Quares I have tried the latest build, Windows binary build 20160216, and I still have the problem.
I just found that I have the same problem; I tried both the mnist and cifar10 examples. I am using a GTX 980 and Windows 10.
I tried different builds and found that all snapshots after build 20151228 do not work. I also noticed that the file size has shrunk a lot since build 20151228 --- are there changes in the compilation config?
I tried linking to the provided CUDA/cuDNN DLL files and also to my own DLLs (same version) via a different PATH; that did not work either.
Perhaps it is a compiler / OS level issue.
@thyu Could you try the C++ program in the test branch of https://github.com/hjk41/MxNet.cpp.git? I have reproduced the problem on a Windows 10 machine in Python, but the C++ program runs just fine.
@hjk41 Seems fine?
$ ./MxnetTestApp.exe
0 1 2 3 4 5
Yes. So it seems to be something in the Python/R binding or how they use the library.
@thyu @jonathanponce @JohanManders @Quares @piiswrong Could you help me check with the latest binary build here? https://github.com/dmlc/mxnet/releases/tag/20160223 I think it is a problem with the CUDA library. Windows Server 2012 and Windows 10/8 use different CUDA binaries, so I assume there is some difference between the libraries we link against. The libmxnet.dll compiled on Windows Server 2012 does not work on Windows 10/8, and vice versa.
The latest binary was compiled on Windows 10 and it works well on my machine, but I don't have another Windows 10/8/7 machine to test on. So could you help me validate this?
@hjk41 Your latest build seems to work perfectly! Thanks man, this helps me a lot! The mnist example now outputs a train accuracy of 0.999 and a validation accuracy of 0.991.
@JohanManders Great! I will use Windows 10 in the future for building the binary distribution.
Great news! I will test it in the evening and will let you know.
@hjk41 The latest build works!!! Thanks! You rock!!!!
@JohanManders @Quares @hjk41 Did anyone else notice a small decrease in performance on the latest release?
When I trained a model with the 20151228 release, it needed about 15.5 sec/epoch. Now, with the latest release, training the exact same model takes 18.5 sec/epoch.
I am happy that it works, but I also see a big speed difference between Windows and Ubuntu. For Windows I downloaded the latest pre-built package, 20160223. For Ubuntu I just downloaded the latest version and built it.
I did two tests on my dual-boot i7 system with a GTX 670:
train_mnist.py
Windows 7 | Build 20160223: ~6750 samples/sec
Ubuntu | Downloaded and built a few minutes ago: ~20000 samples/sec
Training on other data
Windows 7 | Build 20160223: ~49 sec/epoch
Ubuntu | Downloaded and built a few minutes ago: ~31 sec/epoch
Darn! I must switch to linux! :-P
My speed difference is on Windows 7, between pre-built 20151228 and 20160223. I also tried cuDNN 4 but saw no difference.
The new release (20160223) works on my Windows 10 machine. Great work, guys!
Side note: I also noticed a (in my case substantial) decrease in speed, but that's probably related to various other things happening.
EDIT: By the way, is it possible to use cuDNN v4? Until now I was under the impression that only v3 is supported.
The new release works on my machine as well, awesome!
I have also observed for quite a while that Linux is faster than Windows. I have a Linux box with a GTX 970 which runs at around 700 images per sec on train_cifar10, but on my company's Windows machine it is only slightly above 620 images per sec. It might not be a single-factor issue, and perhaps we can improve it afterwards...
Interesting. I will take a look into it. I expect there to be some performance difference between Windows and Linux, but didn't expect it to be so huge.
Here are my results. python train_mnist.py:
GTX980, Windows 10: 20000 samples/sec
Titan, Ubuntu 14.04: 40000 samples/sec
Also, CPU runs much faster in this case, at around 50000 samples/sec.
python train_cifar10.py:
GTX980, Windows 10: 396 samples/sec
Titan, Ubuntu 14.04: 445 samples/sec
So my guess is that Windows has higher overhead for small GPU operations. In MNIST the computation is so light that this overhead dominates, and thus CPU > GPU-Linux > GPU-Windows. In CIFAR-10 there are far fewer operations, and hence the difference is much smaller. Has anyone tried running heavier workloads like ImageNet?
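One rough way to probe that guess (a minimal sketch using only the public NDArray API; the shapes, op counts and the elementwise add are arbitrary choices, not part of any mxnet benchmark) is to time many tiny operations against a few large ones on both devices:
import time
import mxnet as mx

def bench(shape, n_ops, ctx):
    # Queue n_ops element-wise adds on the given device; asnumpy() blocks
    # until the whole chain has finished, so the wall time includes both
    # the kernels and the per-op dispatch/launch overhead.
    x = mx.nd.ones(shape, ctx)
    x.asnumpy()                    # warm up / finish allocation
    start = time.time()
    for _ in range(n_ops):
        x = x + 1
    x.asnumpy()
    return time.time() - start

for ctx in [mx.cpu(0), mx.gpu(0)]:
    print ctx, 'many small ops:', bench((16, 16), 1000, ctx)
    print ctx, 'few large ops :', bench((2048, 2048), 10, ctx)
If the small-op case on the GPU is dominated by a roughly constant per-op cost that is larger on Windows than on Linux, that would be consistent with the MNIST numbers above.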
@Quares According to this PR: https://github.com/dmlc/mxnet/pull/1449, cuDNN 4 is supported.
@gpapadop79 Cool, great to know! I was under the impression it was cuDNN 3 only because the documentation doesn't mention cuDNN 4 yet. I am interested to see how the performance changes between cuDNN 3 and cuDNN 4. I am running a GTX 660 Ti and a GTX 980 Ti on two separate machines, so I have a nice overview of the performance upgrade between the two cards.
Hi,
Installed the GPU-enabled R library (R version 3.2.3) on Windows 7 today. It looks like it's working on the CPU but not on my GPU. The code seems to execute on the GPU (confirmed with GPU-Z), but the error improvement stalls after the second round using the example code (as in the merged issue above), and mx.nd.ones(c(2,3), mx.gpu()) generates a table of 0's, not 1's.
I'm using the latest files for everything, and the precompiled GPU package for R. I read in this thread: https://github.com/dmlc/mxnet/issues/250 that you should remove USE_CUDNN to compile for CUDA compute capability 2.1 GPUs (and lower). I'm using a 2.1 GPU. Could this be the problem?
Could using an earlier 2015 release be a solution? Also, might I have to compile my own GPU-enabled files without USE_CUDNN to fix this? I'm hoping I don't have to upgrade my computer to get this working, as I'm only doing preliminary testing.
Any help would be appreciated.
Thanks, Gavin.
I haven't tested it on a GPU with compute capability 2.1. I guess the pre-built binary may not work for you, since it is compiled for compute capability 3.5. Could you try to compile from source and see if it works?
Will do, thanks.
Hey, I'm quite new to mxnet. I followed the installation instructions and succeeded in installing it on Windows 8.1 64-bit. I then ran train_mnist.py --network lenet without a problem; it is quite slow, but the accuracy at the end is good, at around 99.2%. But when I run it with --network lenet --gpus 0 to use my GPU, it is definitely a lot faster, yet the accuracy never gets above 10%, which is terrible. There must be something wrong; theoretically it should reach the same accuracy, right? I installed CUDA 7.5 and also extracted cuDNN v3 just as indicated, and everything runs without a problem except that the accuracy is terrible. I'm running on a laptop with an NVIDIA 660M graphics card, which has compute capability 3.0.
After running the file I get Train-accuracy=0.098825