apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Windows GPU accuracy extremely bad #1228

Closed jonathanponce closed 7 years ago

jonathanponce commented 8 years ago

Hey, I'm quite new to mxnet. I followed the installation instructions and succeeded in installing it on Windows 8.1 64-bit. I then ran train_mnist.py --network lenet without a problem; it was quite slow, but the accuracy at the end is good, at around 99.2%. However, when I run it with --network lenet --gpus 0 to use my GPU, it's definitely a lot faster, but the accuracy never gets above 10%, which is terrible. There must be something wrong; theoretically it should reach the same accuracy, right? I installed CUDA 7.5 and also extracted cuDNN v3 just as indicated, and everything runs without errors except that the accuracy is terrible. I'm running on a laptop with an NVIDIA 660M graphics card, which has compute capability 3.0.

After running the file I get Train-accuracy=0.098825

piiswrong commented 8 years ago

Here is my output from train_mnist.py:

2016-01-09 12:48:47,622 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='mlp', num_epochs=10, num_examples=60000)
[12:48:51] src/io/iter_mnist.cc:91: MNISTIter: load 60000 images, shuffle=1, shape=(128,784)
[12:48:52] src/io/iter_mnist.cc:91: MNISTIter: load 10000 images, shuffle=1, shape=(128,784)
2016-01-09 12:48:52,053 Node[0] Start training with [cpu(0)]
2016-01-09 12:48:53,105 Node[0] Epoch[0] Batch [50] Speed: 6447.52 samples/sec  Train-accuracy=0.686719
2016-01-09 12:48:53,829 Node[0] Epoch[0] Batch [100]    Speed: 8836.63 samples/sec  Train-accuracy=0.793828
2016-01-09 12:48:54,660 Node[0] Epoch[0] Batch [150]    Speed: 7707.90 samples/sec  Train-accuracy=0.836302
2016-01-09 12:48:55,366 Node[0] Epoch[0] Batch [200]    Speed: 9064.13 samples/sec  Train-accuracy=0.858555
2016-01-09 12:48:56,192 Node[0] Epoch[0] Batch [250]    Speed: 7749.72 samples/sec  Train-accuracy=0.873969
2016-01-09 12:48:57,027 Node[0] Epoch[0] Batch [300]    Speed: 7662.28 samples/sec  Train-accuracy=0.885052
2016-01-09 12:48:57,808 Node[0] Epoch[0] Batch [350]    Speed: 8206.58 samples/sec  Train-accuracy=0.893951
2016-01-09 12:48:58,552 Node[0] Epoch[0] Batch [400]    Speed: 8606.22 samples/sec  Train-accuracy=0.900723
2016-01-09 12:48:59,377 Node[0] Epoch[0] Batch [450]    Speed: 7758.36 samples/sec  Train-accuracy=0.906563

It looks fine. Did you try pulling the newest changes and running make clean && make?
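Rebuilding from a fresh pull looks roughly like this (a sketch for the Linux/Makefile build; the make flags and paths are assumptions, so adjust them to your config.mk and CUDA install):

git pull
git submodule update --init --recursive
make clean
make -j4 USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
cd python && python setup.py install   # or add the repo's python/ directory to PYTHONPATH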

jonathanponce commented 8 years ago

here is mine:


C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 20:52:15,706 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[20:52:17] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[20:52:18] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 20:52:18,315 Node[0] Start training with [gpu(0)]
2016-01-09 20:52:20,598 Node[0] Epoch[0] Batch [50]     Speed: 4719.76 samples/sec      Train-accuracy=0.096719
2016-01-09 20:52:21,969 Node[0] Epoch[0] Batch [100]    Speed: 4668.13 samples/sec      Train-accuracy=0.098203
2016-01-09 20:52:23,334 Node[0] Epoch[0] Batch [150]    Speed: 4688.64 samples/sec      Train-accuracy=0.100625
2016-01-09 20:52:24,688 Node[0] Epoch[0] Batch [200]    Speed: 4723.25 samples/sec      Train-accuracy=0.100039
2016-01-09 20:52:26,042 Node[0] Epoch[0] Batch [250]    Speed: 4726.74 samples/sec      Train-accuracy=0.098344
2016-01-09 20:52:27,424 Node[0] Epoch[0] Batch [300]    Speed: 4634.32 samples/sec      Train-accuracy=0.099635
2016-01-09 20:52:28,793 Node[0] Epoch[0] Batch [350]    Speed: 4671.53 samples/sec      Train-accuracy=0.099955

As you can see, the accuracy stays in the 9-10% range, and even after the 10 epochs it remains the same. As for the make step, I didn't build from source; I downloaded and installed the pre-built GPU package from https://github.com/dmlc/mxnet/releases

piiswrong commented 8 years ago

My output with exactly the same command on Linux:

python train_mnist.py --network lenet --gpus 0
2016-01-09 14:18:41,245 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[14:18:43] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 14:18:43,402 Node[0] Start training with [gpu(0)]
2016-01-09 14:18:46,866 Node[0] Epoch[0] Batch [50] Speed: 2515.84 samples/sec  Train-accuracy=0.810000
2016-01-09 14:18:49,499 Node[0] Epoch[0] Batch [100]    Speed: 2431.10 samples/sec  Train-accuracy=0.876484
2016-01-09 14:18:52,040 Node[0] Epoch[0] Batch [150]    Speed: 2518.40 samples/sec  Train-accuracy=0.903073
2016-01-09 14:18:54,563 Node[0] Epoch[0] Batch [200]    Speed: 2537.25 samples/sec  Train-accuracy=0.918750
2016-01-09 14:18:57,251 Node[0] Epoch[0] Batch [250]    Speed: 2380.75 samples/sec  Train-accuracy=0.928750
2016-01-09 14:18:59,741 Node[0] Epoch[0] Batch [300]    Speed: 2570.31 samples/sec  Train-accuracy=0.936120
2016-01-09 14:19:02,343 Node[0] Epoch[0] Batch [350]    Speed: 2459.97 samples/sec  Train-accuracy=0.941897
2016-01-09 14:19:04,880 Node[0] Epoch[0] Batch [400]    Speed: 2523.58 samples/sec  Train-accuracy=0.946660
2016-01-09 14:19:07,560 Node[0] Epoch[0] Batch [450]    Speed: 2387.78 samples/sec  Train-accuracy=0.950122

This seems to be a Windows-specific issue. @hjk41 Could you look into it?

Meanwhile, @jonathanponce, try using the monitor (example in example/python-howto/monitor_weights.py) to check the internal weights and outputs and see if anything looks wrong.
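Roughly, attaching a monitor with the old FeedForward API looks like the sketch below; net, train_iter and val_iter are placeholders for whatever the script builds, and the interval/pattern values are just examples:

import numpy as np
import mxnet as mx

# print the average L2 norm of each monitored array, so all-zero layers stand out
def norm_stat(d):
    return mx.nd.norm(d) / np.sqrt(d.size)

mon = mx.mon.Monitor(
    100,           # print every 100 batches
    norm_stat,     # statistic computed for every monitored array
    pattern='.*',  # monitor everything; use '.*weight' for weights only
    sort=True)

model = mx.model.FeedForward(ctx=mx.gpu(0), symbol=net,
                             num_epoch=10, learning_rate=0.1)
model.fit(X=train_iter, eval_data=val_iter,
          monitor=mon,
          batch_end_callback=mx.callback.Speedometer(128, 50))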

jonathanponce commented 8 years ago

Hey, I used the monitor to check up on things and something is definitely happening. When I run the program on my CPU, things look quite normal:


C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet
2016-01-09 22:31:09,315 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:11] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:11,933 Node[0] Start training with [cpu(0)]
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution0_output            0.32209
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation0_output             0.263409
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling0_output                0.264198
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_output            0.280998
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation1_output             0.259359
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling1_output                0.283388
2016-01-09 22:31:13,413 Node[0] Batch:       1 flatten0_output                0.283388
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_output         0.246848
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation2_output             0.23317
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_output         0.16215
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_output                 0.101191
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_backward_data          0.301412
2016-01-09 22:31:13,413 Node[0] Batch:       1 softmax_backward_label         0.0
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_data  0.0376285
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_weight 1.13253
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected1_backward_bias  3.8101
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation2_backward_data      0.0356833
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_data  0.0252012
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_weight 0.163174
2016-01-09 22:31:13,413 Node[0] Batch:       1 fullyconnected0_backward_bias  0.458921
2016-01-09 22:31:13,413 Node[0] Batch:       1 flatten0_backward_data         0.0252012
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling1_backward_data         0.0126023
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation1_backward_data      0.0116884
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_data     0.010943
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_weight   0.494861
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution1_backward_bias     1.24864
2016-01-09 22:31:13,413 Node[0] Batch:       1 pooling0_backward_data         0.00705877
2016-01-09 22:31:13,413 Node[0] Batch:       1 activation0_backward_data      0.00671425
2016-01-09 22:31:13,413 Node[0] Batch:       1 convolution0_backward_data     0.0251948
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_backward_weight   0.832047
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_backward_bias     4.85974
2016-01-09 22:31:13,428 Node[0] Batch:       1 data                           0.33463
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_weight            0.175653
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution0_bias              0.00379667
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution1_weight            0.0395973
2016-01-09 22:31:13,428 Node[0] Batch:       1 convolution1_bias              0.000975498
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected0_weight         0.031241
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected0_bias           0.000358532
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected1_weight         0.0393582
2016-01-09 22:31:13,428 Node[0] Batch:       1 fullyconnected1_bias           0.00297664
2016-01-09 22:31:13,428 Node[0] Batch:       1 softmax_label                  5.14174

but when I use my GPU, most of the weights and outputs are zero. Maybe they are being rounded off, or something is wrong with the precision?

C:\mxnet\nocudnn\python\image-classification>D:\Python27\python.exe train_mnist.py --network lenet --gpus 0
2016-01-09 22:31:49,494 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus='0', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, network='lenet', num_epochs=10, num_examples=60000)
[22:31:51] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(128, 1, 28, 28)
[22:31:52] D:\chhong\mxnet\src\io\iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(128, 1, 28, 28)
2016-01-09 22:31:52,048 Node[0] Start training with [gpu(0)]
2016-01-09 22:31:52,996 Node[0] Batch:       1 convolution0_output            0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation0_output             152988.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 pooling0_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 convolution1_output            0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation1_output             32342.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 pooling1_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 flatten0_output                0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected0_output         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation2_output             0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_output         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_output                 0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_backward_data          0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 softmax_backward_label         0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_data  0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_weight 0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected1_backward_bias  0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 activation2_backward_data      0.0
2016-01-09 22:31:52,996 Node[0] Batch:       1 fullyconnected0_backward_data  0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_backward_weight 0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_backward_bias  0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 flatten0_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 pooling1_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 activation1_backward_data      0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_data     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_weight   0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_backward_bias     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 pooling0_backward_data         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 activation0_backward_data      0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_data     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_weight   0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_backward_bias     0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 data                           0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_weight            0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution0_bias              0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_weight            0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 convolution1_bias              0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_weight         39.2047
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected0_bias           0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected1_weight         0.0
2016-01-09 22:31:53,013 Node[0] Batch:       1 fullyconnected1_bias           390.408
2016-01-09 22:31:53,013 Node[0] Batch:       1 softmax_label                  0.0
piiswrong commented 8 years ago

Could you try to do some simple arithmetic on the GPU with:

x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
x[:] = 1
x = x*2
print x.asnumpy()
jonathanponce commented 8 years ago

It returns an array of zeros; it seems as if the operations are not taking place, or they are all returning zero.

>>> print x.asnumpy()
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
piiswrong commented 8 years ago

Could you try running CUDA's sample code for matrix multiplication and see if the results are normal?

jonathanponce commented 8 years ago

I ran the sample code and everything seems to be OK:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 660M" with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 4.40 GFlop/s, Time= 29.805 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The results are as expected, so it seems to be something to do with mxnet.

piiswrong commented 8 years ago

I can't reproduce the problem locally, so I can't think of anything right now. You can try git bisect (https://git-scm.com/docs/git-bisect) to see if it's a recently introduced bug.
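For anyone bisecting from source, the workflow is roughly the following (the 20151228 revision is only an assumed example of a known-good point; substitute whatever commit or tag you last saw working):

git bisect start
git bisect bad HEAD          # current revision shows the bug
git bisect good 20151228     # assumed last-known-good tag/commit
# rebuild, rerun the mx.nd GPU check, then mark the result:
git bisect good              # or: git bisect bad
git bisect reset             # when the first bad commit is found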

jonathanponce commented 8 years ago

I tried out the previous Windows build and it worked without a problem, so that means Windows binary build 20160106 has a bug in the GPU computation path. There have been 29 commits since then, so it's possible it has already been fixed.

JohanManders commented 8 years ago

Even if it is just to back up @jonathanponce, I have exactly the same problem. Running train_mnist.py without --gpus 0 gives an accuracy of about 0.97, but running with --gpus 0 gives an accuracy of about 0.07.

I use Windows 7 64-bit with Python 2.7 and have tried Windows binary builds 20160120 and 20160113. Both have the same problem for me.

piiswrong commented 8 years ago

@hjk41 It looks like the GPU code is not running, but also not reporting an error, on Windows with their cards. Could you look into it?

JohanManders commented 8 years ago

@piiswrong I watched the GPU load with GPU-Z while running the mxnet code and it was around 25%, so the code is using my GPU.

Quares commented 8 years ago

This post reports on the same issue: https://www.kaggle.com/c/second-annual-data-science-bowl/forums/t/18079/end-to-end-deep-learning-tutorial-0-0392/105458#post105458

I ran into the same situation as well. Not sure yet if the earlier releases solve the problem.

gpapadop79 commented 8 years ago

Same issue here with mxnet and Python. I installed the latest Windows build, 20160202, and while training a network the accuracy wasn't increasing. The computation was taking place on the GPU, because I checked it with GPU-Z. I did the simple arithmetic tests on the GPU mentioned by @piiswrong and they gave me zeroes.

So I switched to the 20151228 build and now it works OK.

So the bug from 20160106 definitely still exists in 20160202. Hope it helps.

hjk41 commented 8 years ago

@piiswrong @Quares @JohanManders @gpapadop79 Sorry it took me so long to respond; I was fully occupied with an internal conference the last few weeks. I just tried with 20160202 and the simple test seems to work all right for me. I guess it must be something on the system-configuration side. I am using Windows Server 2012 Datacenter, Python 2.7.10 x64. I will try to switch to some other platform and see if it works there.

Meanwhile, could you help me narrow down the problem a little bit? Here are some speculations:

  1. run "where libmxnet.dll" and see if you are using the right version of libmxnet.dll
  2. run matrixMulCuBLAS from nvidia CUDA samples and see if it works
  3. try building mxnet from source and do the test again

For reference, here is the session on my machine:

Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
OpenCV is unavailable.
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]]
hjk41 commented 8 years ago

Just tried on another machine with Windows Server 2012 R2, Python 2.7.10 x64; it also works fine there. :-( I think I need some help here. It would be great if someone is willing to share a machine that can reproduce the problem.

piiswrong commented 8 years ago

Looks like it's related to GPUs with low CUDA compute capability.

hjk41 commented 8 years ago

Could be. I am running Titan. Does this also occur for low compute capability GPUs on Linux?

JohanManders commented 8 years ago

I have a GTX 670 and when I boot into Ubuntu, mxnet works fine. In Windows I cannot get it to work.

I ran some tests on my Windows 7 64-bit machine using Windows binary build 20160216. Using an earlier build does the same for me.

C:\Users\XXXXX>where libmxnet.dll
C:\Anaconda\Lib\site-packages\mxnet-0.5.0-py2.7.egg\mxnet\libmxnet.dll
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>matrixMulCUBLAS.exe
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 670" with compute capability 3.0

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 1059.89 GFlop/s, Time= 0.185 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Jan 29 2016, 14:26:21) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import mxnet as mx
>>> a = mx.nd.ones((2,3), mx.gpu(0))
>>> a.asnumpy()
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]], dtype=float32)
>>> x = mx.nd.zeros((10,10), ctx=mx.gpu(0))
>>> x[:] = 1
>>> x = x*2
>>> print x.asnumpy()
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
>>>
hjk41 commented 8 years ago

@jonathanponce So it is not related to compute capability, since both the GTX 670 and the Titan have compute capability 3.0. Could you try running a C++ program? You can try this one: https://github.com/hjk41/MxNet.cpp.git

Check out the test branch and copy libmxnet.lib/libmxnet.dll to lib/windows/, then build the solution in windows/vs/MxNetTestApp/MxNetTestApp.sln for x64. The program just creates an NDArray on the GPU, populates it with ones, and prints it out. This is pretty much what mx.nd.ones((2,3), mx.gpu(0)) does.

JohanManders commented 8 years ago

@hjk41 Did you want me to do the test? If so: I cloned the test branch, copied the dll and lib files (the lib file was also needed) and built the solution successfully. I don't know what should happen or how long it should take, but running the program seems to do nothing.

hjk41 commented 8 years ago

@JohanManders The program should output a series of digits from 0 to 5. If it prints nothing, then there must be something wrong. It means the problem also occurs for C++ programs.

JohanManders commented 8 years ago

@hjk41 Mmm... Strange... Building CUDA samples like marchingCubes, matrixMulCUBLAS and particles is no problem, and they run perfectly.

gpapadop79 commented 8 years ago

I also ran matrixMulCUBLAS and it passes.

My environment is Windows 7 x64 python 2.7.11 (Anaconda 2.5.0) and GTX 960 (which has compute capability 5.2)

hjk41 commented 8 years ago

Thanks guys. I think I will have to reinstall one of my machines with Windows 7 to reproduce the problem, which will take some time. Meanwhile, if someone can try to debug the problem, that would be great. With the C++ program, it shouldn't be too hard.

Quares commented 8 years ago

So I assume the new (7th) release doesn't solve the issue yet? How is it for you, @JohanManders? I haven't had time to get on my desktop to test it yet.

JohanManders commented 8 years ago

@Quares I have tried the latest build, Windows binary build 20160216, and I still have the problem.

thyu commented 8 years ago

I just found that I have the same problem; I tried both the mnist and cifar10 examples. I am using a GTX 980 and Windows 10.

I tried different builds and found that all snapshots after build 20151228 do not work. I also noticed that the file size has shrunk considerably since build 20151228: were there changes in the compilation config?

thyu commented 8 years ago

I tried linking against the provided CUDA/cuDNN DLL files and also against my own DLLs (same versions) via a different PATH; neither worked.

Perhaps it is a compiler or OS level issue.

hjk41 commented 8 years ago

@thyu Could you try the C++ program in the test branch of https://github.com/hjk41/MxNet.cpp.git? I have recreated the problem with a Windows 10 machine in Python, but the C++ program runs just fine.

thyu commented 8 years ago

@hjk41 Seems fine?

$ ./MxnetTestApp.exe
0 1 2 3 4 5
hjk41 commented 8 years ago

Yes. So it seems to be something in the Python/R bindings, or in how they use the library.


hjk41 commented 8 years ago

@thyu @jonathanponce @JohanManders @Quares @piiswrong Could you help me check the latest binary build here: https://github.com/dmlc/mxnet/releases/tag/20160223? I think it is a problem with the CUDA libraries. Windows Server 2012 and Windows 10/8 use different CUDA binaries, so I assume there is some difference between the libraries we link against. The libmxnet.dll compiled on Windows Server 2012 does not work on Windows 10/8, and vice versa.

The latest binary was compiled on Windows 10 and it works well on my machine, but I don't have another Windows 10/8/7 machine to test on. Could you help me validate this?

JohanManders commented 8 years ago

@hjk41 Your latest build seems to work perfectly! Thanks, this helps me a lot! The mnist example now reaches a train accuracy of 0.999 and a validation accuracy of 0.991.

hjk41 commented 8 years ago

@JohanManders Great! I will use Windows 10 in the future for building the binary distribution.

Quares commented 8 years ago

Great news! I will test it in the evening and will let you know.

gpapadop79 commented 8 years ago

@hjk41 The latest build works!!! Thanks! You rock!!!!

gpapadop79 commented 8 years ago

@JohanManders @Quares @hjk41 Did anyone else notice a small decrease in performance with the latest release?

When I trained a model with the 20151228 release, it needed about 15.5 sec/epoch. Now, with the latest release, training the exact same model takes 18.5 sec/epoch.

JohanManders commented 8 years ago

I am happy that it works, but I also see a big speed difference between Windows and Ubuntu. For Windows I downloaded the latest pre-built package, 20160223. For Ubuntu I just downloaded the latest version and built it.

I did two tests on my dual-boot i7 system with a GTX 670:

train_mnist.py

Windows 7 | Build 20160223                          : ~  6750 samples / sec
Ubuntu    | Downloaded and built a few minutes ago  : ~ 20000 samples / sec

Training on other data

Windows 7 | Build 20160223                          : ~ 49 sec / epoch
Ubuntu    | Downloaded and built a few minutes ago  : ~ 31 sec / epoch

gpapadop79 commented 8 years ago

Darn! I must switch to Linux! :-P

My speed difference is on Windows 7, between pre-built 20151228 and 20160223. I also tried cuDNN 4 but saw no difference.

Quares commented 8 years ago

The new release (20160223) works on my Windows 10 machine. Great work, guys!

Side note: I also noticed a (in my case substantial) decrease in speed, but that's probably related to various other things happening.

EDIT: By the way, is it possible to use cuDNN v4? Until now I was under the impression that only v3 is supported.

thyu commented 8 years ago

The new release works on my machine as well, awesome!

I have also observed for quite a while that Linux is faster than Windows. I have a Linux box with a GTX 970 which runs at around 700 images per second on train_cifar10, but on my company's Windows machine it is only slightly above 620 images per second. It might not be a single-factor issue, and perhaps we can improve it afterwards...

hjk41 commented 8 years ago

Interesting. I will take a look into it. I expected there to be some performance difference between Windows and Linux, but didn't expect it to be so huge.


hjk41 commented 8 years ago

Here are my results. python train_mnist.py:

GTX980, Windows 10:    20000 samples/sec
Titan, Ubuntu 14.04:   40000 samples/sec

Also, the CPU runs much faster in this case, at around 50000 samples/sec.

python train_cifar10.py:

GTX980, Windows 10:    396 samples/sec
Titan, Ubuntu 14.04:   445 samples/sec

So my guess is that Windows has higher overhead for small GPU operations. In MNIST the computation is so light that this overhead dominates, and thus CPU > GPU-Linux > GPU-Windows. In Cifar10 each operation does far more work, so the difference is much smaller. Has anyone tried running heavier workloads like ImageNet?
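One rough way to check whether per-operation launch overhead is the culprit is a micro-benchmark like the sketch below (shapes and iteration counts are arbitrary choices, and asnumpy() is only there to force synchronization before the timer stops):

import time
import mxnet as mx

def bench(shape, n_ops, ctx):
    x = mx.nd.ones(shape, ctx)
    start = time.time()
    for _ in range(n_ops):
        x = x * 2.0 + 1.0    # two small element-wise kernels per iteration
    x.asnumpy()              # block until all queued GPU work has finished
    return time.time() - start

ctx = mx.gpu(0)
print 'many small ops:', bench((32, 32), 2000, ctx)
print 'few large ops :', bench((2048, 2048), 20, ctx)

If Windows shows a much larger gap than Linux on the first line but not on the second, launch overhead would be the likely explanation.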

gpapadop79 commented 8 years ago

@Quares According to https://github.com/dmlc/mxnet/pull/1449, cuDNN 4 is supported.

Quares commented 8 years ago

@gpapadop79 Cool, great to know! I was under the impression about cuDNN 3 because the documentation doesn't mention cuDNN 4 yet. I am interested to see how the performance changes between cuDNN 3 and cuDNN 4. I am running a GTX 660 Ti and a GTX 980 Ti on two separate machines, so I have a nice overview of the performance difference between the two cards.

xenmind commented 8 years ago

Hi,

I installed the GPU-enabled R library (R version 3.2.3) on Windows 7 today. It looks like it's working on the CPU but not on my GPU. The code seems to execute on the GPU (confirmed with GPU-Z), but error improvement stalls in the second round using the example code (as in the issue merged above), and mx.nd.ones(c(2,3), mx.gpu()) generates a table of 0's, not 1's.

I'm using the latest files for everything, and the precompiled GPU package for R. I read in this thread, https://github.com/dmlc/mxnet/issues/250, that you should remove USE_CUDNN to compile for CUDA compute capability 2.1 (and lower) GPUs. I'm using a 2.1 GPU. Could this be the problem?

Could using an earlier 2015 release be a solution? Or might I have to compile my own GPU-enabled files without USE_CUDNN to fix this? I'm hoping I don't have to upgrade my computer to get this working, as I'm only doing preliminary testing.

Any help would be appreciated.

Thanks, Gavin.

hjk41 commented 8 years ago

I haven't tested it on a GPU with compute capability 2.1. I guess the pre-built binary may not work for you, since it is compiled for compute capability 3.5. Could you try to compile from source and see if it works?
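For the Makefile build, the relevant switches in make/config.mk would look roughly like this (a sketch; the CMake build used for the Windows packages exposes similarly named options, but check the files in your checkout):

# config.mk excerpt
USE_CUDA = 1
USE_CUDA_PATH = /usr/local/cuda   # path to your CUDA installation
USE_CUDNN = 0                     # cuDNN needs compute capability >= 3.0, so
                                  # disable it when targeting a 2.1 GPU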


xenmind commented 8 years ago

Will do, thanks.