autumnai / leaf-examples

Examples for the Hacker's Machine Learning Framework Leaf

NN fails to learn #13

Closed alexandermorozov closed 8 years ago

alexandermorozov commented 8 years ago

After the first iteration the NN output gets stuck at the value 9:

leaf-examples$ cargo run --release --  mnist linear --batch-size 10 
     Running `target/release/leaf-examples mnist linear --batch-size 10`
Last sample: Prediction: 0, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 2/20 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 3/30 = 10.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 4/40 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 7/50 = 14.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 9/60 = 15.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 9/70 = 12.86%
Last sample: Prediction: 9, Target: 9 | Accuracy 10/80 = 12.50%
Last sample: Prediction: 9, Target: 6 | Accuracy 11/90 = 12.22%
Last sample: Prediction: 9, Target: 5 | Accuracy 11/100 = 11.00%
Last sample: Prediction: 9, Target: 9 | Accuracy 12/110 = 10.91%
Last sample: Prediction: 9, Target: 2 | Accuracy 13/120 = 10.83%
Last sample: Prediction: 9, Target: 3 | Accuracy 13/130 = 10.00%
Last sample: Prediction: 9, Target: 7 | Accuracy 14/140 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 14/150 = 9.33%

This happens nearly always, and the output is invariably 9 (excluding the first iteration). If a second process is started while the first is still running, then the second NN learns and reaches about 90% accuracy in all 3 models. Actually, the first process may be an example/benchmark from the main leaf repo or some other program that heavily uses CUDA. I've tried various combinations, and it looks like the things that matter are: 1) it should allocate a sizable memory chunk, 2) it should overwrite it with something. It doesn't matter if the first program is still running or in a stopped state.

This behavior can be explained if the memory where the coefficients are stored isn't initialized by leaf with small random values and just contains junk left over from other programs. If the junk is good enough, the NN learns, but suboptimally. If it's not, it's stuck. The next time, the program gets the same memory, so if it's stuck, it's stuck for good. But it would be a security hole if CUDA doesn't zero memory allocations, so my guess may be completely wrong.

Here is my setup:

I guess I'll try to code a small, simple NN over the weekend and check the coefficients at different computation stages.

Edit: formatting

hobofan commented 8 years ago

> But it would be a security hole if CUDA doesn't zero memory allocations, so my guess may be completely wrong.

As unintuitive as it seems, that's actually the case, and that behaviour recently got some more exposure (https://charliehorse55.wordpress.com/2016/01/09/how-nvidia-breaks-chrome-incognito/).

However, that shouldn't have any impact on the way Leaf learns. When the network is created, the weights of the Linear and Convolution layers are randomly initialized (see https://github.com/autumnai/leaf/blob/master/src/layers/common/linear.rs#L100), so the initial state of the memory shouldn't really matter. Maybe there is a problem with the filled weights not being synchronized correctly?
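
For context, "randomly initialized" means something along the lines of the sketch below: a small uniform fill scaled by the layer dimensions, done on the host before the buffer is synchronized to the device. This is only an illustration of the idea, not the code behind the linear.rs link above; `glorot_uniform` and the `rand` usage are my own assumptions.

```rust
// Illustration only: fill a weight buffer with small uniform random values,
// Glorot/Xavier-style (scaled by fan_in + fan_out). This is NOT the actual
// initializer in linear.rs -- just what "randomly initialized" means here.
extern crate rand;

fn glorot_uniform(fan_in: usize, fan_out: usize) -> Vec<f32> {
    let limit = (6.0f32 / (fan_in + fan_out) as f32).sqrt();
    (0..fan_in * fan_out)
        .map(|_| (rand::random::<f32>() * 2.0 - 1.0) * limit) // in [-limit, limit)
        .collect()
}

fn main() {
    // MNIST linear layer: 784 inputs, 10 classes.
    let weights = glorot_uniform(784, 10);
    // If this host-side buffer were never synchronized to the device, the GPU
    // would keep computing with whatever junk was in that memory before --
    // which would match the "stuck at 9" symptom.
    assert!(weights.iter().all(|w| w.is_finite() && w.abs() <= 1.0));
    println!("first weights: {:?}", &weights[..5]);
}
```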

Generally I would assume that when one of the examples doesn't learn it's due to bad hyperparameters (batch-size, learning-rate, etc.), but your findings certainly are interesting and I'll look into it.

KodrAus commented 8 years ago

I'm getting the same results on my setup:

target/release/leaf-examples mnist linear --batch-size 10
Last sample: Prediction: 2, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 2/20 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 3/30 = 10.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 4/40 = 10.00%
Last sample: Prediction: 9, Target: 3 | Accuracy 7/50 = 14.00%
Last sample: Prediction: 9, Target: 4 | Accuracy 9/60 = 15.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 9/70 = 12.86%
Last sample: Prediction: 9, Target: 9 | Accuracy 10/80 = 12.50%
Last sample: Prediction: 9, Target: 6 | Accuracy 11/90 = 12.22%
...
CUDA version 7.5.18
rustc 1.9.0-nightly
cudnn v4
Nvidia GTX Titan X

hobofan commented 8 years ago

I didn't get around to it on the weekend, but I was able to run it now, and it learned correctly on the first try:

cargo run --release --  mnist linear --batch-size 10 
   Compiling collenchyma v0.0.8
   Compiling collenchyma-nn v0.3.4
   Compiling collenchyma-blas v0.2.0
   Compiling leaf v0.2.0
   Compiling leaf-examples v0.1.0 (file:///home/hobofan/autumn/leaf-examples)
     Running `target/release/leaf-examples mnist linear --batch-size 10`
target/release/leaf-examples: /opt/cuda/lib64/libOpenCL.so.1: no version information available (required by target/release/leaf-examples)
Last sample: Prediction: 2, Target: 3 | Accuracy 1/10 = 10.00%
Last sample: Prediction: 3, Target: 4 | Accuracy 3/20 = 15.00%
Last sample: Prediction: 4, Target: 3 | Accuracy 4/30 = 13.33%
Last sample: Prediction: 1, Target: 1 | Accuracy 7/40 = 17.50%
Last sample: Prediction: 0, Target: 3 | Accuracy 10/50 = 20.00%
Last sample: Prediction: 2, Target: 4 | Accuracy 12/60 = 20.00%
Last sample: Prediction: 9, Target: 1 | Accuracy 15/70 = 21.43%
Last sample: Prediction: 0, Target: 9 | Accuracy 21/80 = 26.25%
Last sample: Prediction: 6, Target: 6 | Accuracy 26/90 = 28.89%
Last sample: Prediction: 0, Target: 5 | Accuracy 29/100 = 29.00%
Last sample: Prediction: 4, Target: 9 | Accuracy 33/110 = 30.00%
Last sample: Prediction: 3, Target: 2 | Accuracy 40/120 = 33.33%
Last sample: Prediction: 1, Target: 3 | Accuracy 46/130 = 35.38%
Last sample: Prediction: 7, Target: 7 | Accuracy 52/140 = 37.14%
Last sample: Prediction: 5, Target: 4 | Accuracy 56/150 = 37.33%
Last sample: Prediction: 7, Target: 8 | Accuracy 63/160 = 39.38%
Last sample: Prediction: 9, Target: 9 | Accuracy 69/170 = 40.59%

Rust 1.7.0-stable CUDA version 7.5.17 cuDNN v4 NVIDIA GT 750M (2GB RAM)

EDIT:

It also works with my other machine: Rust 1.5.0-stable CUDA version 7.5.17 cuDNN v4 NVIDIA Titan X

KodrAus commented 8 years ago

Hmm, I'll try using the same CUDA and Rust versions as you and see if it changes my results. Will edit with details.

EDIT: No combination of driver or cuda versions seems to work for me:

Ubuntu 15.10
Rust 1.7.0 Stable

nvidia-361.28 (os)
nvidia-352.79 (prop)
nvidia-352.63 (prop)

MarcoPolo commented 8 years ago

Same results here (not learning and always predicting 9)

Machine info:

Rust 1.7 stable Ubuntu 14.04 CUDA v7.5.17 cuDNN v4 nvidia GTX 680

alexandermorozov commented 8 years ago

Yesterday I got it to learn from the first try with the linear net. The second run also worked. Then I switched to conv and it always returned 9. After that, subsequent runs of the linear net returned 9 too.

I've simplified this example a bit by reducing the input dimension to 1 and autogenerating training samples; the code is here. It has the same behaviour -- sometimes it gets stuck, sometimes it doesn't. The effect doesn't depend on the number of layers or batch sizes -- I've got the same thing with only one linear layer and batch_size=1. In cases where it gets stuck, the output of the nll layer after the first generation contains some sensible values. On later generations it degrades to all NaNs, even if learning_rate=0 and the values shouldn't change.
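
To make the learning_rate=0 point concrete, here is a plain-Rust sketch (no Leaf, no GPU) of the same shape of experiment: one input, one linear unit, batch size 1, autogenerated samples. The data-generation rule and the squared-error loss are stand-ins of my own; the point is only that with learning_rate = 0 the parameters cannot change, so any change or NaN has to come from somewhere else.

```rust
// Plain-Rust sketch of the simplified repro described above: one input,
// one linear unit, batch size 1, autogenerated samples. No Leaf involved --
// this only shows what the parameters *should* do when learning_rate = 0.
fn main() {
    let learning_rate = 0.0f32;
    let (mut w, mut b) = (0.1f32, 0.0f32); // small, finite initial values

    for step in 0..100 {
        // autogenerated sample: x in [0, 1), target = 1.0 if x > 0.5, else 0.0
        let x = (step as f32 * 0.0137) % 1.0;
        let target = if x > 0.5 { 1.0 } else { 0.0 };

        // forward pass + squared-error gradient for a single linear unit
        let y = w * x + b;
        let grad = 2.0 * (y - target);

        // SGD update; with learning_rate = 0 this is a no-op
        w -= learning_rate * grad * x;
        b -= learning_rate * grad;

        assert!(w.is_finite() && b.is_finite(), "NaN at step {}", step);
    }
    assert_eq!((w, b), (0.1, 0.0)); // untouched when learning_rate = 0
    println!("parameters unchanged: w = {}, b = {}", w, b);
}
```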

I'm currently looking into how to dump intermediate values and weights to find out when they turn to NaNs. I've got a bit more time now, so hopefully I'll figure it out this time.
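
A host-side check for that dump could be as simple as the sketch below; `check_finite` and the tensor names are hypothetical, and the values would have to be copied back from the device first.

```rust
// Hypothetical host-side helper for the kind of dump described above:
// given a tensor's values copied back to the host, report whether (and where)
// they first become non-finite. Not part of Leaf; the names are made up.
fn check_finite(name: &str, iteration: usize, values: &[f32]) {
    if let Some((idx, v)) = values
        .iter()
        .enumerate()
        .find(|&(_, v)| !v.is_finite())
    {
        println!(
            "iteration {}: {}[{}] is non-finite ({})",
            iteration, name, idx, v
        );
    }
}

fn main() {
    // usage sketch: call after each forward/backward pass
    check_finite("linear1.weights", 0, &[0.01, -0.02, 0.005]);
    check_finite("nll.output", 1, &[0.3, std::f32::NAN, 0.1]);
}
```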

KodrAus commented 8 years ago

@alexandermorozov I'm getting the same NaN results as you on your test code; so far I haven't been able to get any nets to learn.

On another note, I had to add a build.rs to your test code to get it to link the cu* libraries properly on my machine. How have you got CUDA set up on your machine?

alexandermorozov commented 8 years ago

> I'm getting the same NaN results as you on your test code; so far I haven't been able to get any nets to learn.

You can try starting two tasks simultaneously. It generally works for me: the second task learns more often than not. Though it's difficult to tell if the net works as expected: half of the neurons might be dead and the net may still learn somewhat.

> On another note, I had to add a build.rs to your test code to get it to link the cu* libraries properly on my machine. How have you got CUDA set up on your machine?

I'm on Debian testing; common CUDA packages are installed from the distro repos. libcudnn.so* are manually placed in /usr/local/lib, cudnn.h in /usr/local/include. More importantly, Rust switched its linker from ld to ld.gold about 3 months ago, and ld.gold doesn't search /usr/local/lib by default, so the environment variable should be set like this: export LIBRARY_PATH="/usr/local/lib". If this doesn't help, can you post the error message or the contents of your build.rs? It may be better to create another issue to stay on topic here.
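
For completeness, the build.rs route @KodrAus mentioned can be a one-line build script that adds the extra search path; this is a generic cargo sketch assuming the libraries really are in /usr/local/lib, not the exact file he used:

```rust
// build.rs -- point rustc (and ld.gold) at /usr/local/lib so that
// libcudnn.so and friends are found; adjust the path for your setup.
fn main() {
    println!("cargo:rustc-link-search=native=/usr/local/lib");
}
```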