autumnai / leaf

Open Machine Intelligence Framework for Hackers. (GPU/CPU)

Use GPU for SGD weight update calculations #88

Open alexandermorozov opened 8 years ago

alexandermorozov commented 8 years ago

Currently weight updates are calculated on the Native backend. Profiling shows that about 40% of CPU time is spent doing the corresponding BLAS operations. Another 40% is in an area without debug info; quite likely that's the nvidia driver doing i/o. At the same time, according to nvidia-smi, GPU load is only about 20% even on my relatively slow GTX 960.

I think it's possible to get a 3x-5x speedup if weight updates are implemented on the GPU. It should be quite easy, since the update is a simple BLAS operation y = a * x + b * y, where a and b are scalars and x and y are tensors of equal dimensions.
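For reference, a minimal plain-Rust sketch of that update (not leaf's actual API); the point is that the whole step is a single fused scale-and-add over two tensors, which maps directly onto one GPU BLAS call:

// Minimal sketch of the update y = a * x + b * y (not leaf's actual API).
// Here `a` would be the negative learning rate, `b` the momentum factor,
// `x` the current weight gradients and `y` the persistent update history.
fn axpby(a: f32, x: &[f32], b: f32, y: &mut [f32]) {
    assert_eq!(x.len(), y.len());
    for (yi, &xi) in y.iter_mut().zip(x.iter()) {
        *yi = a * xi + b * *yi;
    }
}

fn main() {
    let gradients = vec![0.5, -0.25, 1.0];
    let mut history = vec![0.0f32; 3];
    // One momentum step: history = -lr * gradients + momentum * history
    axpby(-0.01, &gradients, 0.03, &mut history);
    println!("{:?}", history); // approximately [-0.005, 0.0025, -0.01]
}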

alexandermorozov commented 8 years ago

I've hacked in the Cuda backend in a dirty way, but performance is a lot worse. perf top shows a lot of time spent in sync() in the cuda lib. Possibly performance degrades because the scalars lr and momentum are reallocated each time.
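For illustration, the caching idea would look roughly like this; the types and names below are hypothetical stand-ins rather than the actual leaf/collenchyma API, the point is just to allocate the device scalars once and reuse them on every call:

// Hypothetical sketch of caching device-side scalars across update calls.
struct DeviceScalar(f32); // stand-in for a scalar tensor living on the GPU

struct Momentum {
    lr: Option<DeviceScalar>,
    momentum: Option<DeviceScalar>,
}

impl Momentum {
    fn compute_update_value(&mut self, lr: f32, momentum: f32) {
        // Allocate on first use only; later iterations reuse the cached scalars
        // instead of creating (and synchronizing on) fresh ones every time.
        let lr_scalar = self.lr.get_or_insert_with(|| DeviceScalar(lr));
        let momentum_scalar = self.momentum.get_or_insert_with(|| DeviceScalar(momentum));
        // ... the BLAS update y = lr * x + momentum * y would be issued here ...
        let _ = (lr_scalar, momentum_scalar);
    }
}

fn main() {
    let mut solver = Momentum { lr: None, momentum: None };
    solver.compute_update_value(0.01, 0.03); // allocates the scalars
    solver.compute_update_value(0.01, 0.03); // reuses them
}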

I've tried to add those scalars to struct Momentum, but I get a panic:

 Running `target/debug/leaf-examples mnist mlp --batch-size 20 --momentum 0.03`
local_lr 0.03
local_lr 0.03
Last sample: Prediction: 5, Target: 4 | Accuracy 2/20 = 10.00%
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Operation("Unable to execute operation gemm."))', ../src/libcore/result.rs:746
stack backtrace:
   1:     0x5555ef04d5f0 - sys::backtrace::tracing::imp::write::h1df2fc52f5a1077eehv
   2:     0x5555ef04fd7b - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44584
   3:     0x5555ef04f9e8 - panicking::default_handler::hdd51c136086dc3e0CWz
   4:     0x5555ef0454cc - sys_common::unwind::begin_unwind_inner::h3a281a3586db30a7x5t
   5:     0x5555ef045bf8 - sys_common::unwind::begin_unwind_fmt::heafd13fa24ecdc24D4t
   6:     0x5555ef04cba1 - rust_begin_unwind
   7:     0x5555ef07fcef - panicking::panic_fmt::h98b8cbb286f5298alcM
   8:     0x5555eed84e58 - result::unwrap_failed::h1490796611167060425
                        at ../src/libcore/macros.rs:29
   9:     0x5555eeda30b6 - result::Result<T, E>::unwrap::h4534188100940040184
                        at ../src/libcore/result.rs:687
  10:     0x5555eedb8afc - layers::common::linear::Linear.ComputeOutput<f32, B>::compute_output::h10814396721572671422
                        at /home/butler/projects/nn/leaf/src/layers/common/linear.rs:124
  11:     0x5555eedb562f - layer::ILayer::forward::h4987377018252282577
                        at /home/butler/projects/nn/leaf/src/layer.rs:769
  12:     0x5555eee0e55e - layer::Layer<B>::forward::h1519220139108033296
                        at /home/butler/projects/nn/leaf/src/layer.rs:465
  13:     0x5555eee0cb21 - layers::common::sequential::Sequential<B>.ILayer<B>::forward::h5245413676380592566
                        at /home/butler/projects/nn/leaf/src/layers/common/sequential.rs:250
  14:     0x5555eee0e55e - layer::Layer<B>::forward::h1519220139108033296
                        at /home/butler/projects/nn/leaf/src/layer.rs:465
  15:     0x5555eee765c9 - solver::Solver<SolverB, B>::train_minibatch::h17553229467080863917
                        at /home/butler/projects/nn/leaf/src/solver/mod.rs:76
  16:     0x5555eed7b76e - run_mnist::h3be3c63c881614b9Mfa
                        at src/main.rs:191
  17:     0x5555eed6fe1f - main::h27645fdaeccde482Zea
                        at src/main.rs:84
  18:     0x5555ef04f644 - sys_common::unwind::try::try_fn::h7821988306635677941
  19:     0x5555ef04cb2b - __rust_try
  20:     0x5555ef04f0db - rt::lang_start::h582466266dfb2119IOz
  21:     0x5555eed7bc69 - main
  22:     0x7f1af8a2460f - __libc_start_main
  23:     0x5555eed6fac8 - _start
  24:                0x0 - <unknown>

I've tried to move the initialization from Momentum::new() into Momentum::compute_update_value() with no success. If this if let is commented out, the code works; if it's not, it fails after the first iteration. But the code in question does nothing more than allocate a scalar on the Cuda backend, it isn't even used anywhere! However, if the allocated scalar isn't stored into self.lr and is dropped at the end of the function, the code runs normally.

Any ideas?

Edit: there is cuda-memcheck, it may help me.
Edit 2: it looks like the Cuda framework cannot be instantiated more than once, while Native can and often is.
Edit 3: a preliminary version is here

Before patch:

Last sample: Prediction: 5, Target: 5 | Accuracy 947/1000 = 94.70%
Last sample: Prediction: 3, Target: 3 | Accuracy 946/1000 = 94.60%
Last sample: Prediction: 8, Target: 8 | Accuracy 947/1000 = 94.70%
./target/release/leaf-examples mnist mlp --batch-size 20 --momentum 0.03  58.25s user 2.30s system 94% cpu 1:04.03 total

After patch:

Last sample: Prediction: 5, Target: 5 | Accuracy 940/1000 = 94.00%
Last sample: Prediction: 3, Target: 3 | Accuracy 940/1000 = 94.00%
Last sample: Prediction: 8, Target: 8 | Accuracy 940/1000 = 94.00%
./target/release/leaf-examples mnist mlp --batch-size 20 --momentum 0.03  25.14s user 2.32s system 84% cpu 32.569 total

2x speedup and lower CPU usage, nice!