alexandermorozov opened 8 years ago
I've hacked in a `Cuda` backend in a dirty way, but performance is a lot worse: `perf top` shows a lot of time spent in `sync()` in the cuda lib. Possibly performance degrades because the scalars `lr` and `momentum` are reallocated on every iteration. I've tried to add those scalars to the `Momentum` struct, but I get a panic:

```
Running `target/debug/leaf-examples mnist mlp --batch-size 20 --momentum 0.03`
local_lr 0.03
local_lr 0.03
Last sample: Prediction: 5, Target: 4 | Accuracy 2/20 = 10.00%
thread '<main>' panicked at 'called `Result::unwrap()` on an `Err` value: Plugin(Operation("Unable to execute operation gemm."))', ../src/libcore/result.rs:746
stack backtrace:
1: 0x5555ef04d5f0 - sys::backtrace::tracing::imp::write::h1df2fc52f5a1077eehv
2: 0x5555ef04fd7b - panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::closure.44584
3: 0x5555ef04f9e8 - panicking::default_handler::hdd51c136086dc3e0CWz
4: 0x5555ef0454cc - sys_common::unwind::begin_unwind_inner::h3a281a3586db30a7x5t
5: 0x5555ef045bf8 - sys_common::unwind::begin_unwind_fmt::heafd13fa24ecdc24D4t
6: 0x5555ef04cba1 - rust_begin_unwind
7: 0x5555ef07fcef - panicking::panic_fmt::h98b8cbb286f5298alcM
8: 0x5555eed84e58 - result::unwrap_failed::h1490796611167060425
at ../src/libcore/macros.rs:29
9: 0x5555eeda30b6 - result::Result<T, E>::unwrap::h4534188100940040184
at ../src/libcore/result.rs:687
10: 0x5555eedb8afc - layers::common::linear::Linear.ComputeOutput<f32, B>::compute_output::h10814396721572671422
at /home/butler/projects/nn/leaf/src/layers/common/linear.rs:124
11: 0x5555eedb562f - layer::ILayer::forward::h4987377018252282577
at /home/butler/projects/nn/leaf/src/layer.rs:769
12: 0x5555eee0e55e - layer::Layer<B>::forward::h1519220139108033296
at /home/butler/projects/nn/leaf/src/layer.rs:465
13: 0x5555eee0cb21 - layers::common::sequential::Sequential<B>.ILayer<B>::forward::h5245413676380592566
at /home/butler/projects/nn/leaf/src/layers/common/sequential.rs:250
14: 0x5555eee0e55e - layer::Layer<B>::forward::h1519220139108033296
at /home/butler/projects/nn/leaf/src/layer.rs:465
15: 0x5555eee765c9 - solver::Solver<SolverB, B>::train_minibatch::h17553229467080863917
at /home/butler/projects/nn/leaf/src/solver/mod.rs:76
16: 0x5555eed7b76e - run_mnist::h3be3c63c881614b9Mfa
at src/main.rs:191
17: 0x5555eed6fe1f - main::h27645fdaeccde482Zea
at src/main.rs:84
18: 0x5555ef04f644 - sys_common::unwind::try::try_fn::h7821988306635677941
19: 0x5555ef04cb2b - __rust_try
20: 0x5555ef04f0db - rt::lang_start::h582466266dfb2119IOz
21: 0x5555eed7bc69 - main
22: 0x7f1af8a2460f - __libc_start_main
23: 0x5555eed6fac8 - _start
24: 0x0 - <unknown>
```

I've tried to move the initialization from `Momentum::new()` into `Momentum::compute_update_value()`, with no success. If this `if let` is commented out, the code works; if it isn't, it fails after the first iteration. But the code in question does nothing more than allocate a scalar on the Cuda backend; the scalar isn't even used anywhere! Yet if the allocated scalar isn't stored into `self.lr` and is instead dropped at the end of the function, the code runs normally.
Any ideas?
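For reference, here's a minimal model of the pattern in plain Rust. This is not the real collenchyma `SharedTensor` API; `ScalarTensor` and the field name are made up for illustration:

```rust
// Stand-in for a 1-element tensor allocated on the Cuda backend.
// NOT the real collenchyma type, just a model of the pattern.
struct ScalarTensor(f32);

struct Momentum {
    // Cached learning-rate scalar; `None` until first use.
    lr: Option<ScalarTensor>,
}

impl Momentum {
    fn compute_update_value(&mut self, local_lr: f32) {
        // Variant A (works): allocate the scalar on every call and let it
        // drop at the end of the function. Correct, but the repeated
        // allocation shows up as sync() overhead in `perf top`.
        let _temporary = ScalarTensor(local_lr);

        // Variant B (panics in my branch): allocate once and keep the
        // scalar alive across iterations. The stored value isn't even
        // used anywhere yet, but merely holding on to it makes the next
        // gemm fail with "Unable to execute operation gemm."
        if self.lr.is_none() {
            self.lr = Some(ScalarTensor(local_lr));
        }
    }
}

fn main() {
    let mut momentum = Momentum { lr: None };
    momentum.compute_update_value(0.03);
}
```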
Edit: there is `cuda-memcheck`; it may help me.
Edit 2: it looks like the `Cuda` framework cannot be instantiated more than once, while `Native` can be and often is.
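If that's the case, the workaround is presumably to construct the Cuda backend exactly once and share it, the way the Leaf examples share a backend via `Rc`. A sketch of the pattern (module paths are from memory and may not match collenchyma exactly):

```rust
extern crate collenchyma as co;

use std::rc::Rc;
use co::backend::Backend;
use co::frameworks::Cuda;

fn main() {
    // Instantiate the Cuda framework exactly once...
    let backend = Rc::new(Backend::<Cuda>::default().unwrap());
    // ...and hand the same instance to every consumer (layers, solver),
    // instead of constructing a second Backend::<Cuda>.
    let _solver_backend = backend.clone();
}
```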
Edit 3: a preliminary version is here.

Before patch:

```
Last sample: Prediction: 5, Target: 5 | Accuracy 947/1000 = 94.70%
Last sample: Prediction: 3, Target: 3 | Accuracy 946/1000 = 94.60%
Last sample: Prediction: 8, Target: 8 | Accuracy 947/1000 = 94.70%
./target/release/leaf-examples mnist mlp --batch-size 20 --momentum 0.03 58.25s user 2.30s system 94% cpu 1:04.03 total
```

After patch:

```
Last sample: Prediction: 5, Target: 5 | Accuracy 940/1000 = 94.00%
Last sample: Prediction: 3, Target: 3 | Accuracy 940/1000 = 94.00%
Last sample: Prediction: 8, Target: 8 | Accuracy 940/1000 = 94.00%
./target/release/leaf-examples mnist mlp --batch-size 20 --momentum 0.03 25.14s user 2.32s system 84% cpu 32.569 total
```

2x speedup and lower CPU usage, nice!
Currently weight updates are calculated on the `Native` backend. Profiling shows that about 40% of CPU time is spent doing the corresponding BLAS operations. Another 40% is in an area without debug info; quite likely that's the nvidia driver doing I/O. At the same time, according to `nvidia-smi`, GPU load is about 20% even on my relatively slow GTX 960. I think it's possible to get a 3x-5x speedup if weight updates are implemented on the GPU. It should be quite easy, since the update is a simple BLAS operation `y = a * x + b * y`, where `a` and `b` are scalars and `x` and `y` are tensors of equal dimensions.
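For concreteness, here's that update written out on plain slices; this is just the math, not Leaf code. On the GPU it would map onto the backend's BLAS plugin, decomposing into a `scal` followed by an `axpy`:

```rust
/// y = a * x + b * y, i.e. the whole momentum weight update.
/// On a GPU BLAS this decomposes into two level-1 calls:
///   scal(b, y)     // y := b * y
///   axpy(a, x, y)  // y := a * x + y
fn axpby(a: f32, x: &[f32], b: f32, y: &mut [f32]) {
    assert_eq!(x.len(), y.len());
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi = a * xi + b * *yi;
    }
}

fn main() {
    let gradient = vec![1.0_f32, 2.0, 3.0];   // x: current gradient
    let mut update = vec![0.5_f32, 0.5, 0.5]; // y: previous update, in place
    // lr and momentum play the roles of the scalars a and b here.
    axpby(0.03, &gradient, 0.9, &mut update);
    assert_eq!(update[0], 0.03 * 1.0 + 0.9 * 0.5);
}
```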