wukan1986 closed this issue 5 months ago
In https://github.com/rust-ndarray/ndarray-linalg I can run intel-mkl directly, but after struggling for half an hour I still can't get openblas to work:

```shell
cargo run --example solve --features=intel-mkl
cargo test --features=intel-mkl
```
Hi @wukan1986, thanks!

When you ran the benchmark, did you modify any code to add nulls and change the null policy, or did you run it as-is? The original code should do no null handling (ignore), and I wrote it in a way that introduces little overhead when null_policy is Ignore.

Assuming that's what you ran, the slowness is due to the code path on Linux using my own SVD solver instead of the optimized LAPACK one. If it's NaN handling in a custom example that's slowing things down, we can probably do something about that too.

Let me know exactly what you ran, because if the slowness comes from not using LAPACK (unlike macOS) and you managed to get Intel MKL to compile, we can do something nice for Linux to use LAPACK. I can guide you through what to change, and if we see that it's a lot faster I'll add conditional dependencies for Linux in the project.
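For intuition (this is a numpy sketch, not the project's actual Rust code), the two code paths being compared are an explicit SVD-based least-squares solve versus `np.linalg.lstsq`, which dispatches to LAPACK's optimized `dgelsd` routine. Both give the same solution; only the speed differs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = X @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(2000)

# Path 1: hand-rolled SVD least squares (pseudoinverse via SVD).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

# Path 2: LAPACK-backed dgelsd via numpy.
beta_lapack, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same answer either way; LAPACK is typically the faster path.
assert np.allclose(beta_svd, beta_lapack)
```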
Hi @azmyrajab
I did not modify any code, I just ran it directly.
Hi @wukan1986, can you please do me a favour and try the following code changes?

In your Cargo.toml, modify the dependencies to this block, which adds a conditional dependency on the Intel MKL ndarray-linalg for Linux:
```toml
[target.'cfg(target_os = "linux")'.dependencies]
jemallocator = { version = "0.5", features = ["disable_initial_exec_tls"] }
ndarray = { version = "*", features = ["matrixmultiply-threading"] }
ndarray-linalg = { version = "*", features = ["intel-mkl"] }

[target.'cfg(target_os = "macos")'.dependencies]
ndarray = { version = "*", features = ["blas"] }
ndarray-linalg = { version = "*" }

[target.'cfg(not(any(target_os = "macos", target_os = "linux")))'.dependencies]
ndarray = { version = "*", features = ["matrixmultiply-threading"] } # i.e. Windows
```
Then, in least_squares.rs, modify the imports so that both macOS and Linux try to import the LAPACK SVD trait:
```rust
#[cfg(any(target_os = "macos", target_os = "linux"))]
use ndarray_linalg::LeastSquaresSvd;
```
And finally, modify the block of functions defining solve_ols_svd, which map to the conditional dependencies we defined, to:
```rust
#[cfg(not(any(target_os = "macos", target_os = "linux")))]
fn solve_ols_svd(y: &Array1<f64>, x: &Array2<f64>, rcond: Option<f64>) -> Array1<f64> {
    // TODO: try to compute w/ LAPACK SVD. Must handle the BLAS dependency on Linux & Windows:
    // either use ndarray-linalg or directly call sgelsd from the lapack crate.
    solve_ridge_svd(y, x, 1.0e-64, rcond) // near-zero ridge penalty
}

#[cfg(any(target_os = "macos", target_os = "linux"))]
#[allow(unused_variables)]
fn solve_ols_svd(y: &Array1<f64>, x: &Array2<f64>, rcond: Option<f64>) -> Array1<f64> {
    x.least_squares(y)
        .expect("Failed to compute LAPACK SVD solution!")
        .solution
}
```
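As an aside on why the near-zero ridge penalty fallback works: an SVD-based ridge solution shrinks each singular value `s` to `s / (s^2 + alpha)`, which converges to the plain least-squares solution as `alpha` approaches zero. A minimal numpy sketch (not the project's Rust implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(200)

def ridge_svd(y, X, alpha):
    """Ridge regression via SVD: replace 1/s with s / (s^2 + alpha)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_ridge = ridge_svd(y, X, 1.0e-64)  # near-zero penalty mimics plain OLS
assert np.allclose(beta_ridge, beta_ols)
```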
That should hopefully allow the faster LAPACK method on your system. Obviously, if this works for you, I'll try to upstream it into the next release, although it's a bit trickier on my end since I have to deal with Linux on many platforms, and Intel MKL did not run smoothly on all of them when I last tested...
Then could you please run `maturin develop --release`, re-run your benchmark, and let me know what you see?
@wukan1986 - or you could just check out the latest from git; I added some changes quickly there and they seem to pass the CI tests. I see roughly a 6-8x speed-up on the Linux CI benchmark compared to a prior CI run:
```
benchmark_least_squares_svd: Mean +- std dev: 8.49 ms +- 0.09 ms  (CI x86_64 linux, old)
benchmark_least_squares_svd: Mean +- std dev: 1.32 ms +- 0.01 ms  (CI x86_64 linux, new)
```
Let me know what you see whenever you get around to it
Here is my history:
```
 769  rm -rf polars_ols/
 770  git clone --depth=1 https://github.com/azmyrajab/polars_ols.git
 771  ls
 772  cd polars_ols
 773  code .
 774  maturin develop --release
 775  python tests/benchmark.py --quiet --fast
 776  history
```
The output is:

```
# new
(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 452 us +- 9 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 377 us +- 62 us
```
Compared to before, it's much faster:

```
# old
(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 521 us +- 8 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 359 us +- 52 us
```
Interesting, thanks for sharing that. For me, on macOS, it runs faster than numpy, maybe because of the BLAS backend there - but at least we moved in the right direction.

I have a few more ideas for further speed-ups (and for removing your Intel MKL dependency), but they will probably take a few days to get done correctly. I'll leave you with this configuration in the meantime and keep you posted once I have something ready for you to try next!
Hi @wukan1986 - some progress here: part 1 of trying to make things faster: https://github.com/azmyrajab/polars_ols/pull/10

I measured runtime carefully and noticed that a significant proportion of time was spent in the thin Python layer before the actual Rust functions are invoked. For example, casting columns to Float64 and applying targets / sqrt_w (even when sqrt_w is 1.0) was adding some overhead. This PR moves all of that to Rust, which speeds things up slightly (roughly ~7% in my benchmarks).
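To illustrate the kind of per-call pre-processing overhead being described (a hypothetical numpy sketch, not the actual polars_ols Python layer - the `preprocess` helper and its arguments are made up for illustration):

```python
import numpy as np

def preprocess(y, X, sqrt_w=None):
    """Hypothetical Python-side prep: dtype casts plus optional reweighting.

    Every call pays for the float64 casts and an elementwise multiply,
    even when sqrt_w is all ones - this is the kind of work that is
    cheaper done once inside the Rust layer.
    """
    y64 = np.asarray(y, dtype=np.float64)
    X64 = np.asarray(X, dtype=np.float64)
    if sqrt_w is not None:
        y64 = y64 * sqrt_w
        X64 = X64 * sqrt_w[:, None]
    return y64, X64

y = np.ones(4, dtype=np.float32)
X = np.ones((4, 2), dtype=np.float32)
y64, X64 = preprocess(y, X, sqrt_w=np.ones(4))
assert y64.dtype == np.float64 and X64.dtype == np.float64
```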
Thank you, I tested it and it's a bit faster:

```
(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 422 us +- 17 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 341 us +- 60 us
```
Hi @wukan1986 - I've made a few updates since, and tried my best to speed things up for Linux without too much platform-specific optimisation. I think I've reached the point where most of the measured bottleneck is the LAPACK DGELSD Fortran code path, which I can't optimize beyond changing which acceleration library the system links LAPACK to. For Linux, polars-ols now links LAPACK to intel-mkl, which should be a decent option. For reference, numpy typically links to either openblas or intel-mkl on most distros (run np.show_config() to check).
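To see which BLAS/LAPACK backend your own numpy build links against:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against; look for
# "openblas" or "mkl" in the output to identify the backend.
np.show_config()
```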
On my EC2 Linux test machine (so Intel MKL), SVD is roughly 20% faster than np.linalg.lstsq under a rigorous benchmark.

The GitHub CI server [linux benchmark](https://github.com/azmyrajab/polars_ols/actions/runs/8786320824/job/24108926549) (2000 samples, 5 features) shows around an ~11% improvement over np.linalg.lstsq:
```
venv/bin/python tests/benchmark.py --quiet --rigorous
benchmark_least_squares_qr: Mean +- std dev: 472 us +- 11 us
benchmark_least_squares_svd: Mean +- std dev: 460 us +- 5 us
benchmark_least_squares_numpy_qr: Mean +- std dev: 490 us +- 101 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 519 us +- 106 us
```
On my Macbook, which uses Apple's Accelerate framework (2000 samples, 5 features), it's roughly ~33% faster:
```
(venv) azmy@myhost polars_ols % python tests/benchmark.py --quiet --rigorous
benchmark_least_squares_qr: Mean +- std dev: 256 us +- 3 us
benchmark_least_squares_svd: Mean +- std dev: 309 us +- 4 us
benchmark_least_squares_numpy_qr: Mean +- std dev: 518 us +- 201 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 464 us +- 150 us
```
On my Macbook again, but with larger data (10_000 samples, 100 features), the gap tightens to ~15% faster:
```
(venv) azmy@myhost polars_ols % python tests/benchmark.py --quiet --rigorous
benchmark_least_squares_qr: Mean +- std dev: 17.2 ms +- 0.7 ms
benchmark_least_squares_svd: Mean +- std dev: 23.8 ms +- 0.5 ms
benchmark_least_squares_numpy_qr: Mean +- std dev: 51.8 ms +- 18.4 ms
benchmark_least_squares_numpy_svd: Mean +- std dev: 28.1 ms +- 8.3 ms
```
Hopefully you see something at least in line with numpy whenever you get a chance to re-run. In the meantime, I'll close the issue for now, but let me know if you see anything odd.
Thank you for your work.

After doing a lot of null processing, the speed is much slower than numpy. Can we speed it up?