azmyrajab / polars_ols

Polars least squares extension - enables fast linear model polars expressions
MIT License

Can we speed it up? #7

Closed wukan1986 closed 5 months ago

wukan1986 commented 5 months ago

Thank you for your work.

Because it does a lot of null processing, it is much slower than numpy. Can we speed it up?

(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 521 us +- 8 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 359 us +- 52 us
wukan1986 commented 5 months ago

In https://github.com/rust-ndarray/ndarray-linalg I can run intel-mkl directly, but after struggling for half an hour, I still can't get openblas to run:

cargo run --example solve --features=intel-mkl
cargo test --features=intel-mkl
azmyrajab commented 5 months ago

Hi @wukan1986, thanks

When you ran the benchmark, did you modify any code to add nulls and change the null policy, or did you run it as-is?

The original code should do no null handling (the default policy is Ignore), and I wrote it so that it introduces little overhead when null_policy is Ignore.

Assuming that's what you ran too, the slowness is likely due to the Linux code path using my own SVD solver instead of the optimized LAPACK one.

If instead it's NaN handling in a custom example that's slowing things down, I can probably do something about that too.
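For context, a "drop"-style null policy can be sketched in numpy; the extra pass needed to mask out NaN rows before solving is the kind of overhead being discussed here (a minimal illustrative sketch, not polars_ols internals):

```python
import numpy as np

# Toy data with some NaN targets (illustrative only; not polars_ols internals).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
y[::10] = np.nan  # introduce nulls

# "Drop" style policy: mask rows containing any NaN before solving.
mask = ~(np.isnan(y) | np.isnan(x).any(axis=1))
coef, *_ = np.linalg.lstsq(x[mask], y[mask], rcond=None)
```

With an "ignore" policy this masking pass is skipped entirely, which is why it should add essentially no overhead.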

Let me know what you ran exactly. If the slowness comes from not using LAPACK (as macOS does) and you managed to get Intel MKL to compile, we can do something nice for Linux and use LAPACK there too. I can guide you through what to change, and if it turns out to be a lot faster I'll add conditional dependencies for Linux to the project.

wukan1986 commented 5 months ago

Hi @azmyrajab

I did not modify any code, just ran it directly.

azmyrajab commented 5 months ago

Hi @wukan1986 can you please do me a favour and try the following code changes -

In your Cargo.toml, modify the dependencies to this block, which adds a conditional dependency on the Intel MKL ndarray-linalg for Linux:

[target.'cfg(target_os = "linux")'.dependencies]
jemallocator = { version = "0.5", features = ["disable_initial_exec_tls"] }
ndarray = { version = "*", features = ["matrixmultiply-threading"]}
ndarray-linalg = {version = "*", features = ["intel-mkl"]}

[target.'cfg(target_os = "macos")'.dependencies]
ndarray = { version = "*", features = ["blas"]}
ndarray-linalg = { version = "*" }

[target.'cfg(not(any(target_os = "macos", target_os = "linux")))'.dependencies]
ndarray = { version = "*", features = ["matrixmultiply-threading"]}  # i.e. windows

Then, in least_squares.rs, modify the imports so that both macOS and Linux import the LAPACK SVD trait:

#[cfg(any(target_os = "macos", target_os = "linux"))]
use ndarray_linalg::LeastSquaresSvd;

And finally, modify the block of solve_ols_svd functions, which map onto the conditional dependencies we defined, to:

#[cfg(not(any(target_os = "macos", target_os = "linux")))]
fn solve_ols_svd(y: &Array1<f64>, x: &Array2<f64>, rcond: Option<f64>) -> Array1<f64> {
    // TODO: try to compute w/ LAPACK SVD. Must handle BLAS dependency on linux & windows OS
    //      either use ndarray-linalg or directly call sgelsd from lapack crate..
    solve_ridge_svd(y, x, 1.0e-64, rcond) // near zero ridge penalty
}

#[cfg(any(target_os = "macos", target_os = "linux"))]
#[allow(unused_variables)]
fn solve_ols_svd(y: &Array1<f64>, x: &Array2<f64>, rcond: Option<f64>) -> Array1<f64> {
    x.least_squares(y)
        .expect("Failed to compute LAPACK SVD solution!")
        .solution
}
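For reference, this LAPACK path computes the same minimum-norm SVD least-squares solution that numpy's np.linalg.lstsq (which also dispatches to LAPACK's ?gelsd) returns; a rough Python equivalent, for illustration only:

```python
import numpy as np

def solve_ols_svd_np(y: np.ndarray, x: np.ndarray, rcond=None) -> np.ndarray:
    """Minimum-norm least-squares solution via SVD (LAPACK ?gelsd),
    analogous in spirit to the Rust solve_ols_svd above."""
    coef, *_ = np.linalg.lstsq(x, y, rcond=rcond)
    return coef

# Small worked example: x @ [1, 2] reproduces y exactly.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
coef = solve_ols_svd_np(y=y, x=x)
```

This is why the benchmark comparison against benchmark_least_squares_numpy_svd is apples-to-apples once both link a fast LAPACK.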

That should hopefully enable the faster LAPACK method on your system. Obviously, if what you do works, I'll try to upstream it into the next release - although that is a bit trickier on my end, as I have to support Linux on many platforms, and Intel MKL did not run smoothly on all of them when I last tested it.

Could you then please run maturin develop --release, re-run your benchmark, and let me know what you see?

azmyrajab commented 5 months ago

@wukan1986 - or you could just check out the latest from git; I quickly added the changes there and they seem to pass the CI tests.

Comparing against a prior CI run, I see roughly a 6-8x speed-up on the Linux CI benchmarks:

benchmark_least_squares_svd: Mean +- std dev: 8.49 ms +- 0.09 ms (CI x86_64 linux old)
benchmark_least_squares_svd: Mean +- std dev: 1.32 ms +- 0.01 ms  (CI x86_64 linux new)

Let me know what you see whenever you get around to it

wukan1986 commented 5 months ago

Here is my shell history:

  769  rm -rf polars_ols/
  770  git clone --depth=1 https://github.com/azmyrajab/polars_ols.git
  771  ls
  772  cd polars_ols
  773  code .
  774  maturin develop --release
  775  python tests/benchmark.py --quiet --fast
  776  history

The output is:

# new
(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 452 us +- 9 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 377 us +- 62 us

Compared to before, it's much faster.

# old
(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 521 us +- 8 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 359 us +- 52 us
azmyrajab commented 5 months ago

Interesting, thanks for sharing that. For me, on macOS, it runs faster than numpy, maybe because of the BLAS backend there - but at least we moved in the right direction.

I have a few more ideas for further speed-ups (and for removing your Intel MKL dependency), but they will probably take a few days to get done correctly.

I'll leave you with this configuration in the meantime and keep you posted once I have something ready for you to try next!

azmyrajab commented 5 months ago

Hi @wukan1986 - some progress here: part 1 of trying to make things faster https://github.com/azmyrajab/polars_ols/pull/10

I measured runtime carefully and noticed that a significant proportion of time was spent in the thin Python layer before the actual Rust functions are invoked. For example, casting columns to Float64 and applying targets / sqrt_w (even when sqrt_w is 1.0) added some overhead. So this PR moves all of that to Rust, which speeds things up slightly (roughly ~7% in my benchmarks).
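For background, applying sqrt_w is the standard reduction of weighted least squares to ordinary least squares: scale both X and y by the square root of the sample weights, then solve the unweighted problem. A hedged numpy sketch of the idea (illustrative; not the PR's Rust code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)
w = rng.uniform(0.5, 2.0, size=200)  # per-sample weights

# WLS via scaling by sqrt(w): solves argmin_b sum_i w_i * (y_i - x_i b)^2
sqrt_w = np.sqrt(w)
coef_wls, *_ = np.linalg.lstsq(x * sqrt_w[:, None], y * sqrt_w, rcond=None)

# Cross-check against the normal-equations form (X' W X)^-1 X' W y
coef_ne = np.linalg.solve(x.T @ (w[:, None] * x), x.T @ (w * y))
```

When all weights are 1, the scaling is a pure no-op pass over the data, which is exactly the overhead the PR avoids.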

wukan1986 commented 5 months ago

Thank you, I tested it and it's a bit faster:

(py311) kan@DESKTOP-7DFCADR:~/test1/polars_ols$ python tests/benchmark.py --quiet --fast
benchmark_least_squares_svd: Mean +- std dev: 422 us +- 17 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 341 us +- 60 us
azmyrajab commented 5 months ago

Hi @wukan1986 - I have made a few updates since and tried my best to speed things up on Linux without too much platform-specific optimisation. I think I have reached the point where most of the measured bottleneck is the LAPACK DGELSD Fortran code path, which I can't optimize beyond changing which acceleration library the system links LAPACK against. On Linux, polars-ols now links LAPACK to Intel MKL, which should be a decent option. For reference, numpy typically links to either OpenBLAS or Intel MKL on most distros (run np.show_config() to check).

On my EC2 Linux test machine (so Intel MKL), SVD is roughly 20% faster than np.linalg.lstsq under the rigorous benchmark.

The GitHub CI server's Linux benchmark (https://github.com/azmyrajab/polars_ols/actions/runs/8786320824/job/24108926549) (2000 samples, 5 features) shows around an ~11% improvement over np.linalg.lstsq:

venv/bin/python tests/benchmark.py --quiet --rigorous
benchmark_least_squares_qr: Mean +- std dev: 472 us +- 11 us
benchmark_least_squares_svd: Mean +- std dev: 460 us +- 5 us
benchmark_least_squares_numpy_qr: Mean +- std dev: 490 us +- 101 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 519 us +- 106 us

My MacBook, which uses Apple's Accelerate framework (2000 samples, 5 features), is roughly ~33% faster:

(venv) azmy@myhost polars_ols % python tests/benchmark.py --quiet --rigorous           
benchmark_least_squares_qr: Mean +- std dev: 256 us +- 3 us
benchmark_least_squares_svd: Mean +- std dev: 309 us +- 4 us
benchmark_least_squares_numpy_qr: Mean +- std dev: 518 us +- 201 us
benchmark_least_squares_numpy_svd: Mean +- std dev: 464 us +- 150 us

My MacBook, but with larger data (10_000 samples, 100 features); the gap tightens to ~15% faster:

(venv) azmy@myhost polars_ols % python tests/benchmark.py --quiet --rigorous
benchmark_least_squares_qr: Mean +- std dev: 17.2 ms +- 0.7 ms
benchmark_least_squares_svd: Mean +- std dev: 23.8 ms +- 0.5 ms
benchmark_least_squares_numpy_qr: Mean +- std dev: 51.8 ms +- 18.4 ms
benchmark_least_squares_numpy_svd: Mean +- std dev: 28.1 ms +- 8.3 ms

Hopefully you see something at least in line with numpy whenever you get a chance to re-run. In the meantime, I will close the issue for now, but let me know if you see anything odd.