datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0

Consider compiling with newer CPU flags #29

Open Dandandan opened 2 years ago

Dandandan commented 2 years ago

Rust by default compiles for a very old baseline architecture, which limits the performance of the generated code.

We should probably update this to a newer feature set. An example of how Polars does it:

https://github.com/pola-rs/polars/blob/master/.github/deploy_manylinux.sh#L11

There are some stats on hardware feature availability over here:

https://store.steampowered.com/hwsurvey

SSE2: 100.00%
SSE3: 100.00%
LAHF/SAHF: 99.99%
CMPXCHG16B: 99.98%
SSSE3: 99.27%
SSE4.1: 98.89%
SSE4.2: 98.50%
FCMOV: 97.23%
NTFS: 96.06%
AES: 95.50%
AVX: 94.38%
AVX2: 86.31%

I think we could enable all features up to AVX2, plus AES. AES is used by ahash, which will improve performance in hash joins and hash aggregates. The other features improve overall performance, e.g. in compute kernels, the Parquet reader, and DataFusion code.
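As an alternative to exporting RUSTFLAGS for every build, the feature set could be pinned in a `.cargo/config.toml` checked into the repo. This is a sketch, not something from the thread; the feature list mirrors the survey entries above up to AVX2 and AES, and the target triple is an assumption for Linux wheels:

```toml
# .cargo/config.toml (hypothetical): applied to every cargo/maturin build of this crate
[target.x86_64-unknown-linux-gnu]
rustflags = [
  "-C", "target-feature=+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+aes,+avx,+avx2",
]
```

This way CI and local builds pick up the same flags without anyone having to remember the export.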

matthewmturner commented 2 years ago

I'll play with these flags locally and keep you posted on the impact.

matthewmturner commented 2 years ago

@Dandandan

I've done the following to build the wheel:

export RUSTFLAGS='-C target-feature=+fxsr,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+popcnt,+aes,+avx,+avx2' && maturin build --release

Then I just reinstalled the wheel and reran the benchmark, which produced the following:

q1: 0.043521209000000116
q2: 0.4907338750000001
q3: 2.0281409170000004
q4: 0.03750329200000024
q5: 2.112818584
q6: 2.1120300420000007
q7: 2.0400456249999994
q8: 3.093032082999999
q9: 2.1041081250000016
q10: 50.334135208999996

These results were basically in line with the unoptimized build, so I'm wondering if I've done something wrong.

Any thoughts?

matthewmturner commented 2 years ago

@realno FYI

houqp commented 2 years ago

When I tried target-cpu=skylake for roapi, I got 10-20% speed improvements. Just as a quick test, do you get any performance gain with target-cpu=native?

matthewmturner commented 2 years ago

Below is with target-cpu=native and snmalloc: some queries faster, some slower, roughly in line overall.

q1: 0.05099512500000003
q2: 0.3307659999999999
q3: 1.228696541
q4: 0.062102542000000316
q5: 1.2268319589999996
q6: 1.2571589580000002
q7: 1.1611415420000002
q8: 2.9696968339999996
q9: 0.6929859999999994
q10: 20.191931167