LaihoE / SIMD-itertools

Faster implementations of standard library operations like find, filter, position etc.
168 stars 3 forks source link

Should SIMD_LEN be configurable? #11

Open LaihoE opened 4 months ago

LaihoE commented 4 months ago

Currently SIMD_LEN is set to 32 for everything. Should we just let users pass it for every method? Something like this:

arr.iter().all_equal::<32>()
PaulDotSH commented 4 months ago

Currently SIMD_LEN is set to 32 for everything. Should we just let users pass it for every method? Something like this:

arr.iter().all_equal::<32>()

AVX512 is 512bit, shouldn't it be different depending on what the data type length is?

smu160 commented 4 months ago

This is a great point!

I've tried SIMD_LEN = 64 and SIMD_LEN = 32 just to see the difference. One thing I noticed on godbolt is that unless SIMD_LEN is set to 64, generated asm shows no use of AVX-512. This is even when I set -C target-cpu=znver4 or the more general case of -C target-cpu=x86-64-v4.

You can try it out here -- just modify the SIMD_LEN const at the top. The number of cycles doesn't seem to change (according to LLVM MCA), but the proof is always in the pudding.

I have a 7950x (which has great AVX-512 performance, but still "double-pumped". I'll run benches to see how it looks. I'll just post the criterion plots for comparison.

If needed, there's probably several ways to choose SIMD_LEN size. Perhaps add a build.rs to check for AVX-512 support?

LaihoE commented 4 months ago

Currently SIMD_LEN is set to 32 for everything. Should we just let users pass it for every method? Something like this:

arr.iter().all_equal::<32>()

AVX512 is 512bit, shouldn't it be different depending on what the data type length is?

probably, not sure

LaihoE commented 4 months ago

This is a great point!

I've tried SIMD_LEN = 64 and SIMD_LEN = 32 just to see the difference. One thing I noticed on godbolt is that unless SIMD_LEN is set to 64, generated asm shows no use of AVX-512. This is even when I set -C target-cpu=znver4 or the more general case of -C target-cpu=x86-64-v4.

You can try it out here -- just modify the SIMD_LEN const at the top. The number of cycles doesn't seem to change (according to LLVM MCA), but the proof is always in the pudding.

I have a 7950x (which has great AVX-512 performance, but still "double-pumped". I'll run benches to see how it looks. I'll just post the criterion plots for comparison.

If needed, there's probably several ways to choose SIMD_LEN size. Perhaps add a build.rs to check for AVX-512 support?

Interesting. It seems to also generate avx-512 when you up the datatype size. When I up it to u32's it also generates avx-512 with smaller SIMD_LEN.

I originally picked the 32 based on benchmarks on my machine (5900x no avx-512).

PaulDotSH commented 4 months ago

This is a great point!

I've tried SIMD_LEN = 64 and SIMD_LEN = 32 just to see the difference. One thing I noticed on godbolt is that unless SIMD_LEN is set to 64, generated asm shows no use of AVX-512. This is even when I set -C target-cpu=znver4 or the more general case of -C target-cpu=x86-64-v4.

You can try it out here -- just modify the SIMD_LEN const at the top. The number of cycles doesn't seem to change (according to LLVM MCA), but the proof is always in the pudding.

I have a 7950x (which has great AVX-512 performance, but still "double-pumped". I'll run benches to see how it looks. I'll just post the criterion plots for comparison.

If needed, there's probably several ways to choose SIMD_LEN size. Perhaps add a build.rs to check for AVX-512 support?

I have a very weak machine that overheats and doesn't produce reproducible results, would you mind testing if the data type size matters for the SIMD_LEN variable too?

smu160 commented 4 months ago

@PaulDotSH Sure, no problem!

smu160 commented 4 months ago

I originally picked the 32 based on benchmarks on my machine (5900x no avx-512).

Yea, I figured. I saw you mentioned avx2 on the benchmarks plot. Perhaps we can first figure out the best way to set this SIMD_LEN. Do we go with runtime detection? Compile time?

Now that I think of it... perhaps we just need to benchmark. It may be safe to just use 64 in all cases, but the proof is in the pudding. I'll run the benchmarks.

okaneco commented 3 months ago

+1 for letting users set the constant themselves.

The constant choice is probably going to be a function of the bit-width for the data type you're operating on, the target's vector register width, and LLVM's heuristics for code generation among other things, rather than a general constant. Users will likely want to use different constants whether they're targeting baseline SSE2 vs AVX2 vs AVX-512. This will make it easier to benchmark and tailor for individual use cases as desired codegen can be fiddly.

LLVM seems to pessimize performance when the constant is too large for the target's max vector width for x64. Playing with chunks_exact and larger powers-of-2 shows the kind of degradation in code generation.

From my experience with targeting the Rust baseline (SSE2), the rule of thumb for chunk size tends to be 256-bits or 2x the 128-bit SSE SIMD register size as shown in the table below. Beyond these sizes, the performance tends to degrade.

Integer size SSE SIMD_LEN
u8 32
u16 16
u32/f32 8

I've seen similar behavior for AVX2 with chunk size targeting some multiple of YMM register width before it gets suboptimally unrolled, but I have no experience with AVX-512. Of course, these numbers should be verified with measurement and depend on the corresponding vector instruction existing for the data type size. The practical length choices are probably going to lean more towards 32 and 64 than 8 or 16 but it's worth being able to reach for that when needed.

LaihoE commented 3 months ago

Now that we have multiversioning and know the target max width and the type width maybe we could just pick a good len for the user? I found: https://docs.rs/target-features/latest/target_features/struct.Target.html#method.suggested_simd_width

But I'm not quite sure how to make it work with as_simd() as it expects a const and I'm not able to get it to work with suggested_simd_width(). Is this kind of thing possible?

const BEST_LEN: usize = selected_target!().suggested_simd_width::<T>().unwrap();
let (prefix, simd_data, suffix) = arr.as_simd::<BEST_LEN>();

Now I can get it to work with chunks_exact using let instead:

let best_len = selected_target!().suggested_simd_width::<T>().unwrap();
for chunk in arr.chunks_exact(best_len){
...
}

But the performance of as_simd is significantly better so I don't really want to with this.

PaulDotSH commented 3 months ago

Now that we have multiversioning and know the target max width and the type width maybe we could just pick a good len for the user? I found: https://docs.rs/target-features/latest/target_features/struct.Target.html#method.suggested_simd_width

But I'm not quite sure how to make it work with as_simd() as it expects a const and I'm not able to get it to work with suggested_simd_width(). Is this kind of thing possible?

const BEST_LEN: usize = selected_target!().suggested_simd_width::<T>().unwrap();
let (prefix, simd_data, suffix) = arr.as_simd::<BEST_LEN>();

Now I can get it to work with chunks_exact using let instead:

let best_len = selected_target!().suggested_simd_width::<T>().unwrap();
for chunk in arr.chunks_exact(best_len){
...
}

But the performance of as_simd is significantly better so I don't really want to with this.

We could have a function that can take a const from the user, and one that doesn't and is a simple wrapper with #[inline(always)] and calls the other function with a default value, maybe depending on the type if there's a performance benefit.