Closed TheLudlows closed 2 years ago
I took a look at portable SIMD. While it does define the packed SIMD types, some of the instructions used by the algs presented in the paper are target-feature specific (such as the SSE4.2 cmpistrm).
Another consideration is that the vectorized versions (at least for the two I've looked at, OR and AND) are not safe to use in-place (|=, &=).
A nice feature we have, which Rust makes trivial, is the ability to take ownership of another RoaringBitmap's containers when the right-hand side of an operation is an owned RoaringBitmap. For the context of this discussion I will refer to this as "container recycling".
Vectorization and container recycling are seemingly at odds. When performing an AND, should we use the faster vectorized alg that must allocate, or recycle the allocation and go with the slower one?
Or is there another option? Thinking out loud: We could keep a thread local buffer around, always use the vectorized version (when it's possible, and given it's the best choice). Then swap the pointers. The output vector would then never alias the same memory as either of the two input vectors.
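A minimal sketch of that idea in stable Rust, with a scalar merge loop standing in for the vectorized kernel (the names `SCRATCH` and `intersect_assign` are mine, for illustration only):

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::mem;

thread_local! {
    // Hypothetical per-thread scratch buffer, reused across calls.
    static SCRATCH: RefCell<Vec<u16>> = RefCell::new(Vec::new());
}

/// Sketch of the "shell game": write the intersection into a thread-local
/// buffer (which never aliases either input), then swap it with `lhs`.
/// The scalar loop stands in for the vectorized kernel.
fn intersect_assign(lhs: &mut Vec<u16>, rhs: &[u16]) {
    SCRATCH.with(|scratch| {
        let mut buf = scratch.borrow_mut();
        buf.clear();
        let (mut i, mut j) = (0, 0);
        while i < lhs.len() && j < rhs.len() {
            match lhs[i].cmp(&rhs[j]) {
                Ordering::Less => i += 1,
                Ordering::Greater => j += 1,
                Ordering::Equal => {
                    buf.push(lhs[i]);
                    i += 1;
                    j += 1;
                }
            }
        }
        // The pointer swap: `lhs` now owns the result, and the old
        // allocation becomes the next call's scratch space.
        mem::swap(lhs, &mut *buf);
    });
}
```

Because the output buffer never aliases either input, the in-place restriction on the vectorized kernels would no longer apply.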
Ported the vectorized intersection alg from CRoaring, and tried using it for &= using the thread-local / swap pattern I described.
Fun. I haven't convinced myself it's a good idea, but it sure is fast
Given the SIMD alg requires specific x86 features, I'm not at all sure how to go about testing.
Curious what you think @Kerollmops, @Nemo157
SIMD stuff in my very messy very experimental branch
Wow, is it me or is it way faster than the C equivalent?
Given the SIMD alg requires specific x86 features, I'm not at all sure how to go about testing.
This is one of the reasons why I am not sure either, do we really want to bring raw SIMD algorithms into this crate? Can't we rely on the soon-to-be-released std::simd module?
SIMD stuff in my very messy very experimental branch
That's impressive, the speed is quite good and, if I'm not wrong, even faster than C, but... have you seen the complexity of it? The amount of code we will need to maintain just for array-array operations, and on the x86 platform only. Do we want to maintain these 1095 lines of raw SIMD functions? Maybe that's fair and we effectively want/need that
Wow, is it me or is it way faster than the C equivalent?
The C impl only uses the vectorized version for & but not &= because it's not safe to use in place.
So I got around that by keeping a thread local array to use as a buffer, then swapping them. Reminds me of a shell game
This is one of the reasons why I am not sure either, do we really want to bring raw SIMD algorithms into this crate? Can't we rely on the soon-to-be-released std::simd module?
AFAICT the std simd crate will only provide data type definitions and ops to load, store, mask, and swizzle https://doc.rust-lang.org/nightly/std/simd/index.html
I called out the lack of some x86-specific instructions like cmpistrm in one of my other comments.
Maybe it's possible to write a less performant but more generic SIMD intersection alg that only uses mask / shuffle. I'd be lying if I said I really understood how it works.
The C impl only uses the vectorized version for & but not &= because it's not safe to use in place. So I got around that by keeping a thread local array to use as a buffer, then swapping them. Reminds me of a shell game
Have you tried to just always allocate the buffer instead of using a thread-local variable? I would be glad if we can avoid using that, I am sure that malloc is fast enough to generate an uninitialized memory area for us.
Maybe it's possible to write a less performant but more generic SIMD intersection alg that only uses mask / shuffle. I'd be lying if I said I really understood how it works.
Maybe we can keep the work you have done and make sure that it doesn't impact the compilation on other platforms, e.g. ARM; I am sure that's what you thought of! The thing that makes MeiliSearch use RoaringBitmap and not the CRoaring crate is that it compiles on all the platforms I tried. I would at least like it to continue in this way.
A good way to be sure that the codebase is portable could be to compile it for wasm-unknown-unknown. I am OK with accepting more complexity in the crate if we make it way faster than it is right now, and that's the case with your experiments
Have you tried to just always allocate the buffer instead of using a thread-local variable? I would be glad if we can avoid using that, I am sure that malloc is fast enough to generate an uninitialized memory area for us.
I don't think this is in the "spirit" of &=. If we're going to alloc/free a bunch of temporary vecs, what makes it different from
let a = RoaringBitmap::arbitrary();
let b = RoaringBitmap::arbitrary();
let a = a & &b;
Also: memory fragmentation.
I suppose in the case of & and - we could stack allocate 8k then memcpy into the vec, as the upper bound on the cardinality of the result is max(card_lhs, card_rhs). With | and xor it's card_lhs + card_rhs, so we'd need 16k, or to be clever in ways that would slow down the hot code path. For example: abort the array-array alg when the buffer would overflow, signaling to the caller to create a bitmap.
pub fn intersect_assign_vector_stack(lhs: &mut Vec<u16>, rhs: &[u16]) {
    // Stack-allocated scratch space: an array container holds at most 4096
    // values, so the intersection always fits.
    let mut buf: [u16; 4096] = [0; 4096];
    let len = unsafe {
        // The vectorized kernel writes the intersection into `buf`, which
        // never aliases either input, and returns the result cardinality.
        intersect_vector16(
            lhs.as_ptr(),
            lhs.len(),
            rhs.as_ptr(),
            rhs.len(),
            buf.as_mut_ptr(),
        )
    };
    // Copy the result back into lhs and shrink it to the new cardinality.
    lhs.as_mut_slice()[..len].copy_from_slice(&buf[..len]);
    lhs.truncate(len);
}
Also: memory fragmentation.
No. It's dropped after use. It won't fragment memory.
I am sure that malloc is fast enough to generate an uninitialized memory area for us.
In my informal testing, comparing runtime:
malloc < stack+memcpy < thread local
Thinking out loud some more:
We don't generally operate on containers in isolation, so it could be a worthwhile future optimization to "weave" an array container through the op over many containers, swapping them as we go. :)
Also, I wonder why CRoaring isn't doing this.
Do we want to maintain these 1095 lines of raw SIMD functions?
To be fair, most of it is LUTs. Also, this is only for 2 of the 4 ops.
Maybe we can keep the work you have done and make sure that it doesn't impact the compilation on other platforms e.g. ARM, I am sure that's what you thought of! The thing that makes MeiliSearch use RoaringBitmap and not the CRoaring crate is because it compiles on all the platforms I tried. I would at least like it to continue in this way.
Yes. Although this experimental code is wildly unsafe and assumes the platform and required features, it's not too much effort to conditionally compile. With a medium level of effort, we could also add runtime detection. Otherwise, without specifying target-features or target-cpu in the build, none of this will be compiled.
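Runtime detection in Rust is typically done with the `is_x86_feature_detected!` macro, gated so the query itself only compiles on x86. A minimal sketch (the function name is hypothetical; a real dispatcher would use this to choose between a `#[target_feature]`-annotated kernel and a scalar fallback):

```rust
// Compile the CPUID query only on x86_64; report "unavailable" elsewhere.
#[cfg(target_arch = "x86_64")]
fn simd_kernel_available() -> bool {
    std::is_x86_feature_detected!("sse4.2")
}

// On non-x86 targets the vectorized x86 kernel simply does not exist.
#[cfg(not(target_arch = "x86_64"))]
fn simd_kernel_available() -> bool {
    false
}
```

The benefit over `target-feature`/`target-cpu` build flags is that one binary works on any CPU, paying only a cheap branch per dispatch.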
In my informal testing, comparing runtime: malloc < stack+memcpy < thread local
Do you have some benchmark graphs? Just to see the impact of using your stack+memcpy technique compared to the thread-local one?
About the thread-local technique: I am worried about platform requirements, i.e. wasm, and this is why I would prefer we use simple and common techniques to make sure the crate compiles everywhere. As you can see in the documentation of the thread-local LocalKey type, there is even a section explaining that sometimes destructors are not run. Even if it is not quite an issue, because even if the destructor of the Vec inside isn't run, the program stops and the OS frees all the memory anyway.
With a medium level of effort, we could also add runtime detection.
I like that idea, the only downside of it is code complexity.
To be fair, most of it is LUTs. Also, this is only for 2 of the 4 ops.
What do you mean by LUTs, here?
With | and xor it's card_lhs + card_rhs, so we'd need 16k, or to be clever in ways that would slow down the hot code path. For example: abort the array-array alg when the buffer would overflow, signaling to the caller to create a bitmap.
Hum... you are right! If we can find a way to abort the operation when it detects that there are too many values we can keep a small stack size and continue with a bitmap, indeed. But that would be a brand new algorithm to create.
Maybe we can start by using this thread_local! macro everywhere we need it and remove it once we are comfortable enough with a new way to compute things without this convenience.
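The abort-on-overflow union discussed above could be sketched like this (a scalar sketch with names of my own; the real version would be the vectorized kernel with the same early-exit contract):

```rust
/// Scalar sketch of a union into a fixed 4096-element buffer that aborts
/// (returns None) when the result would no longer fit in an array container,
/// signaling the caller to switch to a bitmap container instead.
fn union_into_buf(lhs: &[u16], rhs: &[u16], buf: &mut [u16; 4096]) -> Option<usize> {
    let (mut i, mut j, mut k) = (0, 0, 0);
    loop {
        // Classic sorted-merge step: pick the smaller head, emit equal
        // values once.
        let next = match (lhs.get(i), rhs.get(j)) {
            (Some(&a), Some(&b)) if a < b => { i += 1; a }
            (Some(&a), Some(&b)) if a > b => { j += 1; b }
            (Some(&a), Some(_)) => { i += 1; j += 1; a } // equal: emit once
            (Some(&a), None) => { i += 1; a }
            (None, Some(&b)) => { j += 1; b }
            (None, None) => return Some(k), // done: k is the cardinality
        };
        if k == buf.len() {
            return None; // would overflow: caller should build a bitmap
        }
        buf[k] = next;
        k += 1;
    }
}
```

The overflow check happens once per emitted value rather than once per memory access, which matches the "bounds checks back in, once per loop iter" framing later in the thread.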
In my informal testing, comparing runtime: malloc < stack+memcpy < thread local
Do you have some benchmark graphs? Just to see the impact of using your stack+memcpy technique compared to the thread-local one?
About the thread-local technique: I am worried about platform requirements, i.e. wasm, and this is why I would prefer we use simple and common techniques to make sure the crate compiles everywhere. As you can see in the documentation of the thread-local LocalKey type, there is even a section explaining that sometimes destructors are not run. Even if it is not quite an issue, because even if the destructor of the Vec inside isn't run, the program stops and the OS frees all the memory anyway.
"<" meaning better, i.e. less runtime. In other words: you are correct, malloc/free is faster! I should have known better than to try to outsmart jemalloc.
Hum... you are right! If we can find a way to abort the operation when it detects that there are too many values we can keep a small stack size and continue with a bitmap, indeed. But that would be a brand new algorithm to create.
We would effectively be putting bounds checks back in, albeit once per loop iter rather than on every memory access. I'm not convinced it would be worth the complexity for the small incremental gain.
What do you mean by LUTs, here?
Look Up Tables.
It's ~325 LoC for the union and intersect algs, and ~700 lines of LUTs.
I looked into what it would take to translate the vectorized ops to std::simd. There are some instructions that are not supported by std::simd.
Some can be emulated: a min can be expressed in std::simd by doing a.lanes_le(b).select(a, b). LLVM will optimize this to a min for the target instruction set when a min instruction is available. Amazing. [1] The same works for max, replacing lanes_le with lanes_gt.
The real gaps are bitmask (movemask) and dynamic swizzle: std::simd only has swizzle by a static vector. Both of these will likely be implemented by std::simd eventually. For now we can write functions that use conditional compilation to call the appropriate arch intrinsic when the instruction is available, and fall back to a scalar implementation when none is available. [2] Once std::simd adds the methods, we can migrate to them. [3] [4]
[1] Here is an example of the sse_merge function from CRoaring translated to std::simd. Compare the x86 asm to the C source
[2] bitmask and swizzle shims
[3] swizzle issue in std::simd
[4] bitmask issue in std::simd
Once we have these basic building blocks it would be relatively easy to translate CRoaring's union_vector16 to std::simd and use slices for inputs and output rather than passing raw pointers around. The one instruction I haven't figured out how to replace is cmpistrm.
TLDR: We can use std::simd if we provide a couple of shims for the functions it does not provide. Would we like to start a discussion re: whether or not we want to add SIMD to the project, plus which set of SIMD primitives we should be coding against, or should we wait until we have a better understanding of what's required for And and Sub?
Got it. Implemented how cmpistrm was used in terms of or and shift. It compiles to ~30 instrs, but cmpistrm is a low-throughput instruction; on my hardware, it benchmarked the same when I replaced it with the or-shift.
It should be easy now to replace the rest of the alg with std simd. The rest is just the "machinery" that does the looping, loads, and stores.
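As I read the or-shift description (my interpretation, not the actual branch code), the idea is to test each lane of one register for membership in another by comparing against successive rotations and OR-ing the equality masks. A scalar sketch:

```rust
/// For each lane of `a`, test membership in `b` by comparing `a` against
/// every rotation of `b` and OR-ing the per-lane equality masks together.
/// Vectorized, each inner loop is one SIMD compare plus an OR, and the
/// rotation is a shuffle/shift.
fn any_eq_mask(a: [u16; 8], b: [u16; 8]) -> u8 {
    let mut mask = 0u8;
    let mut rot = b;
    for _ in 0..8 {
        for i in 0..8 {
            if a[i] == rot[i] {
                mask |= 1 << i;
            }
        }
        rot.rotate_left(1); // next rotation of b
    }
    mask
}
```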
Wow! You are such a beast! That's amazing
According to the benchmarks you sent, we are even with the C and x86_simd versions with std::simd. Does it mean that your algorithm is maybe easier to read but also compatible with more targets than just x86, i.e. ARM, wasm?
being better, meaning less runtime. In other words: you are correct, malloc/free is faster! I should have known better than to try to outsmart jemalloc.
Does the pairwise_and_assign implementation in this last benchmark simply call jemalloc now? Or does it use thread_local!?
[...] whether or not we want to add SIMD to the project, plus which set of SIMD primitives we should be coding against [...]
I think we, and more specifically you, found the best compromise: using std::simd as much as possible. This should bring more control to the codebase, i.e. easier to maintain/understand, and be compatible with more targets than raw x86 unsafe intrinsics. If you are able to find a way to use std::simd for most parts of all the algorithms, we can concede to implementing some functions by hand as you've done with cmpistrm; it seems to work great and is maintainable enough.
According to the benchmarks you sent, we are even with the C and x86_simd versions with std::simd. Does it mean that your algorithm is maybe easier to read but also compatible with more targets than just x86, i.e. ARM, wasm?
It's actually slightly slower than the x86 intrinsics. I think it's worth it, given it runs on x86, ARM, and WASM.
Does the pairwise_and_assign implementation in this last benchmark simply call jemalloc now? Or does it use thread_local!?
Yes. No thread local in this bench.
we can concede to implement some functions by hand as you've done with cmpistrm, it seems to work great and is maintainable enough.
Rough overview:
mod util contains some helpers that are implemented in terms of std::simd
mod shim contains some compatibility shims that can't be implemented in terms of std::simd (or at least, not yet)
What's fun with computers is the different results you can get when benchmarking on different versions of a CPU. Here is what I see when benchmarking on my iMac 2019, Intel Core i9, 8 cores, 3.6 GHz. I don't know why, but the C version doesn't seem to find the desired extensions to compile properly. I set RUSTFLAGS="-C target-cpu=native" before benchmarking. Also, note that it is a home computer and that there is a lot of noise on the CPU.
Yes. No thread local in this bench.
This is great! This is so good!
Rough overview:
mod util contains some helpers that are implemented in terms of std::simd
mod shim contains some compatibility shims that can't be implemented in terms of std::simd (or at least, not yet)
I looked at your code; it's indeed very good that with std::simd we will be able to use the same codebase for all of the targets. I tried out your branch on my MacBook Pro M1 (ARM), here are the results. I don't understand why the graphs are ugly, though. Performance is not as good as on x86, probably due to the places where NEON direct calls are missing, I guess.
@Kerollmops
My translation workflow goes something like this
TLDR: Be aware that in this branch, for any given commit, "cur" might be anything, as it calls the default op, which I might have pointed at any given implementation for testing.
I've not gated the x86 unsafe code with compile-time checks. Your compiler might emit any garbage. I hope it does not eat your lunch. :)
Super interesting that the rust beats C for scalar ARM. "opt_unsafe" is a pretty straightforward port of CRoaring scalar code.
My hypothesis: We have those two different implementations of intersect_skewed. Rust can statically guarantee to LLVM that the mutable ref will not alias the two immutable refs, which enables LLVM to do optimizations that it can't with C.
Performance is not as good as on x86, probably due to the places where NEON direct calls are missing, I guess.
That sounds about right. My high level plan
What I'm not at all sure of is how to integrate testing and CI. Qemu?
Also undecided on bounds checks. The classic safety/performance/maintainability trade-offs in contention.
My intuition is to leave bounds checks in and later we can profile and look at the call stack, removing from only the hottest of hot code paths. Or perhaps we could come up with ways to express the algs in terms of iterators and bury the unsafe code in them.
I suspect there are still some lower-hanging fruit to optimize without sacrificing safety. I'll open some more issues. :)
Also undecided on bounds checks. The classic safety/performance/maintainability trade-offs in contention.
My intuition is to leave bounds checks in and later we can profile and look at the call stack, removing from only the hottest of hot code paths. Or perhaps we could come up with ways to express the algs in terms of iterators and bury the unsafe code in them.
I think that we should indeed use as much safe code as possible. I myself have fallen into this trap of using too much unsafe and losing performance because of it. You can read this small story of implementing the slice_group_by methods with as much safe code as possible and gaining performance compared to a fully unsafe implementation!
Sometimes, even putting assert!s in the right places helps the compiler a lot!
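To illustrate the assert! point: a single up-front assert can teach LLVM a fact that lets it elide per-iteration bounds checks (a hypothetical helper, for illustration):

```rust
/// After the assert, the optimizer knows `len <= src.len()` and
/// `len <= dst.len()`, so the indexing in the loop below needs no
/// per-iteration bounds checks and the loop can even be vectorized.
fn copy_prefix(src: &[u16], dst: &mut [u16], len: usize) {
    assert!(len <= src.len() && len <= dst.len());
    for i in 0..len {
        dst[i] = src[i]; // provably in-bounds thanks to the assert
    }
}
```

The assert costs one branch at function entry instead of two per iteration, and keeps the code fully safe.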
What I'm not at all sure of is how to integrate testing and CI. Qemu?
I looked at how the RoaringBitmap/CRoaring library was testing this in their CI, and it seems like they test it on a Windows ARM machine. Is it a custom one? I didn't check further. Maybe @lemire can help us on this subject?
OK. Done with the initial impl + added proptests for xor. Going to run proptests on a generator that only creates array stores for a few hours.
I forgot to mention, I stripped out all the unsafe code around SIMD. (prior to test and bench)
It survived 4 hours of proptest without any failures or crashes.
Here is the full report of the op benchmarks. Note that unlike the Array-Array only benchmarks I shared earlier, this is for the full dataset and includes Array-Bitmap and Bitmap-Bitmap operations. It's surprisingly comparable to CRoaring in my environment. For reference I've got a Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz on a MacBook Pro. ROARING_ARCH=native RUSTFLAGS='-C target-cpu=native'
I'm going to clean it all up and open a PR in the next few days.
I just forgot about that, but the current CI only checks that the program works on the stable channel of Rust; however, you introduce the use of std::simd, which can only be used on the nightly channel with the portable_simd feature. We must add a new job to the CI and bors to test on the nightly channel.
I would like to be able to use roaring-rs on the stable channel like before. Would it be possible to introduce a new nightly, portable_simd, or simply simd feature in the Cargo.toml that will enable your work? This feature would simply enable the portable_simd std feature.
// At the top of the lib.rs
#![cfg_attr(feature = "simd", feature(portable_simd))]
It's surprisingly comparable to CRoaring in my environment.
Yup, that's impressive how hard it is now to see the performance difference between the C and Rust code! You've done such an impressive amount of work! Could you send me an email, please? I can't find a contact channel on your profile.
Yup, that's impressive how hard it is now to see the performance difference between the C and Rust code! You've done such an impressive amount of work! Could you send me an email, please? I can't find a contact channel on your profile.
Thanks. Sent you an email. =)
I made a few discoveries today. Good news first: std::simd now has a to_bitmask method. That means we don't need to have any fallbacks to intrinsics or scalar code!
Bad news: The generator I used for the proptests would generate some roaring bitmaps with disjoint container sets. I fixed that, but got some failures in sub and xor.
@Kerollmops Could you run tests and bench on your hardware on 00275309692b37e19153735c1c2768deffa2ad0c
Hey @saik0,
All the tests are passing on my M1. I didn't run the proptests though, but I can. I just don't remember how
Here are the benchmarks on my MacBook Pro M1, which I carefully plugged in to make sure performance is good. criterion on M1.zip
- No more SIMD fallback! Super curious to see perf on your M1.
Do you mean that no more functions are using scalar implementations?
- The x86 stuff should just no-op on other arch.
What do you mean by that? Do you mean that we can simply compile it on ARM and it will just ignore that part of the code and only compile the aarch64 part?
All the tests are passing on my M1. I didn't run the proptests though, but I can. I just don't remember how
They are part of the normal suite. You may wish to set the PROPTEST_CASES env var to some large number as a property-based quasi-fuzz-test.
Do you mean that no more functions are using scalar implementations?
I do! The jump table idea is a truly terrible one, but it does work. We will need to keep it until LLVM exposes a dynamic swizzle and std::simd updates to use it. Then I can atone for my sin.
Briefly: to call it, a stack frame must be pushed and popped inside the hot vectorized loop. Inlining it would mean huge instruction bloat. (On my HW it's about 5% faster if we force inline anyway.)
What do you mean by that? Do you mean that we can simply compile it on ARM and it will just ignore that part of the code and only compile the aarch64 part?
Yes.
The x86 stuff won't make it into a PR, but it is useful for comparing std::simd directly to using x86 intrinsics. There are too many other variables when comparing it to C.
Here are the benchmarks on my MacBook Pro M1, which I carefully plugged in to make sure performance is good. criterion on M1.zip
Wow, and and sub!
Not sure what's going on with or and xor that C's scalar code is so much faster. Interesting.
Wow, and and sub!
I am also impressed, especially because the kind of flamegraph that we are trying to optimize relies a lot on the bit_assign on array containers (not sure if those ops are array-array or array-bitmap), but also on the bitor_assign that is already a lot faster than the previous scalar one.
Not sure what's going on with or and xor that C's scalar code is so much faster. Interesting.
Do you mean that the C codebase hasn't been ported to ARM/aarch64? Do you think that different algorithms could be used to fit the ARM instruction extensions, and that that's the reason they didn't port it to this arch yet?
They are part of the normal suite. You may wish to set the PROPTEST_CASES env var to some large number as a property-based quasi-fuzz-test.
I just ran them with PROPTEST_CASES=1000 and everything works fine, thank you!
Do you mean that the C codebase hasn't been ported to ARM/aarch64? Do you think that different algorithms could be used to fit the ARM instruction extensions, and that that's the reason they didn't port it to this arch yet?
Yes, Arm has a very different suite of LUT-style operations. It doesn't have quite the same specialized string instructions (which can wind up being inferior to more "straightforward" SIMD implementations if used incorrectly, anyways).
As far as "use intrinsics or not" goes, though: there are conversions to and from the std::simd types to their std::arch equivalents, where applicable, e.g. i8x16 to __m128i. This is intentional, in that it allows people to use std::simd as far as it can take them and then, for a given architecture, use that architecture's std::arch intrinsics for further optimizations. Not that I want to discourage you from relying on what we do have! I am quite impressed with what you have managed here. Just noting that it is an option.
38 Comments later. @TheLudlows Yes. SIMD is introduced.
@Kerollmops This can be closed.
The work you have done on that side of the library is awesome! It brings the library to another level.
The most impressive part for me is the fact that you managed to use the new std::simd module. It will be so cool when it is stabilized: all of those platforms supported by doing nothing.
Thanks to @workingjubilee too for your help on this subject; hope this project helps you prioritize features for the std::simd module too.
Hello, will SIMD be introduced?