Closed TheLudlows closed 2 years ago
I took a look at portable SIMD. While it does define the packed SIMD types, some of the instructions used by the algs presented in the paper are target-feature specific (such as the SSE4.2 cmpistrm).
Another consideration is that the vectorized versions (at least for the two I've looked at, OR and AND) are not safe to use in-place (|=, &=).
A nice feature we have, which Rust makes trivial, is the ability to take ownership of another RoaringBitmap's containers when the right-hand side of an operation is an owned RoaringBitmap. For the context of this discussion I will refer to this as "container recycling".
Vectorization and container recycling are seemingly at odds. When performing an AND, should we use the faster vectorized alg that must allocate, or recycle the allocation and go with the slower one?
Or is there another option? Thinking out loud: We could keep a thread local buffer around, always use the vectorized version (when it's possible, and given it's the best choice). Then swap the pointers. The output vector would then never alias the same memory as either of the two input vectors.
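A minimal sketch of that idea in stable Rust, with a scalar merge loop standing in for the vectorized kernel (the names `SCRATCH` and `intersect_assign` are mine, for illustration only):

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::mem;

thread_local! {
    // Hypothetical per-thread scratch buffer, reused across calls.
    static SCRATCH: RefCell<Vec<u16>> = RefCell::new(Vec::new());
}

/// Sketch of the "shell game": write the intersection into a thread-local
/// buffer (which never aliases either input), then swap it with `lhs`.
/// The scalar loop stands in for the vectorized kernel.
fn intersect_assign(lhs: &mut Vec<u16>, rhs: &[u16]) {
    SCRATCH.with(|scratch| {
        let mut buf = scratch.borrow_mut();
        buf.clear();
        let (mut i, mut j) = (0, 0);
        while i < lhs.len() && j < rhs.len() {
            match lhs[i].cmp(&rhs[j]) {
                Ordering::Less => i += 1,
                Ordering::Greater => j += 1,
                Ordering::Equal => {
                    buf.push(lhs[i]);
                    i += 1;
                    j += 1;
                }
            }
        }
        // The pointer swap: `lhs` now owns the result, and the old
        // allocation becomes the next call's scratch space.
        mem::swap(lhs, &mut *buf);
    });
}
```

Because the output buffer never aliases either input, the in-place restriction on the vectorized kernels would no longer apply.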
Ported the vectorized intersection alg from CRoaring, and tried using it for &= using the thread-local / swap pattern I described.
Fun. I haven't convinced myself it's a good idea, but it sure is fast
Given the SIMD alg requires specific x86 features, I'm not at all sure how to go about testing.
Curious what you think @Kerollmops, @Nemo157
SIMD stuff in my very messy very experimental branch
Wow, is it me or is it way faster than the C equivalent?
Given the SIMD alg requires specific x86 features, I'm not at all sure how to go about testing.
This is one of the reasons why I am not sure either, do we really want to bring raw SIMD algorithms into this crate? Can't we rely on the soon-to-be-released std::simd module?
SIMD stuff in my very messy very experimental branch
That's impressive, the speed is quite good and, if I'm not wrong, even faster than C, but... have you seen the complexity of it? The amount of code we will need to maintain just for array-array operations, and on the x86 platform only. Do we want to maintain these 1095 lines of raw SIMD functions? Maybe that's fair and we effectively want/need that
Wow, is it me or is it way faster than the C equivalent?
The C impl only uses the vectorized version for & but not &= because it's not safe to use in place.
So I got around that by keeping a thread local array to use as a buffer, then swapping them. Reminds me of a shell game
This is one of the reasons why I am not sure either, do we really want to bring raw SIMD algorithms into this crate? Can't we rely on the soon-to-be-released std::simd module?
AFAICT the std simd crate will only provide data type definitions and ops to load, store, mask, and swizzle https://doc.rust-lang.org/nightly/std/simd/index.html
I called out the lack of some x86-specific instructions like cmpistrm in one of my other comments.
Maybe it's possible to write a less performant but more generic SIMD intersection alg that only uses mask / shuffle. I'd be lying if I said I really understood how it works.
The C impl only uses the vectorized version for & but not &= because it's not safe to use in place. So I got around that by keeping a thread local array to use as a buffer, then swapping them. Reminds me of a shell game
Have you tried to just always allocate the buffer instead of using a thread-local variable? I would be glad if we can avoid using that, I am sure that malloc is fast enough to generate an uninitialized memory area for us.
Maybe it's possible to write a less performant but more generic SIMD intersection alg that only uses mask / shuffle. I'd be lying if I said I really understood how it works.
Maybe we can keep the work you have done and make sure that it doesn't impact the compilation on other platforms, e.g. ARM; I am sure that's what you thought of! The thing that makes MeiliSearch use RoaringBitmap and not the CRoaring crate is that it compiles on all the platforms I tried. I would at least like it to continue in this way.
A good way to be sure that the codebase is portable could be to compile it for wasm-unknown-unknown. I am OK with accepting more complexity in the crate if we make it way faster than it is right now, and that's the case with your experiments
Have you tried to just always allocate the buffer instead of using a thread-local variable? I would be glad if we can avoid using that, I am sure that malloc is fast enough to generate an uninitialized memory area for us.
I don't think this is in the "spirit" of &=. If we're going to alloc/free a bunch of temporary vecs, what makes it different from
let a = RoaringBitmap::arbitrary();
let b = RoaringBitmap::arbitrary();
let a = a & &b;
Also: memory fragmentation.
I suppose in the case of & and - we could stack allocate 8k then memcpy into the vec, as the upper bound on the cardinality of the result is max(card_lhs, card_rhs). With | and xor it's card_lhs + card_rhs, so we'd need 16k, or to be clever in ways that would slow down the hot code path. For example: abort the array-array alg when the buffer would overflow, signaling to the caller to create a bitmap.
pub fn intersect_assign_vector_stack(lhs: &mut Vec<u16>, rhs: &[u16]) {
    // Stack-allocated scratch space: an array container holds at most 4096
    // values, so the intersection always fits.
    let mut buf: [u16; 4096] = [0; 4096];
    let len = unsafe {
        // The vectorized kernel writes the intersection into `buf`, which
        // never aliases either input, and returns the result cardinality.
        intersect_vector16(
            lhs.as_ptr(),
            lhs.len(),
            rhs.as_ptr(),
            rhs.len(),
            buf.as_mut_ptr(),
        )
    };
    // Copy the result back into lhs and shrink it to the new cardinality.
    lhs.as_mut_slice()[..len].copy_from_slice(&buf[..len]);
    lhs.truncate(len);
}
Also: memory fragmentation.
No. It's dropped after use. It won't fragment memory.
I am sure that malloc is fast enough to generate an uninitialized memory area for us.
In my informal testing, comparing runtime:
malloc < stack+memcpy < thread local
Thinking out loud some more:
We don't generally operate on containers in isolation, so it could be a worthwhile future optimization to "weave" an array container through the op over many containers, swapping them as we go. :)
Also, I wonder why CRoaring isn't doing this.
Do we want to maintain these 1095 lines of raw SIMD functions?
To be fair, most of it is LUTs. Also, this is only for 2 of the 4 ops.
Maybe we can keep the work you have done and make sure that it doesn't impact the compilation on other platforms e.g. ARM, I am sure that's what you thought of! The thing that makes MeiliSearch use RoaringBitmap and not the CRoaring crate is because it compiles on all the platforms I tried. I would at least like it to continue in this way.
Yes. Although this experimental code is wildly unsafe and assumes the platform and required features, it's not too much effort to conditionally compile. With a medium level of effort, we could also add runtime detection. Otherwise, without specifying target-features or target-cpu in the build, none of this will be compiled.
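Runtime detection in Rust is typically done with the `is_x86_feature_detected!` macro, gated so the query itself only compiles on x86. A minimal sketch (the function name is hypothetical; a real dispatcher would use this to choose between a `#[target_feature]`-annotated kernel and a scalar fallback):

```rust
// Compile the CPUID query only on x86_64; report "unavailable" elsewhere.
#[cfg(target_arch = "x86_64")]
fn simd_kernel_available() -> bool {
    std::is_x86_feature_detected!("sse4.2")
}

// On non-x86 targets the vectorized x86 kernel simply does not exist.
#[cfg(not(target_arch = "x86_64"))]
fn simd_kernel_available() -> bool {
    false
}
```

The benefit over `target-feature`/`target-cpu` build flags is that one binary works on any CPU, paying only a cheap branch per dispatch.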
In my informal testing, comparing runtime: malloc < stack+memcpy < thread local
Do you have some benchmark graphs? Just to see the impact of using your stack+memcpy technique compared to the thread-local one?
About the thread-local technique: I am worried about platform requirements, i.e. wasm, and this is why I would prefer we use simple and common techniques to make sure the crate compiles everywhere. As you can see in the documentation of the thread-local LocalKey type, there is even a section explaining that sometimes destructors are not run. Even if it is not quite an issue, because even if the destructor of the Vec inside isn't run, the program stops and the OS frees all the memory anyway.
With a medium level of effort, we could also add runtime detection.
I like that idea, the only downside of it is code complexity.
To be fair, most of it is LUTs. Also, this is only for 2 of the 4 ops.
What do you mean by LUTs, here?
With | and xor it's card_lhs + card_rhs, so we'd need 16k, or to be clever in ways that would slow down the hot code path. For example: abort the array-array alg when the buffer would overflow, signaling to the caller to create a bitmap.
Hum... you are right! If we can find a way to abort the operation when it detects that there are too many values we can keep a small stack size and continue with a bitmap, indeed. But that would be a brand new algorithm to create.
Maybe we can start by using this thread_local! macro everywhere we need it and remove it once we are comfortable enough with a new way to compute things without this convenience.
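The abort-on-overflow union discussed above could be sketched like this (a scalar sketch with names of my own; the real version would be the vectorized kernel with the same early-exit contract):

```rust
/// Scalar sketch of a union into a fixed 4096-element buffer that aborts
/// (returns None) when the result would no longer fit in an array container,
/// signaling the caller to switch to a bitmap container instead.
fn union_into_buf(lhs: &[u16], rhs: &[u16], buf: &mut [u16; 4096]) -> Option<usize> {
    let (mut i, mut j, mut k) = (0, 0, 0);
    loop {
        // Classic sorted-merge step: pick the smaller head, emit equal
        // values once.
        let next = match (lhs.get(i), rhs.get(j)) {
            (Some(&a), Some(&b)) if a < b => { i += 1; a }
            (Some(&a), Some(&b)) if a > b => { j += 1; b }
            (Some(&a), Some(_)) => { i += 1; j += 1; a } // equal: emit once
            (Some(&a), None) => { i += 1; a }
            (None, Some(&b)) => { j += 1; b }
            (None, None) => return Some(k), // done: k is the cardinality
        };
        if k == buf.len() {
            return None; // would overflow: caller should build a bitmap
        }
        buf[k] = next;
        k += 1;
    }
}
```

The overflow check happens once per emitted value rather than once per memory access, which matches the "bounds checks back in, once per loop iter" framing later in the thread.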
In my informal testing, comparing runtime: malloc < stack+memcpy < thread local
Do you have some benchmark graphs? Just to see the impact of using your stack+memcpy technique compared to the thread-local one?
About the thread-local technique: I am worried about platform requirements, i.e. wasm, and this is why I would prefer we use simple and common techniques to make sure the crate compiles everywhere. As you can see in the documentation of the thread-local LocalKey type, there is even a section explaining that sometimes destructors are not run. Even if it is not quite an issue, because even if the destructor of the Vec inside isn't run, the program stops and the OS frees all the memory anyway.
"<" meaning better, i.e. less runtime. In other words: you are correct, malloc/free is faster! I should have known better than to try to outsmart jemalloc.
Hum... you are right! If we can find a way to abort the operation when it detects that there are too many values we can keep a small stack size and continue with a bitmap, indeed. But that would be a brand new algorithm to create.
We would effectively be putting bounds checks back in, albeit once per loop iter rather than on every memory access. I'm not convinced it would be worth the complexity for the small incremental gain.
What do you mean by LUTs, here?
Look Up Tables.
It's ~325 LoC for the union and intersect algs, and ~700 lines of LUTs.
I looked into what it would take to translate the vectorized ops to std::simd. There are some instructions that are not supported by std::simd.
Some can be emulated: a min can be expressed in std::simd by doing a.lanes_le(b).select(a, b). LLVM will optimize this to a min for the target instruction set when a min instruction is available. Amazing. [1] The same works for max, replacing lanes_le with lanes_gt.
The real gaps are bitmask (movemask) and dynamic swizzle: std::simd only has swizzle by a static vector. Both of these will likely be implemented by std::simd eventually. For now we can write functions that use conditional compilation to call the appropriate arch intrinsic when the instruction is available, and fall back to a scalar implementation when none is available. [2] Once std::simd adds the methods, we can migrate to them. [3] [4]
[1] Here is an example of the sse_merge function from CRoaring translated to std::simd. Compare the x86 asm to the C source
[2] bitmask and swizzle shims
[3] swizzle issue in std::simd
[4] bitmask issue in std::simd
Once we have these basic building blocks it would be relatively easy to translate CRoaring's union_vector16 to std::simd and use slices for inputs and output rather than passing raw pointers around. The one instruction I haven't figured out how to replace is cmpistrm.
TLDR: We can use std::simd if we provide a couple of shims for the functions it does not provide. Would we like to start a discussion re: whether or not we want to add SIMD to the project, plus which set of SIMD primitives we should be coding against, or should we wait until we have a better understanding of what's required for And and Sub?
Got it. Implemented how cmpistrm was used in terms of or and shift. It compiles to ~30 instrs, but cmpistrm is a low-throughput instruction; on my hardware, it benchmarked the same when I replaced it with the or-shift.
It should be easy now to replace the rest of the alg with std simd. The rest is just the "machinery" that does the looping, loads, and stores.
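As I read the or-shift description (my interpretation, not the actual branch code), the idea is to test each lane of one register for membership in another by comparing against successive rotations and OR-ing the equality masks. A scalar sketch:

```rust
/// For each lane of `a`, test membership in `b` by comparing `a` against
/// every rotation of `b` and OR-ing the per-lane equality masks together.
/// Vectorized, each inner loop is one SIMD compare plus an OR, and the
/// rotation is a shuffle/shift.
fn any_eq_mask(a: [u16; 8], b: [u16; 8]) -> u8 {
    let mut mask = 0u8;
    let mut rot = b;
    for _ in 0..8 {
        for i in 0..8 {
            if a[i] == rot[i] {
                mask |= 1 << i;
            }
        }
        rot.rotate_left(1); // next rotation of b
    }
    mask
}
```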
Wow! You are such a beast! That's amazing
According to the benchmarks you sent, we are even with the C and x86_simd versions with std::simd. Does it mean that your algorithm is maybe easier to read but also compatible with more targets than just x86, i.e. ARM, wasm?
being better, meaning less runtime. In other words: you are correct, malloc/free is faster! I should have known better than to try to outsmart jemalloc.
Does the pairwise_and_assign implementation in this last benchmark simply call jemalloc now? Or does it use thread_local!?
[...] whether or not we want to add SIMD to the project, plus which set of SIMD primitives we should be coding against [...]
I think we, and more specifically you, found the best compromise: using std::simd as much as possible. This should bring more control to the codebase, i.e. easier to maintain/understand, and be compatible with more targets than raw x86 unsafe intrinsics. If you are able to find a way to use std::simd for most parts of all the algorithms, we can concede to implementing some functions by hand as you've done with cmpistrm; it seems to work great and is maintainable enough.
According to the benchmarks you sent, we are even with the C and x86_simd versions with std::simd. Does it mean that your algorithm is maybe easier to read but also compatible with more targets than just x86, i.e. ARM, wasm?
It's actually slightly slower than the x86 intrinsics. I think it's worth it, given it runs on x86, ARM, and WASM.
Does the pairwise_and_assign implementation in this last benchmark simply call jemalloc now? Or does it use thread_local!?
Yes. No thread local in this bench.
we can concede to implement some functions by hand as you've done with cmpistrm, it seems to work great and is maintainable enough.
Rough overview:
mod util contains some helpers that are implemented in terms of std::simd
mod shim contains some compatibility shims that can't be implemented in terms of std::simd (or at least, not yet)
What's fun with computers is the different results you can get when benchmarking on different versions of a CPU. Here is what I see when benchmarking on my iMac 2019, Intel Core i9, 8 cores, 3.6 GHz. I don't know why, but the C version doesn't seem to find the desired extensions to compile properly. I set RUSTFLAGS="-C target-cpu=native" before benchmarking. Also, note that it is a home computer and that there is a lot of noise on the CPU.
Yes. No thread local in this bench.
This is great! This is so good!
Rough overview:
mod util contains some helpers that are implemented in terms of std::simd
mod shim contains some compatibility shims that can't be implemented in terms of std::simd (or at least, not yet)
I looked at your code; it's indeed very good that with std::simd we will be able to use the same codebase for all of the targets. I tried out your branch on my MacBook Pro M1 (ARM), here are the results. I don't understand why the graphs are ugly, though. Performance is not as good as on x86, probably due to the places where NEON direct calls are missing, I guess.
@Kerollmops
My translation workflow goes something like this
TLDR: Be aware that in this branch, for any given commit, "cur" might be anything, as it calls the default op, which I might have pointed at any given implementation for testing.
I've not gated the x86 unsafe code with compile-time checks. Your compiler might emit any garbage. I hope it does not eat your lunch. :)
Super interesting that the rust beats C for scalar ARM. "opt_unsafe" is a pretty straightforward port of CRoaring scalar code.
My hypothesis: We have those two different implementations of intersect_skewed. Rust can statically guarantee to LLVM that the mutable ref will not alias the two immutable refs, which enables LLVM to do optimizations that it can't with C.
Performance is not as good as on x86, probably due to the places where NEON direct calls are missing, I guess.
That sounds about right. My high level plan
What I'm not at all sure of is how to integrate testing and CI. Qemu?
Also undecided on bounds checks. The classic safety/performance/maintainability trade-offs in contention.
My intuition is to leave bounds checks in and later we can profile and look at the call stack, removing from only the hottest of hot code paths. Or perhaps we could come up with ways to express the algs in terms of iterators and bury the unsafe code in them.
I suspect there are still some lower-hanging fruit to optimize without sacrificing safety. I'll open some more issues. :)
Also undecided on bounds checks. The classic safety/performance/maintainability trade-offs in contention.
My intuition is to leave bounds checks in and later we can profile and look at the call stack, removing from only the hottest of hot code paths. Or perhaps we could come up with ways to express the algs in terms of iterators and bury the unsafe code in them.
I think that we should indeed use as much safe code as possible. I myself have fallen into this trap of using too much unsafe and losing performance because of it. You can read this small story of implementing the slice_group_by methods with as much safe code as possible and gaining performance compared to a fully unsafe implementation!
Sometimes, even putting assert!s in the right places helps the compiler a lot!
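To illustrate the assert! point: a single up-front assert can teach LLVM a fact that lets it elide per-iteration bounds checks (a hypothetical helper, for illustration):

```rust
/// After the assert, the optimizer knows `len <= src.len()` and
/// `len <= dst.len()`, so the indexing in the loop below needs no
/// per-iteration bounds checks and the loop can even be vectorized.
fn copy_prefix(src: &[u16], dst: &mut [u16], len: usize) {
    assert!(len <= src.len() && len <= dst.len());
    for i in 0..len {
        dst[i] = src[i]; // provably in-bounds thanks to the assert
    }
}
```

The assert costs one branch at function entry instead of two per iteration, and keeps the code fully safe.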
What I'm not at all sure of is how to integrate testing and CI. Qemu?
I looked at how the RoaringBitmap/CRoaring library was testing this in their CI, and it seems like they test it on a Windows ARM machine. Is it a custom one? I didn't check further. Maybe @lemire can help us on this subject?
OK. Done with the initial impl + added proptests for xor. Going to run proptests on a generator that only creates array stores for a few hours.
I forgot to mention, I stripped out all the unsafe code around SIMD. (prior to test and bench)
It survived 4 hours of proptest without any failures or crashes.
Here is the full report of the op benchmarks. Note that unlike the Array-Array only benchmarks I shared earlier, this is for the full dataset and includes Array-Bitmap and Bitmap-Bitmap operations. It's surprisingly comparable to CRoaring in my environment. For reference I've got a Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz on a MacBook Pro. ROARING_ARCH=native RUSTFLAGS='-C target-cpu=native'
I'm going to clean it all up and open a PR in the next few days.
I just forgot about that, but the current CI only checks that the program works on the stable channel of Rust; however, you introduce the use of std::simd, which can only be used on the nightly channel with the portable_simd feature. We must add a new job to the CI and bors to test on the nightly channel.
I would like to be able to use roaring-rs on the stable channel like before. Would it be possible to introduce a new nightly, portable_simd, or simply simd feature in the Cargo.toml that will enable your work? This feature would simply enable the portable_simd std feature.
// At the top of the lib.rs
#![cfg_attr(feature = "simd", feature(portable_simd))]
It's surprisingly comparable to CRoaring in my environment.
Yup, that's impressive how hard it is now to see the performance difference between the C and Rust code! You've done such an impressive amount of work! Could you send me an email, please? I can't find a contact channel on your profile.
Yup, that's impressive how hard it is now to see the performance difference between the C and Rust code! You've done such an impressive amount of work! Could you send me an email, please? I can't find a contact channel on your profile.
Thanks. Sent you an email. =)
I made a few discoveries today. Good news first: std::simd now has a to_bitmask method. That means we don't need to have any fallbacks to intrinsics or scalar code!
Bad news: The generator I used for the proptests would generate some roaring bitmaps with disjoint container sets. I fixed that, but got some failures in sub and xor.
@Kerollmops Could you run tests and bench on your hardware on 00275309692b37e19153735c1c2768deffa2ad0c
Hey @saik0,
All the tests are passing on my M1. I didn't run the proptests though, but I can. I just don't remember how
Here are the benchmarks on my MacBook Pro M1, which I carefully plugged in to make sure performance is good. criterion on M1.zip
- No more SIMD fallback! Super curious to see perf on your M1.
Do you mean that no more functions are using scalar implementations?
- The x86 stuff should just no-op on other arch.
What do you mean by that? Do you mean that we can simply compile it on ARM and it will just ignore that part of the code and only compile the aarch64 part?
All the tests are passing on my M1. I didn't run the proptests though, but I can. I just don't remember how
They are part of the normal suite. You may wish to set the PROPTEST_CASES env var to some large number as a property-based quasi-fuzz-test.
Do you mean that no more functions are using scalar implementations?
I do! The jump table idea is a truly terrible one, but it does work. We will need to keep it until LLVM exposes a dynamic swizzle and std::simd updates to use it. Then I can atone for my sin.
Briefly: to call it, a stack frame must be pushed and popped inside the hot vectorized loop. Inlining it would mean huge instruction bloat. (On my HW it's about 5% faster if we force inline anyway.)
What do you mean by that? Do you mean that we can simply compile it on ARM and it will just ignore that part of the code and only compile the aarch64 part?
Yes.
The x86 stuff won't make it into a PR, but it is useful for comparing std::simd directly to using x86 intrinsics. There are too many other variables when comparing it to C.
Here are the benchmarks on my MacBook Pro M1, which I carefully plugged in to make sure performance is good. criterion on M1.zip
Wow, and and sub!
Not sure what's going on with or and xor that C's scalar code is so much faster. Interesting.
Wow, and and sub!
I am also impressed, especially because the kind of flamegraph that we are trying to optimize relies a lot on the bit_assign on array containers (not sure if those ops are array-array or array-bitmap), but also on the bitor_assign that is already a lot faster than the previous scalar one.
Not sure what's going on with or and xor that C's scalar code is so much faster. Interesting.
Do you mean that the C codebase hasn't been ported to ARM/aarch64? Do you think that different algorithms could be used to fit the ARM instruction extensions, and that that's the reason they didn't port it to this arch yet?
They are part of the normal suite. You may wish to set the PROPTEST_CASES env var to some large number as a property-based quasi-fuzz-test.
I just ran them with PROPTEST_CASES=1000 and everything works fine, thank you!
Do you mean that the C codebase hasn't been ported to ARM/aarch64? Do you think that different algorithms could be used to fit the ARM instruction extensions, and that that's the reason they didn't port it to this arch yet?
Yes, Arm has a very different suite of LUT-style operations. It doesn't have quite the same specialized string instructions (which can wind up being inferior to more "straightforward" SIMD implementations if used incorrectly, anyways).
As far as "use intrinsics or not" goes, though: there are conversions to and from the std::simd types to their std::arch equivalents, where applicable, e.g. i8x16 to __m128i. This is intentional, in that it allows people to use std::simd as far as it can take them and then, for a given architecture, use that architecture's std::arch intrinsics for further optimizations. Not that I want to discourage you from relying on what we do have! I am quite impressed with what you have managed here. Just noting that it is an option.
38 Comments later. @TheLudlows Yes. SIMD is introduced.
@Kerollmops This can be closed.
The work you have done on that side of the library is awesome! It brings the library to another level.
The most impressive part for me is the fact that you managed to use the new std::simd module. It will be so cool when it is stabilized: all of those platforms supported by doing nothing.
Thanks to @workingjubilee too for your help on this subject; hope this project helps you prioritize features for the std::simd module too.
Hello, will SIMD be introduced?