Full Compile-Time Processor Feature Detection

Sewer56 commented 2 months ago

Currently memchr does run-time feature detection for x86, storing 'preferred' functions inside a field that is later referenced in subsequent calls

// Inside `unsafe_ifunc!`

// store
FN.store(fun as Fn, Ordering::Relaxed);

// use
let fun = FN.load(Ordering::Relaxed);

This manifests itself as a mov and call in x86; and one or two other misc instructions along the way.

In order to avoid unnecessary code bloat and improve performance, it should be possible to bypass this mechanism if we know what is available at build time. This would allow for inlining, and slightly lesser call overhead.

That is of course, when the Rust compiler is invoked with the correct flags to ship a binary targeting a specific processor.

Note

This isn't a formal feature request, it's more so a reminder for myself. I want to reference this issue in my own codebase, and possibly PR this sometime in the future.

BurntSushi commented 2 months ago

I think if would be better to try and do this without feature flags. We can detect at compile time whether, for example, the avx2 target feature is enabled.

Sewer56 commented 2 months ago

I think if would be better to try and do this without feature flags.

I wrote this when I was tired (still am). Seems you're right, even easier to adjust, in this case.

Ambyjkl commented 2 months ago

I haven't examined the code here yet, but it seems just setting rustflags to -Ctarget-feature=+avx2 might be enough to skip the check, not sure https://github.com/BurntSushi/memchr/pull/127/commits/b4a4c6aabd796f768b68c89a5f5eb6f18f44d4a2

Sewer56 commented 2 months ago

I haven't examined the code here yet, but it seems just setting rustflags to -Ctarget-feature=+aes2 might be enough to skip the check, not sure b4a4c6a

https://godbolt.org/z/h7av5rx3n

extern crate memchr;
use memchr::memchr;

#[no_mangle]
fn memchr_1(data: &[u8]) -> usize {
    memchr(0, &data).unwrap()
}

I believe these mov operation correspond to

let fun = FN.load(Ordering::Relaxed);

The commit you linked does add compile-time checks, but it still stores a function pointer. The goal here is to call the target directly, since it can be determined at compile time in this case.

Ambyjkl commented 2 months ago

I see, seems the compiler doesn't automatically inline the call, sad

Ambyjkl commented 1 month ago

so I have an issue that ties into this one, and solving this issue can help solve that. The SearcherKind enum is a bit too fat, especially on an avx2 supported cpu, it has two copies of the sse2 Finder along with an avx2 Finder, making the whole thing 288 bytes on x86_64, and I would benefit from having that go down to reduce memory usage. The fix would be to expose specialized sse2/avx2/neon Finders that statically only have the fields they actually need, and that would also do away with the function pointer. @BurntSushi @Sewer56 lmk if this solution would work for you guys too

Sewer56 commented 1 month ago

Would need to dig a bit more into it, but it sounds interesting at the very least. I'm always down to reduce stack usage.

Ambyjkl commented 1 month ago

Also @BurntSushi I love your work and just started sponsoring you through my company so I hope you read this :)

BurntSushi commented 1 month ago

so I have an issue that ties into this one, and solving this issue can help solve that. The SearcherKind enum is a bit too fat, especially on an avx2 supported cpu, it has two copies of the sse2 Finder along with an avx2 Finder, making the whole thing 288 bytes on x86_64, and I would benefit from having that go down to reduce memory usage.

I really do think this is an entirely separate issue. I addressed it as a Discussion question here: https://github.com/BurntSushi/memchr/discussions/164

I love your work and just started sponsoring you through my company

Thank you! :-)

BurntSushi commented 1 month ago

@Sewer56 Since it seems like you are trying to micro-optimize here, I think there are two important questions to answer:

Have you come up with a benchmark where this matters?
Have you tried any of the lower level APIs in crate::arch that give you more direct access to the operations defined in memchr? For example, memchr::arch::x86_64::avx2::memchr::One. There's no function pointers there.

Ambyjkl commented 1 month ago

I really do think this is an entirely separate issue. I addressed it as a Discussion question here: #164

Ok that's just what i was looking for, thank you, and OP can also just do this directly as well

Sewer56 commented 1 month ago

Have you come up with a benchmark where this matters?

I'm porting an open source archive format I wrote during work hours to Rust in my own time, while giving it a small upgrade:

https://sewer56.dev/sewer56-archives-nx/Specification/Table-Of-Contents/#string-pool

Reading names from the pool usually involves reading strings, usually average 40-ish characters, before hitting null terminator. If you have 40 strings, repeat that NumFiles times.

I did recently write a benchmark around this; and while it is true that in my use case the actual memchr operation takes very little time compared to decompression of the pool itself; I nonetheless strive to make the code as good as possible. In my specific case, I have inlining to gain from this; as it would be the only use case of memchr when compiled into a standalone dynamic library.

How much the improvement actually amounts to, I'm not concerned. I'd guess something in the range of 0.1-0.5% and a hundred or two bytes of assembly code less. To me any improvement is an improvement.

Have you tried any of the lower level APIs in crate::arch that give you more direct access to the operations defined in memchr? For example, memchr::arch::x86_64::avx2::memchr::One. There's no function pointers there.

I could use those directly, but it's more about improving the whole ecosystem.

Why just use a workaround for myself when I could instead eventually contribute and make an improvement for everyone?

It'd also avoid a death by 1000 cuts situations. Code depends on other code; if you make the lowest level code more efficient, you make the system as a whole more efficient. That means there's a chance for others' stuff to run faster too, and everyone, myself included gets to benefit in tandem.

Ambyjkl commented 1 month ago

So, I can see two improvements here:

memchr exports stuff like NativeFinder and similar which at compile time branches on the cpu feature flags to reexport memchr::arch::x86_64::(avx2|sse2)::*, etc, that would give you theoretically max performance at the cost of portability: basically something like -march=native
we replace the function pointer with an enum, and we branch on the enum at the call site to call the correct version. I believe switching on enum is generally faster than dereferencing function pointers. This solution would still be dynamic dispatch, but just better

Ambyjkl commented 1 month ago

it looks like PrefilterKind is a union, and it could just be changed to be an enum to make it a tagged union, and that might be enough to pull this off, unless there is something i'm missing

Ambyjkl commented 1 month ago

ok just did it in the code, all tests are passing, now time to figure out how to run the benchmarks

Sewer56 commented 1 month ago

(Bear in mind we're still talking about a separate issue here)

My idea when I originally opened the issue was making unsafe_ifunc! just call find_avx2/find_sse2/find_fallback directly if the method is resolveable at compile time. So it would just translate into a direct call, no branches, not anything else.

I may be able to pick this up not long from now, but the main question really is how to do it in a way that doesn't make it harder to read.

Essentially you want a block like this: https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/x86_64/memchr.rs#L114-L137

At the beginning of unsafe_ifunc (or in a similar place); that would then just call the implementation (e.g. find_avx2/find_sse2/find_fallback) directly. The issue is we don't know (yet) if the user is specifically building for a specific arch or not.

For example, if someone is building for a non-AVX2 processor for example, and they do have std enabled (default), the current code would call std::is_x86_feature_detected!("avx2") which is not compile time [line 101],

https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/x86_64/avx2/memchr.rs#L86-L109

even if someone is explicitly building for a target without avx2.

As an extra note. I'm also wondering if there's a way to just make it better out of the box if the user sets a specific target while running cargo/rustc. Because if you can determine if the user is building for a specific target-cpu (i.e. they set it via CLI), then that may be a possible hint to skip dynamic feature detection (I suggested a feature for this in OP originally in case a user would want to override and opt-into/out-out of feature detection in a scenario like this).

And if you can do that, you can then call whichever implementation you need directly; skipping the whole load/store mechanism.

Sewer56 commented 1 month ago

Bear in mind the last paragraph in previous post is an extra question/discussion point of sorts that just came to mind, as it's semi-relevant.

The actual issue I opened is that there isn't an automatic way to call the required implementation directly; even if you build for a specific arch by using -C target-cpu AND no_std.

This matters if you, for example want to use an iterator with Memchr::new(0, &data). In this case you don't manually call the One functions yourself. The library calls the functions for you, passing via unsafe_ifunc in the process.

I think an extra block of code for no_std at the start of unsafe_ifunc! might do the job for now. Since we're not doing feature detection on no_std, we can just choose there and then which to call, at compile time.

BurntSushi commented 1 month ago

@Sewer56

I did recently write a benchmark around this; and while it is true that in my use case the actual memchr operation takes very little time compared to decompression of the pool itself; I nonetheless strive to make the code as good as possible. In my specific case, I have inlining to gain from this; as it would be the only use case of memchr when compiled into a standalone dynamic library.

How much the improvement actually amounts to, I'm not concerned. I'd guess something in the range of 0.1-0.5% and a hundred or two bytes of assembly code less. To me any improvement is an improvement.

I am overall very skeptical of doing improvements just for improvement sake, especially if they cost something.

For this in particular, I'm willing to suffer a little implementation complexity internally to make this work. So if y'all can come up with something where if the relevant target features are enabled at compile time and inlining is possible, and it isn't too crazy, I might be open to that.

But based on what you've said, I don't see a good justification for any new APIs. Otherwise, I think you're on the right track in terms of how to do this internally and automatically.

I could use those directly, but it's more about improving the whole ecosystem.

Why just use a workaround for myself when I could instead eventually contribute and make an improvement for everyone?

It'd also avoid a death by 1000 cuts situations. Code depends on other code; if you make the lowest level code more efficient, you make the system as a whole more efficient. That means there's a chance for others' stuff to run faster too, and everyone, myself included gets to benefit in tandem.

But that's why those lower level APIs are there! They are there precisely for cases like this where the higher level implementation makes choices that are undesirable for more specific or niche use cases. The bottom line is that inlining a huge AVX2 substring search routine (it's a lot of code) is usually not the right thing to do and usually not going to improve overall performance. Even you admit that, for your use case, it won't really help move the needle on overall perf in any meaningful way. That's what I care about.

@Ambyjkl

we replace the function pointer with an enum, and we branch on the enum at the call site to call the correct version. I believe switching on enum is generally faster than dereferencing function pointers. This solution would still be dynamic dispatch, but just better

It's not. Or at least, I've seen it go either way enough times that I don't think you can really make the claim that dynamic dispatch is or isn't generally slower than using an enum.

In this specific case, I did absolutely use an enum at first. That's the obvious choice. But I found that using dynamic dispatch lead to tighter codegen and overall improved performance. Since memchr supports core-only, the dynamic dispatch can't be implemented with trait objects.

I've re-litigated this experiment in regex and aho-corasick too. Both of them use dynamic dispatch internally instead of enums. That is a result of having used enums before. However, they get to use trait objects instead of a union since they both require alloc.

I'm always happy to have the choice re-litigated and would switch given evidence that a simple enum is better. But it's not like I just didn't try it. It's the obvious choice here because it doesn't require any unsafe.

Sewer56 commented 1 month ago

I am irrelevant. I don't matter. If this was only about myself, I'd use the low level API and go with that. But I don't want to make things better for just myself, I would rather also make things better for other people if I can.

I don't see a good justification for any new APIs

No API requests here were made. I'm pretty sure I can make this work with a small change to unsafe_ifunc; just a matter of seeing how clean I could make this.

The bottom line is that inlining a huge AVX2 substring search routine (it's a lot of code) is usually not the right thing to do and usually not going to improve overall performance.

You are correct, however even if the code is not inlined, it would still be an improvement. The improvement (based on screenshot] would be saving 2 instructions:

mov rax, qword ptr [address]
mov rdx, rax.

Because no load is required from a static. So we can use call with an address directly. Currently, having to load a pointer, causes a CPU pipeline stall that could otherwise be avoided.

Whether the code will actually be inlined can be left to the compiler. Compiler will decide based on metrics such as amount of calls to the final target. Most big programs, it may not be if it's used in a lot of places, but if an application or library only uses memchr in one place, it will be profitable to inline.

BurntSushi commented 1 month ago

No API requests here were made.

Yeah sorry, I guess I was responding to this comment. I know I directed it at you, but that comment wasn't by you, so apologies.

I am irrelevant. I don't matter. If this was only about myself, I'd use the low level API and go with that. But I don't want to make things better for just myself, I would rather also make things better for other people if I can.

Again, this isn't really how I operate. I try hard not to do things for the sake of doing them. I would much rather concrete, real world and specific motivations for improvements or changes.

You are correct, however even if the code is not inlined, it would still be an improvement. The improvement (based on screenshot] would be saving 2 instructions:
* `mov rax, qword ptr [address]`

* `mov rdx, rax`.
Because no load is required from a static. So we can use call with an address directly. Currently, having to load a pointer, causes a CPU pipeline stall that could otherwise be avoided.

I generally don't base improvements on instruction count reductions.

I think we have different philosophies here and I think it's probably not productive to keep going back-and-forth on that. It seems like we are both aligned on being open to a small internal change that automatically avoids the indirect function call when it can.

Sewer56 commented 1 month ago

cargo.toml

memchr = { path = "memchr", features = ["std"] }

cargo bench

unpack_string_pool_4000_V0
                        time:   [36.165 µs 36.499 µs 36.908 µs]
                        change: [+32.095% +33.633% +35.025%] (p = 0.00 < 0.05)
                        Performance has regressed.

[unpack_string_pool_4000_V0] Unpacked size: 200343 bytes

RUSTFLAGS="-C target-cpu=x86-64-v3" cargo bench:

[create_string_pool_4000_V0] Packed size: 200343 bytes
unpack_string_pool_4000_V0
                        time:   [27.400 µs 27.523 µs 27.663 µs]
                        change: [-2.8856% -2.2842% -1.6698%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

Above command, with patch and no_std:

[create_string_pool_4000_V0] Packed size: 200343 bytes
unpack_string_pool_4000_V0
                        time:   [26.712 µs 26.769 µs 26.847 µs]
                        change: [-2.3623% -2.1856% -1.9750%] (p = 0.00 < 0.05)
                        Performance has improved.

Massive, microsecond improvement; especially since memchr here is around half the load. The raw improvement is around 4%, if we exclude other code; since this is a bench of a real workload.

Example change:

https://github.com/BurntSushi/memchr/pull/165

I put it as a PR for easy reading. I want to show in that PR what I've had in mind.

For the bench, I took the first 4000 file paths from Yakuza Kiwami (starting at game root), put null terminators on them and used that as the test data.

I then used the memchr iterator to search for the terminators, and copied the results into a separate buffer without the terminators; which is a real use case in my library. More specifically, the tested code is this method (https://github.com/Sewer56/sewer56-archives-nx/blob/62f061bcc3bc4a4e31a87fada9294a9df29ef0f3/src/headers/parser/string_pool.rs#L399), with no compression.

Ambyjkl commented 1 month ago

It's not. Or at least, I've seen it go either way enough times that I don't think you can really make the claim that dynamic dispatch is or isn't generally slower than using an enum.

That's interesting. Like @Sewer56 mentioned, function pointer derefs cause pipeline stalls, and well so does branching with a match, but there is speculative execution, and if the compiler can statically prove that a function pointer almost always points to the same function, or the same with one of the branches (which is what I originally thought might be happening in this comment https://github.com/BurntSushi/memchr/issues/160#issuecomment-2342408040), then performance could be faster. This kind of optimization is often easier for the compiler on switch/match than with function pointers since a function pointer is allowed to point to anything, while with switch/match the options are bounded won't need as much control-flow analysis. Here is some literature about this https://www.msreverseengineering.com/blog/2020/5/7/a-compiler-optimization-involving-speculative-execution-of-function-pointers

It's possible neither the function pointer nor the enum + match got optimized back when you tested then, and maybe things are different now, would need to benchmark (and btw, cargo bench doesn't do anything for me, it quits immediately, tips on that would be appreciated). I also found this Rust RFC https://github.com/rust-lang/rust/issues/26179 which unfortunately seems to be stuck in perma-limbo, it would've allowed telling the compiler which branch is the most likely one on a particular architecture.

Let me know about how to run cargo bench properly, also I might compare the generated code to see what is actually produced with either case to know what's going actually on, and if I do I'll share that.

BurntSushi commented 1 month ago

Trust me. I'm aware of all of that. :-) And yes, I even tried using likely intrInsics too.

I am always happy to have folks relitigate perf though.

Benchmarks are done with rebar. Not cargo bench. I'm on mobile so I can't give any more instruction at the moment.

Ambyjkl commented 1 month ago

Trust me. I'm aware of all of that. :-) And yes, I even tried using likely intrInsics too.

Well you have clearly done this a lot longer than I, haha. And thanks now I can actually run benchmarks

BurntSushi / memchr

Full Compile-Time Processor Feature Detection #160

Note