Closed: AdamNiederer closed this issue 3 years ago.
Procedurally, I have two things, quoted and answered below. In terms of code, I would like to understand why these routines aren't being auto-vectorized. Is there a way to convince the compiler to vectorize on its own?
I think the compiler is trying to use movbe in this situation above all other options. Even without copy_nonoverlapping, I can't get it to vectorize: https://godbolt.org/g/pBVryA
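For readers without the linked Compiler Explorer output handy, here is a minimal sketch of the kind of per-element loop being discussed; the function name and exact shape are made up for illustration and are not byteorder's actual source.

fn write_u16_into_be(src: &[u16], dst: &mut [u8]) {
    assert!(dst.len() >= src.len() * 2);
    for (i, &n) in src.iter().enumerate() {
        // On a little-endian target, the byte swap plus 2-byte store below is
        // what the compiler reportedly lowers to a per-element movbe rather
        // than a vectorized loop.
        let bytes = n.to_be_bytes();
        dst[i * 2] = bytes[0];
        dst[i * 2 + 1] = bytes[1];
    }
}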
I'd like the dust to settle a little more on explicit SIMD before introducing it into byteorder.
Understandable; stdsimd just agreed to break its entire API a few days ago.
I will not add any dependencies that introduce copyleft as a matter of principle.
Unfortunately, I'm not willing to license any of my projects under permissive licenses. Perhaps we could look at this using "raw" explicit SIMD once the state of stdsimd improves (and if llvm still can't autovectorize it by then).
@BurntSushi I'm debating whether to open a new issue or just comment here, as it's kind of related. I benchmarked/profiled my Rust encoding library's use of byteorder, looked at the source code, and noticed that most writes (at least on x86 / my MacBook Pro) get translated into many calls to _platform_memmove$VARIANT$Haswell or similar. I could get a 3x speedup just by using ptr::write_unaligned. So I'm wondering: why not make that simple improvement available to users? I imagine reads using read_unaligned offer similar speedups.
I know unaligned reads/writes aren't safe on all platforms, but on x86 they are safe and really fast, so why not offer that for users of that platform?
(NOTE: the 3x speedup comes from using these two functions instead of write_uint::<LittleEndian>):
#[inline]
fn direct_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    // Always make room for a full 8-byte store, even if only `numbytes` of it
    // ends up counted in the Vec's length.
    out_buffer.reserve(8);
    unsafe {
        // We have checked the capacity so this is OK
        unsafe_write_uint_le(out_buffer, value, numbytes);
    }
}

#[inline(always)]
unsafe fn unsafe_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    let cur_len = out_buffer.len();
    // Unaligned 8-byte little-endian store just past the current end of the Vec.
    let ptr = out_buffer.as_mut_ptr().offset(cur_len as isize) as *mut u64;
    std::ptr::write_unaligned(ptr, value.to_le());
    // Only the low `numbytes` bytes become part of the Vec's contents.
    out_buffer.set_len(cur_len + numbytes);
}
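The read side mentioned above would look similar. Here is a sketch under the same "at least 8 readable bytes" assumption; the name direct_read_uint_le and its signature are made up for illustration and are not part of byteorder's API.

#[inline]
fn direct_read_uint_le(buf: &[u8], numbytes: usize) -> u64 {
    // Like the write path, this always touches 8 bytes, so the buffer must
    // have at least 8 readable bytes; only `numbytes` (1..=8) of them
    // contribute to the result.
    assert!(buf.len() >= 8 && (1..=8).contains(&numbytes));
    let raw = unsafe {
        // Unaligned 8-byte load; cheap on x86, the platform the speedup
        // above was measured on.
        std::ptr::read_unaligned(buf.as_ptr() as *const u64)
    };
    // Mask off the bytes beyond `numbytes` from the little-endian value.
    let shift = (8 - numbytes) * 8;
    (u64::from_le(raw) << shift) >> shift
}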
Your framing is a bit weird. You ask "why not" as if I've already made a decision to the contrary. I would definitely like to understand whether this is possible to do in safe code first. But I'm not in principle against speeding up these routines by using unaligned reads (or whatever is necessary). I would, for example, like to know why memmove isn't doing that for us already.
@BurntSushi Sorry, I didn't mean to suggest a decision was made, other than I assumed someone must have looked into this already. I don't have enough experience with Rust to dig into why memmove isn't doing that, other than that intuitively a memmove seems like a more expensive operation than a primitive write.
Perhaps this is an OSX issue, I'm not sure... where the call doesn't get inlined on OSX but does on Linux or something.
Also, regarding use of unsafe. Doesn't the underlying memmove rely on some unsafe macros already?
Everything boils down to unsafe eventually. That does not mean we should add more uses of unsafe.
I'm not drawing a line in the sand that says, "no, we must not use unsafe." I'm simply saying that we should understand what we're doing, why the current method is sub-optimal and how to improve it. If improvements require unsafe, then that's OK, so long as we can justify it.
(I have not looked into this at all.)
👍 I'm happy to help investigate, but I'm not sure where to start. I'd also be interested in less use of unsafe in my own code. :)
This is what I'd do: use perf (Linux only; other OSes will need different profiling tools) to look at the generated assembly. These functions should be small enough that I'd actually try to work through it. If it's calling memmove, then I might even go look at my platform's implementation of that (or just read the assembly, hopefully from perf). Currently, we call ptr::copy_nonoverlapping when the endianness matches, which is indeed supposed to be just a memmove (in the most general case). When the endianness doesn't match (which is what I think this issue was originally opened about), we just use a straightforward loop. So you'll need to pick which case you're actually trying to optimize here. It may also be worth writing small test programs that use unaligned writes versus copy_nonoverlapping to compare their performance characteristics.
Hopefully that helps a bit. It's just meant to be a rough sketch of what I personally would do if I were to work on this. It leaves out a lot of details, but that takes a lot more work to write up. If you haven't profiled code before, there should be some Rust-specific articles out there that will help.
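As one concrete starting point for that last suggestion, a rough (and deliberately unsophisticated) timing comparison could look like the sketch below. The numbers it prints are only indicative; a real investigation should use a proper benchmark harness plus perf.

use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;
    let mut buf = vec![0u8; 64];

    let start = Instant::now();
    for i in 0..N {
        // Vary the offset so the loop can't be collapsed to its last iteration.
        let off = (i as usize) % 32;
        unsafe {
            std::ptr::write_unaligned(buf.as_mut_ptr().add(off) as *mut u64, i.to_le());
        }
    }
    let checksum: u64 = buf.iter().map(|&b| b as u64).sum();
    println!("write_unaligned:     {:?} (checksum {})", start.elapsed(), checksum);

    let start = Instant::now();
    for i in 0..N {
        let off = (i as usize) % 32;
        let bytes = i.to_le_bytes();
        unsafe {
            // The same store expressed as an 8-byte copy, i.e. roughly what a
            // memmove of known size boils down to.
            std::ptr::copy_nonoverlapping(bytes.as_ptr(), buf.as_mut_ptr().add(off), 8);
        }
    }
    let checksum: u64 = buf.iter().map(|&b| b as u64).sum();
    println!("copy_nonoverlapping: {:?} (checksum {})", start.elapsed(), checksum);
}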
@velvia I'm now taking a second look at your initial comment here, and I'm now thinking that it's completely unrelated to this specific issue. This issue is about writing slices, e.g., write_u16_into, and not the normal single-value methods like write_uint or write_u64.
The single-value write methods use copy_nonoverlapping, which should translate to efficient unaligned reads/writes because we're handing it fixed sizes that the compiler should recognize. Indeed, this is exactly how std::ptr::write_unaligned is itself implemented. The only reason byteorder doesn't use write_unaligned is because it requires Rust 1.17 or newer, and I'm very conservative about bumping the MSRV of a crate like byteorder (let's please not litigate that in this issue). So it seems the task now is to understand why byteorder isn't generating the same code that write_unaligned is. I'd follow roughly the same process here as I outlined in my previous comment.
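For reference, the standard library's write_unaligned has been defined as a fixed-size copy_nonoverlapping plus mem::forget, roughly as in the simplified sketch below (not the verbatim std source), which is why the two approaches should in principle compile to the same unaligned store.

use std::{mem, ptr};

// Roughly how std::ptr::write_unaligned is defined: a fixed-size
// copy_nonoverlapping of size_of::<T>() bytes, which LLVM turns into a
// single unaligned store for small T.
unsafe fn write_unaligned_sketch<T>(dst: *mut T, src: T) {
    ptr::copy_nonoverlapping(&src as *const T as *const u8, dst as *mut u8, mem::size_of::<T>());
    // Don't run src's destructor; its bytes now live at dst.
    mem::forget(src);
}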
@BurntSushi I'll open a separate issue on this topic for write_u64 itself; I now have code in various branches and a benchmark that can be played with. Unfortunately I can't run Linux easily, but it can be a base for investigation.
Closing due to inactivity.
Performance improvements are in general welcome. However, if they require unsafe code, it might be a good idea to open a new issue and discuss the technique first if you're worried about it not being merged. I'm not against using unsafe code per se; I just want to make sure it is well justified.
Hi there,
I've been toying around with adding faster to a few encoding libraries, and I noticed that I could get up to a 6x speed boost by using it in write_u16_into, write_u32_into, and write_u64_into. The compiler does a pretty good job of vectorizing the read functions. Would there be any interest in adding this behind a feature?
Benchmarks: (Ivy Bridge host; 128-bit integer vectors)