BurntSushi / byteorder

Rust library for reading/writing numbers in big-endian and little-endian.
The Unlicense

Speed up slice writes #116

Closed AdamNiederer closed 3 years ago

AdamNiederer commented 6 years ago

Hi there,

I've been toying around with adding the faster SIMD crate to a few encoding libraries, and I noticed that I could get up to a 6x speed boost by using it in write_u16_into, write_u32_into, and write_u64_into. The compiler already does a pretty good job of vectorizing the read functions.

Would there be any interest in adding this behind a feature?

Benchmarks: (Ivy Bridge host; 128-bit integer vectors)

faster (No difference between target-cpu=native and target-cpu=x86-64)
test slice_u16::write_big_endian    ... bench:      23,344 ns/iter (+/- 122) = 8567 MB/s
test slice_u32::write_big_endian    ... bench:      46,681 ns/iter (+/- 160) = 8568 MB/s
test slice_u64::write_big_endian    ... bench:     105,206 ns/iter (+/- 369) = 7604 MB/s
master (-C target-cpu=native)
test slice_u16::write_big_endian    ... bench:     147,829 ns/iter (+/- 269) = 1352 MB/s
test slice_u32::write_big_endian    ... bench:     112,241 ns/iter (+/- 652) = 3563 MB/s
test slice_u64::write_big_endian    ... bench:     108,404 ns/iter (+/- 571) = 7379 MB/s
BurntSushi commented 6 years ago

In terms of code, I would like to understand why these routines aren't being auto vectorized. Is there a way to convince the compiler to vectorize on its own?

Procedurally, I have two things:

  1. I'd like the dust to settle a little more on explicit SIMD before introducing it into byteorder.
  2. I will not add any dependencies that introduce copyleft as a matter of principle.

AdamNiederer commented 6 years ago

> In terms of code, I would like to understand why these routines aren't being auto vectorized. Is there a way to convince the compiler to vectorize on its own?

I think the compiler prefers movbe over all other options in this situation. Even without copy_nonoverlapping, I can't get it to vectorize: https://godbolt.org/g/pBVryA
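
For reference, the kind of loop in question looks roughly like the sketch below. This is illustrative only (the function name is made up and it is not byteorder's actual source); it's the shape of code you can paste into the Compiler Explorer to see whether LLVM vectorizes it or emits per-element byte swaps:

pub fn write_u16_into_be(src: &[u16], dst: &mut [u8]) {
    // Each u16 becomes two big-endian bytes in the output buffer.
    assert_eq!(dst.len(), src.len() * 2);
    for (chunk, &n) in dst.chunks_exact_mut(2).zip(src) {
        chunk.copy_from_slice(&n.to_be_bytes());
    }
}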

> I'd like the dust to settle a little more on explicit SIMD before introducing it into byteorder.

Understandable; stdsimd just agreed to break its entire API a few days ago.

> I will not add any dependencies that introduce copyleft as a matter of principle.

Unfortunately, I'm not willing to license any of my projects under permissive licenses. Perhaps we could look at this using "raw" explicit SIMD once the state of stdsimd improves (and if LLVM still can't autovectorize it by then).

velvia commented 5 years ago

@BurntSushi I'm debating whether to open a new issue or just comment here, as it's kinda related. I benchmarked/profiled byteorder in my Rust encoding library and also looked at the source code, and noticed that most writes (at least on x86 / my MacBook Pro) get translated into many calls to _platform_memmove$VARIANT$Haswell or similar. I could get a 3x speedup just by using ptr::write_unaligned. So I'm wondering: why not make that simple improvement available to users? I imagine reads using read_unaligned offer similar speedups.

I know unaligned reads/writes aren't safe on all platforms, but since they're safe and really fast on x86, why not offer that to users of that platform?

(NOTE: the 3x speedup comes from using these two functions instead of write_uint::<LittleEndian>):

#[inline]
fn direct_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    // Reserve room for a full 8-byte store even though only `numbytes` bytes
    // will be committed below.
    out_buffer.reserve(8);
    unsafe {
        // We have reserved the capacity above, so this is OK.
        unsafe_write_uint_le(out_buffer, value, numbytes);
    }
}

#[inline(always)]
unsafe fn unsafe_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    let cur_len = out_buffer.len();
    // Write all 8 little-endian bytes into the spare capacity past the end...
    let ptr = out_buffer.as_mut_ptr().offset(cur_len as isize) as *mut u64;
    std::ptr::write_unaligned(ptr, value.to_le());
    // ...then only commit the low `numbytes` of them.
    out_buffer.set_len(cur_len + numbytes);
}
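
For comparison, a hypothetical call site exercising the two paths might look like the sketch below (write_uint is byteorder's existing WriteBytesExt method; the loop and buffer are made up for illustration, not taken from the benchmark):

use byteorder::{LittleEndian, WriteBytesExt};

fn encode_all(values: &[u64], out: &mut Vec<u8>) -> std::io::Result<()> {
    for &v in values {
        // The generic byteorder path, which goes through io::Write:
        out.write_uint::<LittleEndian>(v, 8)?;
        // The direct path measured above would be:
        // direct_write_uint_le(out, v, 8);
    }
    Ok(())
}
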
BurntSushi commented 5 years ago

Your framing is a bit weird. You ask "why not" as if I've already made a decision to the contrary. I would definitely like to understand whether this is possible to do in safe code first. But I'm not in principle against speeding up these routines by using unaligned reads (or whatever is necessary). I would, for example, like to know why memmove isn't doing that for us already.

velvia commented 5 years ago

@BurntSushi sorry, didn't mean to suggest a decision was made, other than I assumed someone must have looked into this already. I don't have enough experience with Rust to know how to investigate why memmove isn't doing that, other than that intuitively memmove seems like a more expensive operation than a primitive write.

Perhaps this is an OSX issue where it doesn't get inlined on OSX but does on Linux or something; I'm not sure.

Also, regarding use of unsafe. Doesn't the underlying memmove rely on some unsafe macros already?

BurntSushi commented 5 years ago

> Also, regarding use of unsafe. Doesn't the underlying memmove rely on some unsafe macros already?

Everything boils down to unsafe eventually. This does not mean we should add more uses of unsafe.

I'm not drawing a line in the sand that says, "no, we must not use unsafe." I'm simply saying that we should understand what we're doing, why the current method is sub-optimal and how to improve it. If improvements require unsafe, then that's OK, so long as we can justify it.

BurntSushi commented 5 years ago

(I have not looked into this at all.)

velvia commented 5 years ago

> I'm not drawing a line in the sand that says, "no, we must not use unsafe." I'm simply saying that we should understand what we're doing, why the current method is sub-optimal and how to improve it. If improvements require unsafe, then that's OK, so long as we can justify it.

👍 I'm happy to help investigate but not sure where to start. I'd also be interested in less use of unsafe in my own code. :)

BurntSushi commented 5 years ago

This is what I'd do:

  1. Set up a micro-benchmark that I think roughly gauges a real work-load (a rough sketch of such a benchmark appears below this comment).
  2. Use perf (Linux only; other OSes will need different profiling tools) to look at the generated assembly. These functions should be small enough that I'd actually try to work through it. If it's calling memmove, then I might even go look at my platform's implementation of that (or just read the Assembly, hopefully from perf). Currently, we call ptr::copy_nonoverlapping when the endianness matches, which is indeed supposed to just be a memmove (in the most general case). When the endianness doesn't match---which is what I think this issue was originally opened with---then we just use a straight-forward loop. So you'll need to pick which case you're actually trying to optimize here.
  3. Fiddle with the existing implementation(s) to see how the generated Assembly changes, if at all. Is it possible to make it go faster? Examples here might include explicit loop unrolling or hoisting the asserts outside of the loop (for the case where endianness does not match). It's not clear how much fiddling can be done in the case where the endianness does match, but it might be useful to try a naive loop instead of copy_nonoverlapping to compare their performance characteristics.
  4. Form a hypothesis about what might improve things. Try it out. e.g., Try your unaligned read/write strategy. Compare its benchmark performance with the status quo, and also compare the generated Assembly. What changed? Do the changes to the code seem consistent with observed performance differences, if any?
  5. If the result is a suggested change, try to summarize the above process with data and submit a PR.

Hopefully that helps a bit. It's just meant to be a rough sketch of what I personally would do if I were to work on this. It leaves out a lot of details, but that takes a lot more work to write up. If you haven't profiled code before, there should be some Rust-specific articles out there that will help.
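
As a concrete starting point for step 1, here is a minimal sketch of such a micro-benchmark, assuming the nightly libtest harness (the same harness behind the ns/iter and MB/s numbers quoted earlier in this thread); the buffer size is an arbitrary choice:

#![feature(test)]
extern crate test;

use byteorder::{BigEndian, ByteOrder};
use test::Bencher;

#[bench]
fn slice_u16_write_big_endian(b: &mut Bencher) {
    // Arbitrary workload: 100k u16 values written out as big-endian bytes.
    let src = vec![0x1234u16; 100_000];
    let mut dst = vec![0u8; src.len() * 2];
    b.bytes = dst.len() as u64; // lets libtest report MB/s
    b.iter(|| {
        BigEndian::write_u16_into(&src, &mut dst);
        test::black_box(&dst);
    });
}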

BurntSushi commented 5 years ago

@velvia I'm taking a second look at your initial comment here, and I now think it's completely unrelated to this specific issue. This issue is about writing slices, e.g., write_u16_into, and not the normal single-value methods like write_uint or write_u64.

The single-value write methods use copy_nonoverlapping, which should translate to efficient unaligned reads/writes because we're handing it fixed sizes that the compiler should recognize. Indeed, this is exactly how std::ptr::write_unaligned is itself implemented. The only reason byteorder doesn't use write_unaligned is that it requires Rust 1.17 or newer, and I'm very conservative about bumping the MSRV of a crate like byteorder (let's please not litigate that in this issue). So it seems the task now is to understand why byteorder isn't generating the same code that write_unaligned does. I'd follow roughly the same process here as I outlined in my previous comment.
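
To make that concrete, the pattern being described looks roughly like the sketch below (the function name is made up and this is not byteorder's exact source): a copy_nonoverlapping with a length known at compile time, which the compiler can lower to a single unaligned store, i.e. the same thing ptr::write_unaligned compiles down to.

fn write_u64_le(buf: &mut [u8], n: u64) {
    assert!(buf.len() >= 8);
    let n = n.to_le();
    unsafe {
        // The length is a compile-time constant, so this should become one
        // unaligned 8-byte store rather than a call out to memmove/memcpy.
        std::ptr::copy_nonoverlapping(
            &n as *const u64 as *const u8,
            buf.as_mut_ptr(),
            8,
        );
    }
}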

velvia commented 5 years ago

@BurntSushi I'll open a separate issue on this topic for write_u64 itself; I now have code in various branches and a benchmark that can be played with. Unfortunately I can't run Linux easily, but it can serve as a base for investigation.

BurntSushi commented 3 years ago

Closing due to inactivity.

Performance improvements are in general welcome. However, if they require unsafe code, it might be a good idea to open a new issue and discuss the technique first if you're worried about it not being merged. I'm not at all against using unsafe code; I just want to make sure it is well justified.