BurntSushi / byteorder

Rust library for reading/writing numbers in big-endian and little-endian.
The Unlicense

Speed up slice writes #116

Closed AdamNiederer closed 3 years ago

AdamNiederer commented 6 years ago

Hi there,

I've been toying around with adding the faster SIMD crate to a few encoding libraries, and I noticed that I could get up to a 6x speed boost by using it in write_u16_into, write_u32_into, and write_u64_into. The compiler already does a pretty good job of vectorizing the read functions.

Would there be any interest in adding this behind a feature?

Benchmarks: (Ivy Bridge host; 128-bit integer vectors)

faster (No difference between target-cpu=native and target-cpu=x86-64)
test slice_u16::write_big_endian    ... bench:      23,344 ns/iter (+/- 122) = 8567 MB/s
test slice_u32::write_big_endian    ... bench:      46,681 ns/iter (+/- 160) = 8568 MB/s
test slice_u64::write_big_endian    ... bench:     105,206 ns/iter (+/- 369) = 7604 MB/s
master (-C target-cpu=native)
test slice_u16::write_big_endian    ... bench:     147,829 ns/iter (+/- 269) = 1352 MB/s
test slice_u32::write_big_endian    ... bench:     112,241 ns/iter (+/- 652) = 3563 MB/s
test slice_u64::write_big_endian    ... bench:     108,404 ns/iter (+/- 571) = 7379 MB/s
BurntSushi commented 6 years ago

In terms of code, I would like to understand why these routines aren't being auto vectorized. Is there a way to convince the compiler to vectorize on its own?

Procedurally, I have two things:

  1. I'd like the dust to settle a little more on explicit SIMD before introducing it into byteorder.
  2. I will not add any dependencies that introduce copyleft as a matter of principle.

AdamNiederer commented 6 years ago

> In terms of code, I would like to understand why these routines aren't being auto vectorized. Is there a way to convince the compiler to vectorize on its own?

I think the compiler prefers movbe over all other options in this situation. Even without copy_nonoverlapping, I can't get it to vectorize: https://godbolt.org/g/pBVryA
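
For reference, the kind of loop in question looks roughly like the sketch below. This is illustrative only (the function name is made up and it is not byteorder's actual source); it's the shape of code you can paste into the Compiler Explorer to see whether LLVM vectorizes it or emits per-element byte swaps:

pub fn write_u16_into_be(src: &[u16], dst: &mut [u8]) {
    // Each u16 becomes two big-endian bytes in the output buffer.
    assert_eq!(dst.len(), src.len() * 2);
    for (chunk, &n) in dst.chunks_exact_mut(2).zip(src) {
        chunk.copy_from_slice(&n.to_be_bytes());
    }
}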

> I'd like the dust to settle a little more on explicit SIMD before introducing it into byteorder.

Understandable; stdsimd just agreed to break its entire API a few days ago.

> I will not add any dependencies that introduce copyleft as a matter of principle.

Unfortunately, I'm not willing to license any of my projects under permissive licenses. Perhaps we could look at this using "raw" explicit SIMD once the state of stdsimd improves (and if LLVM still can't autovectorize it by then).

velvia commented 5 years ago

@BurntSushi I'm debating whether to open a new issue or just comment here, as it's kinda related. I benchmarked/profiled byteorder in my Rust encoding library and also looked at the source code, and noticed that most writes (at least on x86 / my MacBook Pro) get translated into many calls to _platform_memmove$VARIANT$Haswell or similar. I could get a 3x speedup just by using ptr::write_unaligned. So I'm wondering: why not make that simple improvement available to users? I imagine reads using read_unaligned offer similar speedups.

I know unaligned reads/writes aren't safe on all platforms, but since they're safe and really fast on x86, why not offer that to users of that platform?

(NOTE: the 3x speedup comes from using these two functions instead of write_uint::<LittleEndian>):

#[inline]
fn direct_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    // Reserve room for a full 8-byte store even though only `numbytes` bytes
    // will be committed below.
    out_buffer.reserve(8);
    unsafe {
        // We have reserved the capacity above, so this is OK.
        unsafe_write_uint_le(out_buffer, value, numbytes);
    }
}

#[inline(always)]
unsafe fn unsafe_write_uint_le(out_buffer: &mut Vec<u8>, value: u64, numbytes: usize) {
    let cur_len = out_buffer.len();
    // Write all 8 little-endian bytes into the spare capacity past the end...
    let ptr = out_buffer.as_mut_ptr().offset(cur_len as isize) as *mut u64;
    std::ptr::write_unaligned(ptr, value.to_le());
    // ...then only commit the low `numbytes` of them.
    out_buffer.set_len(cur_len + numbytes);
}
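
For comparison, a hypothetical call site exercising the two paths might look like the sketch below (write_uint is byteorder's existing WriteBytesExt method; the loop and buffer are made up for illustration, not taken from the benchmark):

use byteorder::{LittleEndian, WriteBytesExt};

fn encode_all(values: &[u64], out: &mut Vec<u8>) -> std::io::Result<()> {
    for &v in values {
        // The generic byteorder path, which goes through io::Write:
        out.write_uint::<LittleEndian>(v, 8)?;
        // The direct path measured above would be:
        // direct_write_uint_le(out, v, 8);
    }
    Ok(())
}
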
BurntSushi commented 5 years ago

Your framing is a bit weird. You ask "why not" as if I've already made a decision to the contrary. I would definitely like to understand whether this is possible to do in safe code first. But I'm not in principle against speeding up these routines by using unaligned reads (or whatever is necessary). I would, for example, like to know why memmove isn't doing that for us already.

velvia commented 5 years ago

@BurntSushi sorry, didn't mean to suggest a decision was made, other than I assumed someone must have looked into this already. I don't have enough experience with Rust to know how to investigate why memmove isn't doing that, other than that intuitively memmove seems like a more expensive operation than a primitive write.

Perhaps this is an OSX issue where it doesn't get inlined on OSX but does on Linux or something; I'm not sure.

Also, regarding use of unsafe. Doesn't the underlying memmove rely on some unsafe macros already?

BurntSushi commented 5 years ago

> Also, regarding use of unsafe. Doesn't the underlying memmove rely on some unsafe macros already?

Everything boils down to unsafe eventually. This does not mean we should add more uses of unsafe.

I'm not drawing a line in the sand that says, "no, we must not use unsafe." I'm simply saying that we should understand what we're doing, why the current method is sub-optimal and how to improve it. If improvements require unsafe, then that's OK, so long as we can justify it.

BurntSushi commented 5 years ago

(I have not looked into this at all.)

velvia commented 5 years ago

> I'm not drawing a line in the sand that says, "no, we must not use unsafe." I'm simply saying that we should understand what we're doing, why the current method is sub-optimal and how to improve it. If improvements require unsafe, then that's OK, so long as we can justify it.

👍 I'm happy to help investigate but not sure where to start. I'd also be interested in less use of unsafe in my own code. :)

BurntSushi commented 5 years ago

This is what I'd do:

  1. Set up a micro-benchmark that I think roughly gauges a real work-load (a rough sketch of such a benchmark appears below this comment).
  2. Use perf (Linux only; other OSes will need different profiling tools) to look at the generated assembly. These functions should be small enough that I'd actually try to work through it. If it's calling memmove, then I might even go look at my platform's implementation of that (or just read the Assembly, hopefully from perf). Currently, we call ptr::copy_nonoverlapping when the endianness matches, which is indeed supposed to just be a memmove (in the most general case). When the endianness doesn't match---which is what I think this issue was originally opened with---then we just use a straight-forward loop. So you'll need to pick which case you're actually trying to optimize here.
  3. Fiddle with the existing implementation(s) to see how the generated Assembly changes, if at all. Is it possible to make it go faster? Examples here might include explicit loop unrolling or hoisting the asserts outside of the loop (for the case where endianness does not match). It's not clear how much fiddling can be done in the case where the endianness does match, but it might be useful to try a naive loop instead of copy_nonoverlapping to compare their performance characteristics.
  4. Form a hypothesis about what might improve things. Try it out. e.g., Try your unaligned read/write strategy. Compare its benchmark performance with the status quo, and also compare the generated Assembly. What changed? Do the changes to the code seem consistent with observed performance differences, if any?
  5. If the result is a suggested change, try to summarize the above process with data and submit a PR.

Hopefully that helps a bit. It's just meant to be a rough sketch of what I personally would do if I were to work on this. It leaves out a lot of details, but that takes a lot more work to write up. If you haven't profiled code before, there should be some Rust-specific articles out there that will help.
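
As a concrete starting point for step 1, here is a minimal sketch of such a micro-benchmark, assuming the nightly libtest harness (the same harness behind the ns/iter and MB/s numbers quoted earlier in this thread); the buffer size is an arbitrary choice:

#![feature(test)]
extern crate test;

use byteorder::{BigEndian, ByteOrder};
use test::Bencher;

#[bench]
fn slice_u16_write_big_endian(b: &mut Bencher) {
    // Arbitrary workload: 100k u16 values written out as big-endian bytes.
    let src = vec![0x1234u16; 100_000];
    let mut dst = vec![0u8; src.len() * 2];
    b.bytes = dst.len() as u64; // lets libtest report MB/s
    b.iter(|| {
        BigEndian::write_u16_into(&src, &mut dst);
        test::black_box(&dst);
    });
}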

BurntSushi commented 5 years ago

@velvia I'm taking a second look at your initial comment here, and I now think it's completely unrelated to this specific issue. This issue is about writing slices, e.g., write_u16_into, and not the normal single-value methods like write_uint or write_u64.

The single-value write methods use copy_nonoverlapping, which should translate to efficient unaligned reads/writes because we're handing it fixed sizes that the compiler should recognize. Indeed, this is exactly how std::ptr::write_unaligned is itself implemented. The only reason byteorder doesn't use write_unaligned is that it requires Rust 1.17 or newer, and I'm very conservative about bumping the MSRV of a crate like byteorder (let's please not litigate that in this issue). So it seems the task now is to understand why byteorder isn't generating the same code that write_unaligned does. I'd follow roughly the same process here as I outlined in my previous comment.
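
To make that concrete, the pattern being described looks roughly like the sketch below (the function name is made up and this is not byteorder's exact source): a copy_nonoverlapping with a length known at compile time, which the compiler can lower to a single unaligned store, i.e. the same thing ptr::write_unaligned compiles down to.

fn write_u64_le(buf: &mut [u8], n: u64) {
    assert!(buf.len() >= 8);
    let n = n.to_le();
    unsafe {
        // The length is a compile-time constant, so this should become one
        // unaligned 8-byte store rather than a call out to memmove/memcpy.
        std::ptr::copy_nonoverlapping(
            &n as *const u64 as *const u8,
            buf.as_mut_ptr(),
            8,
        );
    }
}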

velvia commented 5 years ago

@BurntSushi I'll open a separate issue on this topic for write_u64 itself; I now have code in various branches and a benchmark that can be played with. Unfortunately I can't run Linux easily, but it can serve as a base for investigation.

BurntSushi commented 3 years ago

Closing due to inactivity.

Performance improvements are in general welcome. However, if they require unsafe code, it might be a good idea to open a new issue and discuss the technique first if you're worried about it not being merged. I'm not at all against using unsafe code; I just want to make sure it is well justified.