Add Arm's NEON vectorization

eirnym commented 7 months ago

Could you please enable optimizations for MacBooks by default, as you did for x86_64 CPUs?

DoumanAsh commented 7 months ago

Please understand that the current implementation only supports AVX2 and SSE2. There is no NEON implementation, so it is impossible to enable one by default.

As for defaults in general: NEON cannot be assumed to be available on every Arm target, but I believe all Apple chips running macOS support it, so in theory I could assume it, but only for macOS.

The problem is that when I started this library, NEON support in Rust's std was lacking, and I'm not sure whether the gaps have been filled well enough to implement it. I will try to take another look later.
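For illustration, compile-time backend selection of the kind discussed here can be sketched with `cfg!` checks on `target_arch` and `target_feature`. This is a minimal sketch, not the crate's actual dispatch code; the function name `selected_backend` and the returned labels are hypothetical.

```rust
/// Returns the name of the SIMD backend that would be picked at compile time.
/// Hypothetical helper for illustration; not part of xxhash-rust.
fn selected_backend() -> &'static str {
    if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        // NEON is statically enabled for this build (e.g. Apple Silicon targets).
        "neon"
    } else if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )) {
        "avx2"
    } else if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "sse2"
    )) {
        "sse2"
    } else {
        // No known vector extension was enabled at compile time.
        "scalar"
    }
}

fn main() {
    println!("backend: {}", selected_backend());
}
```

Because the check happens at compile time, the result depends on the target and on flags such as `-C target-feature`, not on the CPU the binary later runs on.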

eirnym commented 7 months ago

Most of the features supported by LLVM have been implemented. As far as I understood the thread, the remaining features are unsupported because LLVM itself has not implemented them.

The documentation also describes many NEON instructions, some of them available since Rust 1.59.0:

https://doc.rust-lang.org/core/arch/arm/index.html https://doc.rust-lang.org/core/arch/aarch64/index.html
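As a small example of what those modules offer, the sketch below adds two 4-lane `u32` vectors with the `vld1q_u32`, `vaddq_u32` and `vst1q_u32` intrinsics from `core::arch::aarch64`, and falls back to scalar code on other targets. The function name `add4` is hypothetical and the example is unrelated to how xxh3 itself is vectorized.

```rust
/// Lane-wise wrapping addition of two 4-element vectors using NEON.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
fn add4(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    use core::arch::aarch64::{vaddq_u32, vld1q_u32, vst1q_u32};
    let mut out = [0u32; 4];
    // SAFETY: NEON is statically enabled (see the cfg above) and every
    // pointer refers to exactly four valid u32 values.
    unsafe {
        let va = vld1q_u32(a.as_ptr());
        let vb = vld1q_u32(b.as_ptr());
        vst1q_u32(out.as_mut_ptr(), vaddq_u32(va, vb));
    }
    out
}

/// Scalar fallback with the same wrapping semantics for other targets.
#[cfg(not(all(target_arch = "aarch64", target_feature = "neon")))]
fn add4(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    [
        a[0].wrapping_add(b[0]),
        a[1].wrapping_add(b[1]),
        a[2].wrapping_add(b[2]),
        a[3].wrapping_add(b[3]),
    ]
}

fn main() {
    println!("{:?}", add4([1, 2, 3, 4], [10, 20, 30, 40]));
}
```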

DoumanAsh commented 7 months ago

@eirnym Can you please give me the output of rustc --print cfg on your M1 laptop? I'm curious whether NEON is enabled by default on Mac.

If so, you can try testing my branch: https://github.com/DoumanAsh/xxhash-rust/pull/35

eirnym commented 7 months ago

I have an M2 laptop running macOS:

$ rustc --print cfg
debug_assertions
panic="unwind"
target_arch="aarch64"
target_endian="little"
target_env=""
target_family="unix"
target_feature="aes"
target_feature="crc"
target_feature="dit"
target_feature="dotprod"
target_feature="dpb"
target_feature="dpb2"
target_feature="fcma"
target_feature="fhm"
target_feature="flagm"
target_feature="fp16"
target_feature="frintts"
target_feature="jsconv"
target_feature="lor"
target_feature="lse"
target_feature="neon"
target_feature="paca"
target_feature="pacg"
target_feature="pan"
target_feature="pmuv3"
target_feature="ras"
target_feature="rcpc"
target_feature="rcpc2"
target_feature="rdm"
target_feature="sb"
target_feature="sha2"
target_feature="sha3"
target_feature="ssbs"
target_feature="vh"
target_has_atomic="128"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="ptr"
target_os="macos"
target_pointer_width="64"
target_vendor="apple"
unix
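The same information can also be queried from inside a program. The sketch below is an illustration with hypothetical function names: it reads the compile-time flag with `cfg!(target_feature = "neon")` and, on aarch64 only, asks the running CPU through `std::arch::is_aarch64_feature_detected!`.

```rust
/// True when NEON was enabled for this build, i.e. when
/// `rustc --print cfg` lists target_feature="neon" for the same target and flags.
fn neon_at_compile_time() -> bool {
    cfg!(target_feature = "neon")
}

/// True when the running CPU reports NEON support.
#[cfg(target_arch = "aarch64")]
fn neon_at_runtime() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

/// The detection macro only exists on aarch64, so other targets report false here.
#[cfg(not(target_arch = "aarch64"))]
fn neon_at_runtime() -> bool {
    false
}

fn main() {
    println!("neon (compile time): {}", neon_at_compile_time());
    println!("neon (runtime):      {}", neon_at_runtime());
}
```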

eirnym commented 7 months ago

my test:

Cargo.toml:

[package]
name = "public-id"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
base64 = "0.21.7"
uuid = { version = "1.7.0", features = ["v4", "v7", "v8"] }
#xxhash-rust = { version = "0.8.8", features = ["xxh3"] }
xxhash-rust = { git="https://github.com/DoumanAsh/xxhash-rust.git", branch="neon", features = ["xxh3"] }

src/main.rs:

use base64::{engine::general_purpose::URL_SAFE, Engine as _};

fn main() {
    let v: u64 = xxhash_rust::xxh3::xxh3_64(uuid::Uuid::new_v4().as_bytes());
    let b64 = URL_SAFE.encode(v.to_le_bytes());
    println!("Hello, world! {}", b64);
}

Both apps (with and without the NEON optimizations) were compiled with --release; Cargo.lock was removed and fd xxhash-rust . -x rm -rf was run in ~/.cargo.

hyperfine output:

$ hyperfine --warmup 1000 -N -u microsecond './public-id-neon-optimizations' ./public-id-no-optimizations

Benchmark 1: ./public-id-neon-optimizations
  Time (mean ± σ):     728.7 µs ±  16.9 µs    [User: 356.3 µs, System: 186.6 µs]
  Range (min … max):   697.4 µs … 1069.2 µs    4060 runs

Benchmark 2: ./public-id-no-optimizations
  Time (mean ± σ):     724.8 µs ±  15.2 µs    [User: 355.4 µs, System: 184.2 µs]
  Range (min … max):   692.9 µs … 920.6 µs    4129 runs

Summary
  ./public-id-no-optimizations ran
    1.01 ± 0.03 times faster than ./public-id-neon-optimizations

DoumanAsh commented 7 months ago

Well, it is good that Mac has NEON enabled by default. I will merge and release a new version later.

eirnym commented 7 months ago

Stats for 256 MB of random data:

$ hyperfine --warmup 1000 -N -u microsecond './public-id-neon-optimizations' ./public-id-no-optimizations
Benchmark 1: ./public-id-neon-optimizations
  Time (mean ± σ):     66061.1 µs ± 1809.4 µs    [User: 14959.7 µs, System: 50626.2 µs]
  Range (min … max):   63642.4 µs … 73034.2 µs    44 runs

Benchmark 2: ./public-id-no-optimizations
  Time (mean ± σ):     75613.7 µs ± 7321.5 µs    [User: 22832.6 µs, System: 51530.6 µs]
  Range (min … max):   70870.0 µs … 115103.5 µs    41 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./public-id-neon-optimizations ran
    1.14 ± 0.12 times faster than ./public-id-no-optimizations
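The ratios can be recomputed from the means reported above. In this small sketch the helper name `speedup` is hypothetical. It also compares the user-time figures, on the assumption (not verified in this thread) that the roughly 50 ms of system time in both runs comes from producing the input rather than from hashing.

```rust
/// Ratio of baseline time to optimized time; values above 1.0 mean a speedup.
fn speedup(baseline_us: f64, optimized_us: f64) -> f64 {
    baseline_us / optimized_us
}

fn main() {
    // Mean wall-clock times from the hyperfine output above.
    let wall = speedup(75613.7, 66061.1);
    // User-time components from the same two runs.
    let user = speedup(22832.6, 14959.7);
    println!("wall-clock speedup: {:.2}x", wall); // prints 1.14x
    println!("user-time speedup:  {:.2}x", user); // prints 1.53x
}
```

If that assumption holds, the user-time ratio may be the better indicator of the hashing speedup, since the wall-clock mean is diluted by work that NEON does not affect.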

DoumanAsh commented 7 months ago

Released 0.8.9 with NEON support.

DoumanAsh / xxhash-rust

Add Arm's NEON vectorization #34