Open johnmave126 opened 1 year ago
Thank you!
After a cursory look, I have the following comments:
How much worse would the code be without `asm_block`? We are somewhat conservative about third-party dependencies, but your macro looks quite useful for repetitive assembly and we may use it in other crates in the future. Would you be willing to transfer it to the utils repo?

I will try to do a more thorough review somewhat later.
It could be worth moving the block iteration loop into the asm block. It could improve performance a bit, since the hasher state would be kept in registers without spilling into memory. LLVM is not always able to move loads and stores out of a loop, so it's better to be explicit.
Worth a try; I will try that during the weekend. The number of registers on `x86_64` supports this, but I haven't given thorough thought to how to do this on `x86` yet.
I thought the RBX register is reserved by LLVM and should not be used by inline asm. Are you sure your code is correct in this regard?
`sp` is always reserved, `bp` and `si` are only reserved on `x86`, and `bx` is only reserved on `x86_64`.
What about binary code size? Could the performance difference be explained by aggressive inlining?
Binary size for the bench executable without the feature:

```
> du -s ../target/release/deps/mod-bbe1efd848abf6c7
6532	../target/release/deps/mod-bbe1efd848abf6c7
```

Binary size for the bench executable with the `inline-asm` feature:

```
> du -s ../target/release/deps/mod-2aedef2c10a10821
6532	../target/release/deps/mod-2aedef2c10a10821
```
Not sure why they are the same. Did I do something wrong?
I inspected the assembly using `cargo asm`; both versions are inlined very similarly. The main reason I attribute the difference to `lea` is this estimate: assuming `add`+`add` on `x86`, the approximate cycle count for the md5 assembly is 592. Changing it to `lea` adds 1 cycle to each of the 64 md5 operations, bringing it to 656 cycles. 592/656 is approximately 0.9, which is very close to the observed performance difference.
> How much worse would the code be without `asm_block`? We are somewhat conservative about third-party dependencies, but your macro looks quite useful for repetitive assembly and we may use it in other crates in the future. Would you be willing to transfer it to the utils repo?
If we don't introduce any utility, the best we can do would be something like this:
```rust
#[cfg_attr(rustfmt, rustfmt_skip)]
macro_rules! op_f {
    ($a: literal, $b: literal, $c: literal, $d: literal, $t: literal, $s: literal, $k: literal, $tmp: literal) => {
        concat!(
            "mov ", $tmp, ", ", $c, "\n",
            "add ", $a, ", ", $t, "\n",
            "xor ", $tmp, ", ", $d, "\n",
            "and ", $tmp, ", ", $b, "\n",
            "xor ", $tmp, ", ", $d, "\n",
            "lea ", $a, ", [", $tmp, " + ", $a, " + ", $k, "]\n",
            "rol ", $a, ", ", $s, "\n",
            "add ", $a, ", ", $b, "\n",
        )
    };
}
```
And the call site would be something like:

```rust
op_f!("{a:e}", "{b:e}", "{c:e}", "{d:e}", "[{x} + 0]", 7, 0xd76aa478, "{t1:e}")
```
I personally find the concatenation hard to read, especially when there are commas inside the strings. The main disadvantage is that it is even harder to write.
And I don't mind moving my crate elsewhere. Just let me know how it should be done.
Tested on AMD Ryzen 9 5900X (Zen 3), Windows 10 22H2:
`x86_64`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    102 ns/iter (+/- 5) = 980 MB/s
test md5_1000  ... bench:  1,016 ns/iter (+/- 31) = 984 MB/s
test md5_10000 ... bench: 10,190 ns/iter (+/- 302) = 981 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 2.99s
```
`x86`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    104 ns/iter (+/- 4) = 961 MB/s
test md5_1000  ... bench:  1,017 ns/iter (+/- 13) = 983 MB/s
test md5_10000 ... bench: 10,189 ns/iter (+/- 235) = 981 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 4.28s
```
`x86_64`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    102 ns/iter (+/- 2) = 980 MB/s
test md5_1000  ... bench:  1,002 ns/iter (+/- 17) = 998 MB/s
test md5_10000 ... bench:  9,959 ns/iter (+/- 72) = 1004 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 2.22s
```
`x86`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    103 ns/iter (+/- 1) = 970 MB/s
test md5_1000  ... bench:  1,002 ns/iter (+/- 38) = 998 MB/s
test md5_10000 ... bench:  9,931 ns/iter (+/- 112) = 1006 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 1.94s
```
This PR implements md5 using inline assembly, which has the benefit of compiling and running on MSVC targets.
This serves as a demo of a path to address #315, https://github.com/RustCrypto/asm-hashes/issues/45, and https://github.com/RustCrypto/asm-hashes/issues/17.
Performance Consideration
In the assembly implementation, a 3-way addition (2 registers + 1 immediate) is implemented as
https://github.com/johnmave126/hashes/blob/40a315dea9d5619e664c3b70749d0af692bca8b8/md5/src/asm/x86.rs#L81
However, on Intel processors before Ice Lake, a 3-way `LEA` has a latency of 3 cycles and can be dispatched to only 1 port (see here). In contrast, on newer processors `LEA` has a latency of 1 cycle and can be dispatched in parallel to 2 ports. On older processors it is therefore better to split the 3-way addition into two instructions (e.g. `LEA` + `ADD`), and LLVM always emits something similar to this second flavor. Hence the inline assembly will perform slower on processors before Ice Lake, but faster afterwards.
On Intel i7-6700K (Skylake), ubuntu-22.04.1:
Using `LEA`+`ADD` in inline assembly still beats non-asm on older processors, but not by much.

On AMD Ryzen 9 5900X (Zen 3), Windows 10 22H2:
Dependency
The inline assembly code for md5 comes from a side project of mine, which uses a declarative macro crate (`asm_block`, also by me) to help define macros that emit assembly. This introduces an additional dependency. If that's unwanted, we can strip `asm_block` down to a minimally viable version and embed it directly.

MSRV
Enabling inline assembly will bump MSRV to 1.59.
CI
The PR also includes an additional CI job to test this feature.