Open johnmave126 opened 1 year ago
Thank you!
After a cursory look, I have the following comments:
How much worse would the code be without `asm_block`? We are somewhat conservative about third-party dependencies, but your macro looks quite useful for repetitive assembly and we may use it in other crates in the future. Would you be willing to transfer it to the utils repo?

I will try to do a more thorough review somewhat later.
It could be worth moving the block iteration loop into the asm block. It could improve performance a bit, since the hasher state would be kept in registers without spilling into memory. LLVM is not always able to move loads and stores out of a loop, so it's better to be explicit.
Worth a try; I will try that during the weekend. The number of registers on `x86_64` supports this, but I haven't given thorough thought to how to do this on `x86` yet.
I thought the RBX register is reserved by LLVM and should not be used by inline asm. Are you sure your code is correct in this regard?
`sp` is always reserved, `bp` and `si` are only reserved on `x86`, and `bx` is only reserved on `x86_64`.
What about binary code size? Could the performance difference be explained by aggressive inlining?
Binary size for the bench executable without the feature:

```
> du -s ../target/release/deps/mod-bbe1efd848abf6c7
6532	../target/release/deps/mod-bbe1efd848abf6c7
```

Binary size for the bench executable with the `inline-asm` feature:

```
> du -s ../target/release/deps/mod-2aedef2c10a10821
6532	../target/release/deps/mod-2aedef2c10a10821
```
Not sure why they are the same. Did I do something wrong?
I inspected the assembly using `cargo asm`; both versions are inlined very similarly. The main reason I attribute the difference to `lea` is this estimate: assuming `add`+`add` on `x86`, the approximate cycle count for the md5 assembly is 592. Changing it to `lea` adds 1 cycle to each of the 64 md5 operations, bringing it to 656 cycles. 592/656 is approximately 0.9, which is very close to the observed performance difference.
> How much worse would the code be without `asm_block`? We are somewhat conservative about third-party dependencies, but your macro looks quite useful for repetitive assembly and we may use it in other crates in the future. Would you be willing to transfer it to the utils repo?
If we don't introduce any utility, the best we can do would be something like this:
```rust
#[cfg_attr(rustfmt, rustfmt_skip)]
macro_rules! op_f {
    ($a: literal, $b: literal, $c: literal, $d: literal, $t: literal, $s: literal, $k: literal, $tmp: literal) => {
        concat!(
            "mov ", $tmp, ", ", $c, "\n",
            "add ", $a, ", ", $t, "\n",
            "xor ", $tmp, ", ", $d, "\n",
            "and ", $tmp, ", ", $b, "\n",
            "xor ", $tmp, ", ", $d, "\n",
            "lea ", $a, ", [", $tmp, " + ", $a, " + ", $k, "]\n",
            "rol ", $a, ", ", $s, "\n",
            "add ", $a, ", ", $b, "\n",
        )
    };
}
```
And the call site would be something like:

```rust
op_f!("{a:e}", "{b:e}", "{c:e}", "{d:e}", "[{x} + 0]", 7, 0xd76aa478, "{t1:e}")
```
I personally find the concatenation hard to read, especially when there are commas inside the strings. The main disadvantage is that it is even harder to write.
And I don't mind moving my crate elsewhere. Just let me know how it should be done.
Tested on AMD Ryzen 9 5900X (Zen 3), Windows 10 22H2:
`x86_64`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    102 ns/iter (+/- 5) = 980 MB/s
test md5_1000  ... bench:  1,016 ns/iter (+/- 31) = 984 MB/s
test md5_10000 ... bench: 10,190 ns/iter (+/- 302) = 981 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 2.99s
```
`x86`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    104 ns/iter (+/- 4) = 961 MB/s
test md5_1000  ... bench:  1,017 ns/iter (+/- 13) = 983 MB/s
test md5_10000 ... bench: 10,189 ns/iter (+/- 235) = 981 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 4.28s
```
`x86_64`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    102 ns/iter (+/- 2) = 980 MB/s
test md5_1000  ... bench:  1,002 ns/iter (+/- 17) = 998 MB/s
test md5_10000 ... bench:  9,959 ns/iter (+/- 72) = 1004 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 2.22s
```
`x86`:

```
running 4 tests
test md5_10    ... bench:     11 ns/iter (+/- 0) = 909 MB/s
test md5_100   ... bench:    103 ns/iter (+/- 1) = 970 MB/s
test md5_1000  ... bench:  1,002 ns/iter (+/- 38) = 998 MB/s
test md5_10000 ... bench:  9,931 ns/iter (+/- 112) = 1006 MB/s

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out; finished in 1.94s
```
This PR implements md5 using inline assembly, which has the benefit of compiling and running on MSVC targets.
This serves as a demo of a path to address #315, https://github.com/RustCrypto/asm-hashes/issues/45, and https://github.com/RustCrypto/asm-hashes/issues/17.
Performance Consideration
In the assembly implementation, a 3-way addition (2 registers + 1 immediate) is implemented as
https://github.com/johnmave126/hashes/blob/40a315dea9d5619e664c3b70749d0af692bca8b8/md5/src/asm/x86.rs#L81
However, on Intel processors before Ice Lake, a 3-way `LEA` has a latency of 3 cycles and can be dispatched to only 1 port (see here). In contrast, on newer processors `LEA` has a latency of 1 cycle and can be dispatched in parallel to 2 ports. On older processors it is therefore better to split the 3-way addition into two instructions (e.g. `LEA` + `ADD`), and LLVM always emits something similar to this second flavor. Hence the inline assembly will perform slower on processors before Ice Lake, but faster afterwards.
On Intel i7-6700K (Skylake), ubuntu-22.04.1:
Using `LEA`+`ADD` in inline assembly still beats non-asm on older processors, but not by much.

On AMD Ryzen 9 5900X (Zen 3), Windows 10 22H2:
Dependency
The inline assembly code for md5 comes from a side project of mine, which uses a declarative macro crate (`asm_block`, also by me) to help define macros that emit assembly. This introduces an additional dependency. If that's unwanted, we can strip `asm_block` down to a minimally viable version and embed it directly.

MSRV
Enabling inline assembly will bump MSRV to 1.59.
CI
The PR also includes an additional CI job to test this feature.