Add support for using VAES instructions for NI parallel operations.

silvanshade commented 10 months ago

This PR adds support for using VAES intrinsics for the ni backend for the aes 8-fold operations.

The change shows a nice speed up on Zen4 CPUs at least.

Benchmarks (Ryzen 7950x):

RUSTFLAGS="-C target-cpu=native" cargo bench:

running 15 tests
test aes128_decrypt_block  ... bench:       1,043 ns/iter (+/- 83) = 15708 MB/s
test aes128_decrypt_blocks ... bench:         944 ns/iter (+/- 4) = 17355 MB/s
test aes128_encrypt_block  ... bench:       1,042 ns/iter (+/- 2) = 15723 MB/s
test aes128_encrypt_blocks ... bench:         944 ns/iter (+/- 23) = 17355 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,299 ns/iter (+/- 5) = 12612 MB/s
test aes192_decrypt_blocks ... bench:       1,142 ns/iter (+/- 78) = 14346 MB/s
test aes192_encrypt_block  ... bench:       1,300 ns/iter (+/- 2) = 12603 MB/s
test aes192_encrypt_blocks ... bench:       1,142 ns/iter (+/- 24) = 14346 MB/s
test aes192_new            ... bench:          10 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,622 ns/iter (+/- 18) = 10101 MB/s
test aes256_decrypt_blocks ... bench:       1,330 ns/iter (+/- 84) = 12318 MB/s
test aes256_encrypt_block  ... bench:       1,622 ns/iter (+/- 5) = 10101 MB/s
test aes256_encrypt_blocks ... bench:       1,330 ns/iter (+/- 86) = 12318 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 13.07

RUSTFLAGS="-C target-cpu=native" cargo bench --features vaes:

running 15 tests
test aes128_decrypt_block  ... bench:       1,040 ns/iter (+/- 19) = 15753 MB/s
test aes128_decrypt_blocks ... bench:         464 ns/iter (+/- 7) = 35310 MB/s
test aes128_encrypt_block  ... bench:       1,039 ns/iter (+/- 25) = 15769 MB/s
test aes128_encrypt_blocks ... bench:         464 ns/iter (+/- 7) = 35310 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,300 ns/iter (+/- 49) = 12603 MB/s
test aes192_decrypt_blocks ... bench:         556 ns/iter (+/- 7) = 29467 MB/s
test aes192_encrypt_block  ... bench:       1,295 ns/iter (+/- 28) = 12651 MB/s
test aes192_encrypt_blocks ... bench:         557 ns/iter (+/- 8) = 29414 MB/s
test aes192_new            ... bench:          10 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,619 ns/iter (+/- 58) = 10119 MB/s
test aes256_decrypt_blocks ... bench:         650 ns/iter (+/- 7) = 25206 MB/s
test aes256_encrypt_block  ... bench:       1,616 ns/iter (+/- 33) = 10138 MB/s
test aes256_encrypt_blocks ... bench:         649 ns/iter (+/- 6) = 25244 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 15.20s

I experimented with changing ParBlocksSize to 32 and unfolding the loop more for the VAES case to see if it made a difference, but at least on Zen4 it didn't seem to matter.

One thing I noticed is that it is quite important that the target-cpu is set correctly, otherwise the performance can be bad:

cargo bench --features vaes:

running 15 tests
test aes128_decrypt_block  ... bench:       1,308 ns/iter (+/- 44) = 12525 MB/s
test aes128_decrypt_blocks ... bench:      18,713 ns/iter (+/- 694) = 875 MB/s
test aes128_encrypt_block  ... bench:       1,340 ns/iter (+/- 10) = 12226 MB/s
test aes128_encrypt_blocks ... bench:      18,676 ns/iter (+/- 569) = 877 MB/s
test aes128_new            ... bench:          26 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,530 ns/iter (+/- 16) = 10708 MB/s
test aes192_decrypt_blocks ... bench:      21,871 ns/iter (+/- 754) = 749 MB/s
test aes192_encrypt_block  ... bench:       1,531 ns/iter (+/- 11) = 10701 MB/s
test aes192_encrypt_blocks ... bench:      22,029 ns/iter (+/- 736) = 743 MB/s
test aes192_new            ... bench:          30 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,777 ns/iter (+/- 67) = 9220 MB/s
test aes256_decrypt_blocks ... bench:      25,237 ns/iter (+/- 1,035) = 649 MB/s
test aes256_encrypt_block  ... bench:       1,741 ns/iter (+/- 53) = 9410 MB/s
test aes256_encrypt_blocks ... bench:      25,090 ns/iter (+/- 1,023) = 653 MB/s
test aes256_new            ... bench:          79 ns/iter (+/- 2)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 53.70s

Regarding adding the vaes feature to the Cargo.toml, rather than using cpufeatures, I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.

This is partly due to the fact that some of the instructions are gated behind stdsimd.

Also, as noted in another thread on the Rust forums, there isn't really a way to handle the negation of a case for target_feature, so it would be difficult to figure out how to override the selection of the usual ni 8-fold operations with the vaes versions.

But if anyone has suggestions on how to structure this better I'd be happy to make those changes.

tarcieri commented 10 months ago

Regarding adding the vaes feature to the Cargo.toml, rather than using cpufeatures, I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.

@silvanshade I'd definitely recommend trying to get cpufeatures working. The latest v0.2.12 release just added support for detecting VAES.

Structurally it'd look pretty much like what you have, but you'd have both the ni and vaes modules linked on x86-ish targets.

You'd need to add detection for VAES, and a branch to use it if available.

If that's not something you're particularly interested in, we can work with this and @newpavlov or myself can complete it.

Either way, thanks!

silvanshade commented 10 months ago

@silvanshade I'd definitely recommend trying to get cpufeatures working. The latest v0.2.12 release just added support for detecting VAES.

Yeah, I had created the PR that added that, since I was originally going to try and use cpufeatures for this.

You'd need to add detection for VAES, and a branch to use it if available.

I can create a branch like this. That's not the main difficulty, as I understand it.

Rather, the VAES features aren't usable without also enabling stdsimd and avx512_target_feature, which I added to the top of lib.rs.

I could change it to where there is a --cfg vaes rather than the cargo feature, so it works more like the aes_armv8 or something, but there still needs to be something to conditionally enable those features, otherwise aes won't even be able to compile on stable anymore.

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

What do you think?

tarcieri commented 10 months ago

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

That sounds great!

silvanshade commented 10 months ago

I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.

That sounds great!

So it is possible to write local versions of these intrinsics, e.g.,

#[inline]
#[target_feature(enable = "avx512f")]
pub(super) unsafe fn pf_mm512_aesdec_epi128(data: __m512i, round_key: __m512i) -> __m512i {
    let result: __m512i;
    asm!(
        "vaesdec {result}, {data}, {round_key}",
        data = in(zmm_reg) data,
        round_key = in(zmm_reg) round_key,
        result = out(zmm_reg) result,
        options(pure, nomem, nostack, preserves_flags)
    );
    result
}

But what I didn't realize is that it's still necessary to have the #[target_feature(enable = "avx512f")], which in turn requires #![feature(avx512_target_feature)] (which is unstable) in order to use the AVX registers. So I think at minimum we would probably need to add a "nightly" feature or something to gate these on and can't quite get away with just autodetection and local versions of the intrinsics.

If that seems reasonable, I will add that feature ("nightly", or "unstable", or whatever else you'd prefer to call it) and then implement the autodetection for vaes.

silvanshade commented 10 months ago

I think it will be better to write a separate implementation in the vaes module instead of piggybacking on the ni module. It's also probably worth to increase number of blocks processed in parallel for the VAES backend. Right now, you call only two aesdec/aesenc functions per round, thus potentially loosing on additional ILP-based throughput (the instructions have latency of 3 cycles and throughput of 1 cycle). Additionally, with AVX-512 you have 32 ZMM registers, so you have less register pressure.

Initially I was planning to do that but the reason I opted not to is because, for the single block case, we still basically want to fall back to the NI implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code though?

I did try increasing the parallel blocks to 32 (calling 8 of the respective instructions) but didn't notice a performance difference in the benchmarks here, although I only have the one system to test on. But I agree it probably makes sense in general, and especially for a separate backend.

newpavlov commented 10 months ago

for the single block case, we still basically want to fall back to the NI implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code though?

Yes, also key expansion code will be the same. But I think that the parallel processing function definitely should live in the vaes module. It will allow us to change number of blocks processed in parallel, will remove the unnecessary key broadcasts and casting between __m512i and __m256i.

So I think we should define separate backends (i.e. structs which implement the BlockBackend trait) which will use functions from the ni module for single block processing and key expansion.

tarcieri commented 10 months ago

But what I didn't realize is that it's still necessary to have the #[target_feature(enable = "avx512f")], which in turn requires #![feature(avx512_target_feature)] (which is unstable) in order to use the AVX registers.

@silvanshade aah yes, that's unfortunate. I've ran into similar issues in the past and the only way I solved it was opening a stabilization PR for the relevant target features (which in the past I did manage to get one merged), although offhand I'm not sure what the blockers are.

Not seeing much discussion here either: https://github.com/rust-lang/rust/issues/44839

silvanshade commented 9 months ago

@tarcieri @newpavlov I’ve updated the PR and tried to address prior feedback.

There’s also a companion PR for an action here

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt. We still have to do the broadcasting at least once, but now we can limit that to just the key schedule functions and avoid the additional overhead for parallel encrypt/decrypt.

One thing I did not address is trying to merge the VAES backend into the autodetect framework.

The reason for this basically is that: although we can dynamically select the algorithm at runtime using cpu features, we are still (with the current type structure) limited by the types we can use, fixed at compile time.

This is a problem specifically having to do with the key size. For instance, if we wanted to have a backend which dynamically selected between AESNI or VAES, we have to compromise on either using __m128i for the round keys (and broadcasting to __m512i frequently), or using __m512i for the round keys (and casting to __m128i).

Both are problematic. Going from __m128i to __m512i is inefficient. Going the other way from __m512i to __m128i could actually potentially be viable, except for the fact that __m512i isn’t available for use in stable Rust currently. So going that route would force even the AESNI backend to require nightly.

Given that, I thought it would be best to just keep the backends separate for the time being.

In order to use the VAES backend, the target_feature=+vaes must be specified, and a nightly toolchain must be used. There are more details in the comments I added.

Also, I increased the block size for VAES to 64. Going from 32 to 64 doesn’t seem to make any difference on my system, but then neither did going from 8 to 32. But potentially it could make a difference somewhere. I suspect the reason a difference isn’t noticeable though is because the compiler is probably doing a decent job of unrolling the loops already, at least for these tests.

newpavlov commented 9 months ago

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.

I don't think it's worth to store broadcasted keys as part of Aes* states. It could be better to store them only in backends, i.e. instead of reference to an Aes* state they could store broadcasted copy of round keys.

although we can dynamically select the algorithm at runtime using cpu features, we are still (with the current type structure) limited by the types we can use, fixed at compile time.

I think the only way for working around this is instead of using polyfills to implement encrypt/decrypt functions as one asm! block with explicitly named registers (you can use macros to reduce amount of boilerplate). This way the code will have no mentions of __m512i types, which require the unstable target features, but it would mean that we have to do round key broadcasts on each encrypt/decrypt. It's an unfortunate cost, but I think we can live with it until AVX-512 intrinsics get stabilized. I think (but not 100% sure) that clobber_abi("C") should properly handle clobbering of ZMM registers.

silvanshade commented 9 months ago

This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.

I don't think it's worth to store broadcasted keys as part of Aes* states. It could be better to store them only in backends, i.e. instead of reference to an Aes* state they could store broadcasted copy of round keys.

Can you elaborate on this? I'm not entirely sure I understand what this would look like or why this would be beneficial.

Is the idea that this would make it to where the non-broadcasted round keys are still available for proc_block and only proc_par_blocks would use the broadcasted round keys?

newpavlov commented 9 months ago

The main reason is that it would quadruple the size of Aes* states. Even worse, with enabled autodetection it would affect targets without AVX-512 (remember that we use union in this case). And since broadcasted keys contain simple copies, it feels quite wasteful.

Is the idea that this would make it to where the non-broadcasted round keys are still available for proc_block and only proc_par_blocks would use the broadcasted round keys?

Yes. Instead of this:

struct $name_enc {
    round_keys: [__m512i; $rounds],
}

struct $name_back_enc<'a>(&'a $name_enc);

It would be better to write this:

struct $name_enc {
    round_keys: [__m128i; $rounds],
}

struct $name_back_enc<'a> {
    // Owned copy of broadcasted round keys
    k1: [__m512i; $rounds],
    // References $name_enc
    k2: &'a [__m128i; $rounds],
}

During parallel block processing the broadcasted round keys are likely to stay in registers and may not be even spilled to stack (assuming you will use an appropriate value for ParBlocksSize).

silvanshade commented 9 months ago

Okay, I tried refactoring how you suggested.

Initially, the results were a little surprising, because the single block case was suddenly far slower than before the refactoring.

Before splitting the key representation:

running 15 tests
test aes128_decrypt_block  ... bench:       1,333 ns/iter (+/- 28) = 12291 MB/s
test aes128_decrypt_blocks ... bench:         474 ns/iter (+/- 31) = 34565 MB/s
test aes128_encrypt_block  ... bench:       1,310 ns/iter (+/- 32) = 12506 MB/s
test aes128_encrypt_blocks ... bench:         474 ns/iter (+/- 7) = 34565 MB/s
test aes128_new            ... bench:          36 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,514 ns/iter (+/- 38) = 10821 MB/s
test aes192_decrypt_blocks ... bench:         567 ns/iter (+/- 7) = 28895 MB/s
test aes192_encrypt_block  ... bench:       1,510 ns/iter (+/- 53) = 10850 MB/s
test aes192_encrypt_blocks ... bench:         566 ns/iter (+/- 8) = 28946 MB/s
test aes192_new            ... bench:          38 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,713 ns/iter (+/- 44) = 9564 MB/s
test aes256_decrypt_blocks ... bench:         659 ns/iter (+/- 7) = 24861 MB/s
test aes256_encrypt_block  ... bench:       1,724 ns/iter (+/- 25) = 9503 MB/s
test aes256_encrypt_blocks ... bench:         656 ns/iter (+/- 10) = 24975 MB/s
test aes256_new            ... bench:          48 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 20.03s

After splitting the key representation:

running 15 tests
test aes128_decrypt_block  ... bench:       6,409 ns/iter (+/- 71) = 2556 MB/s
test aes128_decrypt_blocks ... bench:         475 ns/iter (+/- 70) = 34492 MB/s
test aes128_encrypt_block  ... bench:       6,379 ns/iter (+/- 59) = 2568 MB/s
test aes128_encrypt_blocks ... bench:         471 ns/iter (+/- 5) = 34785 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       8,038 ns/iter (+/- 140) = 2038 MB/s
test aes192_decrypt_blocks ... bench:         567 ns/iter (+/- 34) = 28895 MB/s
test aes192_encrypt_block  ... bench:       7,988 ns/iter (+/- 73) = 2051 MB/s
test aes192_encrypt_blocks ... bench:         564 ns/iter (+/- 8) = 29049 MB/s
test aes192_new            ... bench:          11 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       9,397 ns/iter (+/- 281) = 1743 MB/s
test aes256_decrypt_blocks ... bench:         661 ns/iter (+/- 32) = 24786 MB/s
test aes256_encrypt_block  ... bench:       9,407 ns/iter (+/- 267) = 1741 MB/s
test aes256_encrypt_blocks ... bench:         657 ns/iter (+/- 24) = 24937 MB/s
test aes256_new            ... bench:          12 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 11.98s

The only thing that really changed was that I moved the call to .map on [__m128i; N] array from the expand_key and inv_expanded_keys functions to the get_enc_backend and get_dec_backend functions. But also the original array was kept around now (as a copy), whereas before it was consumed.

This made me suspect maybe the compiler was able to optimize the previous case better.

I tried switching to a slightly different representation where the parallel keys are lazily initialized, only if proc_par_blocks is called, and indeed that seemed to restore the original performance:

running 15 tests
test aes128_decrypt_block  ... bench:       1,257 ns/iter (+/- 31) = 13034 MB/s
test aes128_decrypt_blocks ... bench:         472 ns/iter (+/- 19) = 34711 MB/s
test aes128_encrypt_block  ... bench:       1,254 ns/iter (+/- 58) = 13065 MB/s
test aes128_encrypt_blocks ... bench:         482 ns/iter (+/- 25) = 33991 MB/s
test aes128_new            ... bench:           9 ns/iter (+/- 0)
test aes192_decrypt_block  ... bench:       1,443 ns/iter (+/- 44) = 11354 MB/s
test aes192_decrypt_blocks ... bench:         580 ns/iter (+/- 30) = 28248 MB/s
test aes192_encrypt_block  ... bench:       1,444 ns/iter (+/- 32) = 11346 MB/s
test aes192_encrypt_blocks ... bench:         590 ns/iter (+/- 37) = 27769 MB/s
test aes192_new            ... bench:          11 ns/iter (+/- 0)
test aes256_decrypt_block  ... bench:       1,671 ns/iter (+/- 46) = 9804 MB/s
test aes256_decrypt_blocks ... bench:         673 ns/iter (+/- 15) = 24344 MB/s
test aes256_encrypt_block  ... bench:       1,673 ns/iter (+/- 37) = 9793 MB/s
test aes256_encrypt_blocks ... bench:         686 ns/iter (+/- 15) = 23883 MB/s
test aes256_new            ... bench:          13 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 28.50s

newpavlov commented 9 months ago

The only thing that really changed was that I moved the call to .map on [__m128i; N] array from the expand_key and inv_expanded_keys functions to the get_enc_backend and get_dec_backend functions. But also the original array was kept around now (as a copy), whereas before it was consumed.

Compiler also can sometimes have difficulties with optimizing array::map-based code, especially with large arrays, because the method uses additional code to be safe against potential panics in passed closure.

silvanshade commented 9 months ago

I've added VAES support to the autodetection.

This required refactoring some parts of the autodetection code to handle in a cleaner way.

In order to handle VAES on stable I resorted to using inline asm!. One downside to this approach is that the asm! macro complains that zmm registers are not usable on i686 targets (despite the fact that it's possible to compile for that with the intrinsics).

To work around that I just feature gated the VAES backend to only work on x86_64 targets. I doubt anyone will ever actually want to compile with this feature for 32-bit targets anyway, but in the future it could be possible by switching back to the intrinsics.

I didn't change the hazmat code to include VAES since I wasn't sure if you want to increase the block size there or not.

I think this addresses basically all of the feedback now?

silvanshade commented 9 months ago

Two more small changes:

I also added a VAES backend for AVX (256-bit) since it's possible (on future Intel CPUs) to have a scenario where VAES is available but AVX512 is not.
I added a --cfg disable_avx512 check which will force the 256-bit VAES backend even if AVX512 is available. This may be useful in scenarios where downclocking from AVX512 is an issue. It's also useful for benchmarking.

silvanshade commented 6 months ago

@tarcieri @newpavlov Do you intend to merge this?

tarcieri commented 6 months ago

I'd generally be in favor but it's definitely a large PR. Sorry it's gone by the wayside. I will hopefully have time to review soon. Also curious to know what @newpavlov thinks.

silvanshade commented 6 months ago

Thanks.

I would like to resume working the RISC-V and ARMv9 PRs (especially the latter will be relevant soon since Apple Silicon M4 is ARMv9 with SVE2/SME) but prefer to see how this one lands first before putting a lot more effort into those.

silvanshade commented 5 months ago

I've made several changes since the recent feedback:

I completely redesigned the backend code to use distinct types rather than coalescing them for simplicity. As noted in some discussion above, this required modifying part of the interace between autodetect and the architecture modules.
I changed the clobbers to be explicit.
For VAES 256, I moved the loading/store out of the inline asm!, which allowed adding pure, readonly for those blocks.

At this point I would actually prefer not to focus much more on refactoring the algorithms (re: experimenting with block counts, broadcasting, etc).

I've put quite a lot of time into this PR already and the performance gains are pretty reasonable I think. There's always room in the future for more fine-tuning.

I'm still willing to address remaining design issues though.

newpavlov commented 5 months ago

@silvanshade Yes, I think it's better to experiment with minor modifications in separate PRs. I will try to fully review the code this week (likely during weekend) and probably will merge it after that (I can fix minor nits myself if needed).

silvanshade commented 5 months ago

@tarcieri @newpavlov Any updates on this?

tarcieri commented 5 months ago

@silvanshade why did you close this? It seemed pretty close to complete.

silvanshade commented 5 months ago

I closed it because I still haven't gotten a thorough review and discussion about the implementation, even though I've repeatedly addressed all of the smaller feedback to the best of my ability.

From my perspective, there is no real evidence that this PR is "close to complete".

I thought it was basically complete months ago and asked for feedback then, and waited, and nothing happened.

I realize that maintainers are often very busy with other things but I think that it should have been possible by now to get a more concrete idea about whether this is ever likely to be merged and if not, what are the blockers.

The last substantive exchange with @newpavlov suggested I fundamentally misunderstood something about the implementation, and that was never clarified.

So I just don't think it's a good use of time to continue.

If you think otherwise, what would you suggest?

tarcieri commented 5 months ago

@newpavlov's last comment, as of two weeks ago, was:

I will try to fully review the code this week (likely during weekend) and probably will merge it after that (I can fix minor nits myself if needed).

It sounds like he wanted to just do one final pass before merging.

@silvanshade can you please reopen and we can get this merged?

silvanshade commented 5 months ago

@silvanshade can you please reopen and we can get this merged?

I think it would be more productive to re-open it if or when there's a final review.

newpavlov commented 5 months ago

Sorry for the delay! I couldn't find enough time during the previous weekend, so I will try again on this one.

Closing PR makes it less visible and increases chances of forgetting about it, so I will reopen.

silvanshade commented 5 months ago

@newpavlov Thanks for the update. Unfortunately I've deleted the branch and no longer wish to contribute to this project.

RustCrypto / block-ciphers

Add support for using VAES instructions for NI parallel operations. #396