Closed silvanshade closed 5 months ago
Regarding adding the vaes feature to the Cargo.toml, rather than using cpufeatures, I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.
@silvanshade I'd definitely recommend trying to get cpufeatures
working. The latest v0.2.12 release just added support for detecting VAES.
Structurally it'd look pretty much like what you have, but you'd have both the ni
and vaes
modules linked on x86-ish targets.
You'd need to add detection for VAES, and a branch to use it if available.
If that's not something you're particularly interested in, we can work with this and @newpavlov or myself can complete it.
Either way, thanks!
@silvanshade I'd definitely recommend trying to get
cpufeatures
working. The latest v0.2.12 release just added support for detecting VAES.
Yeah, I had created the PR that added that, since I was originally going to try and use cpufeatures
for this.
You'd need to add detection for VAES, and a branch to use it if available.
I can create a branch like this. That's not the main difficulty, as I understand it.
Rather, the VAES
features aren't usable without also enabling stdsimd
and avx512_target_feature
, which I added to the top of lib.rs
.
I could change it to where there is a --cfg vaes
rather than the cargo feature, so it works more like the aes_armv8
or something, but there still needs to be something to conditionally enable those features, otherwise aes
won't even be able to compile on stable
anymore.
I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES
intrinsics like what was done for the ARM backend.
What do you think?
I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.
That sounds great!
I think maybe the only way to add support for this without some sort of feature gating would be to locally define stable versions of the VAES intrinsics like what was done for the ARM backend.
That sounds great!
So it is possible to write local versions of these intrinsics, e.g.,
#[inline]
#[target_feature(enable = "avx512f")]
pub(super) unsafe fn pf_mm512_aesdec_epi128(data: __m512i, round_key: __m512i) -> __m512i {
let result: __m512i;
asm!(
"vaesdec {result}, {data}, {round_key}",
data = in(zmm_reg) data,
round_key = in(zmm_reg) round_key,
result = out(zmm_reg) result,
options(pure, nomem, nostack, preserves_flags)
);
result
}
But what I didn't realize is that it's still necessary to have the #[target_feature(enable = "avx512f")]
, which in turn requires #![feature(avx512_target_feature)]
(which is unstable
) in order to use the AVX registers. So I think at minimum we would probably need to add a "nightly"
feature or something to gate these on and can't quite get away with just autodetection and local versions of the intrinsics.
If that seems reasonable, I will add that feature ("nightly"
, or "unstable"
, or whatever else you'd prefer to call it) and then implement the autodetection for vaes
.
I think it will be better to write a separate implementation in the
vaes
module instead of piggybacking on theni
module. It's also probably worth to increase number of blocks processed in parallel for the VAES backend. Right now, you call only twoaesdec
/aesenc
functions per round, thus potentially loosing on additional ILP-based throughput (the instructions have latency of 3 cycles and throughput of 1 cycle). Additionally, with AVX-512 you have 32 ZMM registers, so you have less register pressure.
Initially I was planning to do that but the reason I opted not to is because, for the single block case, we still basically want to fall back to the NI
implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code though?
I did try increasing the parallel blocks to 32 (calling 8 of the respective instructions) but didn't notice a performance difference in the benchmarks here, although I only have the one system to test on. But I agree it probably makes sense in general, and especially for a separate backend.
for the single block case, we still basically want to fall back to the NI implementation, don't we? Maybe for maintenance or structural reasons it would be cleaner to just duplicate that code though?
Yes, also key expansion code will be the same. But I think that the parallel processing function definitely should live in the vaes
module. It will allow us to change number of blocks processed in parallel, will remove the unnecessary key broadcasts and casting between __m512i
and __m256i
.
So I think we should define separate backends (i.e. structs which implement the BlockBackend
trait) which will use functions from the ni
module for single block processing and key expansion.
But what I didn't realize is that it's still necessary to have the #[target_feature(enable = "avx512f")], which in turn requires #![feature(avx512_target_feature)] (which is unstable) in order to use the AVX registers.
@silvanshade aah yes, that's unfortunate. I've ran into similar issues in the past and the only way I solved it was opening a stabilization PR for the relevant target features (which in the past I did manage to get one merged), although offhand I'm not sure what the blockers are.
Not seeing much discussion here either: https://github.com/rust-lang/rust/issues/44839
@tarcieri @newpavlov I’ve updated the PR and tried to address prior feedback.
There’s also a companion PR for an action here
This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt. We still have to do the broadcasting at least once, but now we can limit that to just the key schedule functions and avoid the additional overhead for parallel encrypt/decrypt.
One thing I did not address is trying to merge the VAES backend into the autodetect framework.
The reason for this basically is that: although we can dynamically select the algorithm at runtime using cpu features, we are still (with the current type structure) limited by the types we can use, fixed at compile time.
This is a problem specifically having to do with the key size. For instance, if we wanted to have a backend which dynamically selected between AESNI or VAES, we have to compromise on either using __m128i
for the round keys (and broadcasting to __m512i
frequently), or using __m512i
for the round keys (and casting to __m128i
).
Both are problematic. Going from __m128i
to __m512i
is inefficient. Going the other way from __m512i
to __m128i
could actually potentially be viable, except for the fact that __m512i
isn’t available for use in stable Rust currently. So going that route would force even the AESNI backend to require nightly.
Given that, I thought it would be best to just keep the backends separate for the time being.
In order to use the VAES backend, the target_feature=+vaes
must be specified, and a nightly toolchain must be used. There are more details in the comments I added.
Also, I increased the block size for VAES to 64. Going from 32 to 64 doesn’t seem to make any difference on my system, but then neither did going from 8 to 32. But potentially it could make a difference somewhere. I suspect the reason a difference isn’t noticeable though is because the compiler is probably doing a decent job of unrolling the loops already, at least for these tests.
This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.
I don't think it's worth to store broadcasted keys as part of Aes*
states. It could be better to store them only in backends, i.e. instead of reference to an Aes*
state they could store broadcasted copy of round keys.
although we can dynamically select the algorithm at runtime using cpu features, we are still (with the current type structure) limited by the types we can use, fixed at compile time.
I think the only way for working around this is instead of using polyfills to implement encrypt/decrypt functions as one asm!
block with explicitly named registers (you can use macros to reduce amount of boilerplate). This way the code will have no mentions of __m512i
types, which require the unstable target features, but it would mean that we have to do round key broadcasts on each encrypt/decrypt. It's an unfortunate cost, but I think we can live with it until AVX-512 intrinsics get stabilized. I think (but not 100% sure) that clobber_abi("C")
should properly handle clobbering of ZMM registers.
This version uses separate backend definitions in order to avoid rebroadcasting the key from 128b -> 512b for each call to encrypt/decrypt.
I don't think it's worth to store broadcasted keys as part of
Aes*
states. It could be better to store them only in backends, i.e. instead of reference to anAes*
state they could store broadcasted copy of round keys.
Can you elaborate on this? I'm not entirely sure I understand what this would look like or why this would be beneficial.
Is the idea that this would make it to where the non-broadcasted round keys are still available for proc_block
and only proc_par_blocks
would use the broadcasted round keys?
The main reason is that it would quadruple the size of Aes*
states. Even worse, with enabled autodetection it would affect targets without AVX-512 (remember that we use union
in this case). And since broadcasted keys contain simple copies, it feels quite wasteful.
Is the idea that this would make it to where the non-broadcasted round keys are still available for proc_block and only proc_par_blocks would use the broadcasted round keys?
Yes. Instead of this:
struct $name_enc {
round_keys: [__m512i; $rounds],
}
struct $name_back_enc<'a>(&'a $name_enc);
It would be better to write this:
struct $name_enc {
round_keys: [__m128i; $rounds],
}
struct $name_back_enc<'a> {
// Owned copy of broadcasted round keys
k1: [__m512i; $rounds],
// References $name_enc
k2: &'a [__m128i; $rounds],
}
During parallel block processing the broadcasted round keys are likely to stay in registers and may not be even spilled to stack (assuming you will use an appropriate value for ParBlocksSize
).
Okay, I tried refactoring how you suggested.
Initially, the results were a little surprising, because the single block case was suddenly far slower than before the refactoring.
Before splitting the key representation:
running 15 tests
test aes128_decrypt_block ... bench: 1,333 ns/iter (+/- 28) = 12291 MB/s
test aes128_decrypt_blocks ... bench: 474 ns/iter (+/- 31) = 34565 MB/s
test aes128_encrypt_block ... bench: 1,310 ns/iter (+/- 32) = 12506 MB/s
test aes128_encrypt_blocks ... bench: 474 ns/iter (+/- 7) = 34565 MB/s
test aes128_new ... bench: 36 ns/iter (+/- 0)
test aes192_decrypt_block ... bench: 1,514 ns/iter (+/- 38) = 10821 MB/s
test aes192_decrypt_blocks ... bench: 567 ns/iter (+/- 7) = 28895 MB/s
test aes192_encrypt_block ... bench: 1,510 ns/iter (+/- 53) = 10850 MB/s
test aes192_encrypt_blocks ... bench: 566 ns/iter (+/- 8) = 28946 MB/s
test aes192_new ... bench: 38 ns/iter (+/- 0)
test aes256_decrypt_block ... bench: 1,713 ns/iter (+/- 44) = 9564 MB/s
test aes256_decrypt_blocks ... bench: 659 ns/iter (+/- 7) = 24861 MB/s
test aes256_encrypt_block ... bench: 1,724 ns/iter (+/- 25) = 9503 MB/s
test aes256_encrypt_blocks ... bench: 656 ns/iter (+/- 10) = 24975 MB/s
test aes256_new ... bench: 48 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 20.03s
After splitting the key representation:
running 15 tests
test aes128_decrypt_block ... bench: 6,409 ns/iter (+/- 71) = 2556 MB/s
test aes128_decrypt_blocks ... bench: 475 ns/iter (+/- 70) = 34492 MB/s
test aes128_encrypt_block ... bench: 6,379 ns/iter (+/- 59) = 2568 MB/s
test aes128_encrypt_blocks ... bench: 471 ns/iter (+/- 5) = 34785 MB/s
test aes128_new ... bench: 9 ns/iter (+/- 0)
test aes192_decrypt_block ... bench: 8,038 ns/iter (+/- 140) = 2038 MB/s
test aes192_decrypt_blocks ... bench: 567 ns/iter (+/- 34) = 28895 MB/s
test aes192_encrypt_block ... bench: 7,988 ns/iter (+/- 73) = 2051 MB/s
test aes192_encrypt_blocks ... bench: 564 ns/iter (+/- 8) = 29049 MB/s
test aes192_new ... bench: 11 ns/iter (+/- 0)
test aes256_decrypt_block ... bench: 9,397 ns/iter (+/- 281) = 1743 MB/s
test aes256_decrypt_blocks ... bench: 661 ns/iter (+/- 32) = 24786 MB/s
test aes256_encrypt_block ... bench: 9,407 ns/iter (+/- 267) = 1741 MB/s
test aes256_encrypt_blocks ... bench: 657 ns/iter (+/- 24) = 24937 MB/s
test aes256_new ... bench: 12 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 11.98s
The only thing that really changed was that I moved the call to .map
on [__m128i; N]
array from the expand_key
and inv_expanded_keys
functions to the get_enc_backend
and get_dec_backend
functions. But also the original array was kept around now (as a copy), whereas before it was consumed.
This made me suspect maybe the compiler was able to optimize the previous case better.
I tried switching to a slightly different representation where the parallel keys are lazily initialized, only if proc_par_blocks
is called, and indeed that seemed to restore the original performance:
running 15 tests
test aes128_decrypt_block ... bench: 1,257 ns/iter (+/- 31) = 13034 MB/s
test aes128_decrypt_blocks ... bench: 472 ns/iter (+/- 19) = 34711 MB/s
test aes128_encrypt_block ... bench: 1,254 ns/iter (+/- 58) = 13065 MB/s
test aes128_encrypt_blocks ... bench: 482 ns/iter (+/- 25) = 33991 MB/s
test aes128_new ... bench: 9 ns/iter (+/- 0)
test aes192_decrypt_block ... bench: 1,443 ns/iter (+/- 44) = 11354 MB/s
test aes192_decrypt_blocks ... bench: 580 ns/iter (+/- 30) = 28248 MB/s
test aes192_encrypt_block ... bench: 1,444 ns/iter (+/- 32) = 11346 MB/s
test aes192_encrypt_blocks ... bench: 590 ns/iter (+/- 37) = 27769 MB/s
test aes192_new ... bench: 11 ns/iter (+/- 0)
test aes256_decrypt_block ... bench: 1,671 ns/iter (+/- 46) = 9804 MB/s
test aes256_decrypt_blocks ... bench: 673 ns/iter (+/- 15) = 24344 MB/s
test aes256_encrypt_block ... bench: 1,673 ns/iter (+/- 37) = 9793 MB/s
test aes256_encrypt_blocks ... bench: 686 ns/iter (+/- 15) = 23883 MB/s
test aes256_new ... bench: 13 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 15 measured; 0 filtered out; finished in 28.50s
The only thing that really changed was that I moved the call to .map on [__m128i; N] array from the expand_key and inv_expanded_keys functions to the get_enc_backend and get_dec_backend functions. But also the original array was kept around now (as a copy), whereas before it was consumed.
Compiler also can sometimes have difficulties with optimizing array::map
-based code, especially with large arrays, because the method uses additional code to be safe against potential panics in passed closure.
I've added VAES
support to the autodetection.
This required refactoring some parts of the autodetection code to handle in a cleaner way.
In order to handle VAES
on stable I resorted to using inline asm!
. One downside to this approach is that the asm!
macro complains that zmm
registers are not usable on i686
targets (despite the fact that it's possible to compile for that with the intrinsics).
To work around that I just feature gated the VAES
backend to only work on x86_64
targets. I doubt anyone will ever actually want to compile with this feature for 32-bit targets anyway, but in the future it could be possible by switching back to the intrinsics.
I didn't change the hazmat
code to include VAES
since I wasn't sure if you want to increase the block size there or not.
I think this addresses basically all of the feedback now?
Two more small changes:
--cfg disable_avx512
check which will force the 256-bit VAES backend even if AVX512 is available. This may be useful in scenarios where downclocking from AVX512 is an issue. It's also useful for benchmarking.@tarcieri @newpavlov Do you intend to merge this?
I'd generally be in favor but it's definitely a large PR. Sorry it's gone by the wayside. I will hopefully have time to review soon. Also curious to know what @newpavlov thinks.
Thanks.
I would like to resume working the RISC-V and ARMv9 PRs (especially the latter will be relevant soon since Apple Silicon M4 is ARMv9 with SVE2/SME) but prefer to see how this one lands first before putting a lot more effort into those.
I've made several changes since the recent feedback:
autodetect
and the architecture modules.asm!
, which allowed adding pure, readonly
for those blocks.At this point I would actually prefer not to focus much more on refactoring the algorithms (re: experimenting with block counts, broadcasting, etc).
I've put quite a lot of time into this PR already and the performance gains are pretty reasonable I think. There's always room in the future for more fine-tuning.
I'm still willing to address remaining design issues though.
@silvanshade Yes, I think it's better to experiment with minor modifications in separate PRs. I will try to fully review the code this week (likely during weekend) and probably will merge it after that (I can fix minor nits myself if needed).
@tarcieri @newpavlov Any updates on this?
@silvanshade why did you close this? It seemed pretty close to complete.
I closed it because I still haven't gotten a thorough review and discussion about the implementation, even though I've repeatedly addressed all of the smaller feedback to the best of my ability.
From my perspective, there is no real evidence that this PR is "close to complete".
I thought it was basically complete months ago and asked for feedback then, and waited, and nothing happened.
I realize that maintainers are often very busy with other things but I think that it should have been possible by now to get a more concrete idea about whether this is ever likely to be merged and if not, what are the blockers.
The last substantive exchange with @newpavlov suggested I fundamentally misunderstood something about the implementation, and that was never clarified.
So I just don't think it's a good use of time to continue.
If you think otherwise, what would you suggest?
@newpavlov's last comment, as of two weeks ago, was:
I will try to fully review the code this week (likely during weekend) and probably will merge it after that (I can fix minor nits myself if needed).
It sounds like he wanted to just do one final pass before merging.
@silvanshade can you please reopen and we can get this merged?
@silvanshade can you please reopen and we can get this merged?
I think it would be more productive to re-open it if or when there's a final review.
Sorry for the delay! I couldn't find enough time during the previous weekend, so I will try again on this one.
Closing PR makes it less visible and increases chances of forgetting about it, so I will reopen.
@newpavlov Thanks for the update. Unfortunately I've deleted the branch and no longer wish to contribute to this project.
This PR adds support for using
VAES
intrinsics for theni
backend for theaes
8-fold operations.The change shows a nice speed up on Zen4 CPUs at least.
Benchmarks (Ryzen 7950x):
RUSTFLAGS="-C target-cpu=native" cargo bench
:RUSTFLAGS="-C target-cpu=native" cargo bench --features vaes
:I experimented with changing
ParBlocksSize
to32
and unfolding the loop more for theVAES
case to see if it made a difference, but at least on Zen4 it didn't seem to matter.One thing I noticed is that it is quite important that the
target-cpu
is set correctly, otherwise the performance can be bad:cargo bench --features vaes
:Regarding adding the
vaes
feature to theCargo.toml
, rather than usingcpufeatures
, I couldn't figure out a way to structure the addition of this functionality cleanly otherwise.This is partly due to the fact that some of the instructions are gated behind
stdsimd
.Also, as noted in another thread on the Rust forums, there isn't really a way to handle the negation of a case for
target_feature
, so it would be difficult to figure out how to override the selection of the usualni
8-fold operations with thevaes
versions.But if anyone has suggestions on how to structure this better I'd be happy to make those changes.