Closed h3nd24 closed 2 years ago
Looking through the flamegraphs, the issue seems to be in bufferCrypt(): https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/javax/crypto/CipherSpi.java#L749 The function takes 15.3% of all samples on non-Graviton and 28.59% of samples on Graviton. On non-Graviton, bufferCrypt() calls more native functions (jbyte_disjoint_array, ghash_processBlocks, counterMode_AESCrypt) than on Graviton (aescrypt_encrypt). We need to dive in and see which of the native functions need to be ported to arm64.
Stumbled across this article and decided to give it a go: https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/ . The result looks much better than before: both Graviton and Non-Graviton improve, but the improvement on Graviton is so much larger that its raw performance is now better than Non-Graviton. Find attached the result BenchmarkingResultWithACCP.tar.gz
In a sense it seems to me ACCP is kinda doing some native instructions already (via openSSL), and that definitely brings a lot of advantages.
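Since the outcome here depends on which JCA provider actually serves AES/GCM, a quick check like the one below can help when comparing runs. This is only a minimal sketch: the class name ProviderCheck and the helper providerFor are made up for illustration, and it uses only standard JCA calls. On a stock JDK I would expect it to report the built-in provider; with ACCP registered ahead of it, it should report ACCP instead (the exact provider name is not stated in this thread, so treat that as an assumption).

```java
import java.security.Provider;
import java.security.Security;
import javax.crypto.Cipher;

// Hypothetical helper class, not part of Kafka, Corretto, or ACCP.
public class ProviderCheck {
    // Returns the name of the provider the JCA selects for a transformation.
    static String providerFor(String transformation) throws Exception {
        return Cipher.getInstance(transformation).getProvider().getName();
    }

    public static void main(String[] args) throws Exception {
        // List providers in priority order; the first one that supports
        // a transformation is the one that gets used.
        for (Provider p : Security.getProviders()) {
            System.out.println("installed: " + p.getName());
        }
        System.out.println("AES/GCM/NoPadding -> " + providerFor("AES/GCM/NoPadding"));
    }
}
```

Running this inside the same JVM configuration as the broker or client shows whether the native-backed provider is really the one being picked up.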
OpenJDK has intrinsic support on both x86 and aarch64 for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs, which are used in LibraryCallKit::inline_aescrypt_Block() as substitutes for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.
However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:
8143925: Enhancing CounterMode.crypt() for AES
Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode.
http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772
From the issue summary:
The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions. Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline. We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.
The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().
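To make the "parallel nature" of CTR mode concrete, here is a minimal Java sketch (not the HotSpot intrinsic, just an illustration at the JCE level). It assumes the default provider increments the whole 16-byte counter block as a big-endian integer. The class CtrBlocksDemo and its methods are hypothetical names. The manual version computes each keystream block as AES(key, iv + i) without looking at any other block, and even walks the blocks in reverse order; that independence is what lets a stub keep several AES operations in flight at once.

```java
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; names are for illustration only.
public class CtrBlocksDemo {
    static final int BLOCK = 16; // AES block size in bytes

    // Reference result: let the JCE run AES/CTR over the whole buffer.
    static byte[] ctrViaJce(byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    // Manual CTR: keystream block i depends only on (key, iv + i).
    static byte[] ctrManual(byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher ecb = Cipher.getInstance("AES/ECB/NoPadding");
        ecb.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        byte[] out = new byte[data.length];
        int blocks = (data.length + BLOCK - 1) / BLOCK;
        // Reverse order on purpose: no block needs the result of another one.
        for (int i = blocks - 1; i >= 0; i--) {
            byte[] keystream = ecb.doFinal(addToCounter(iv, i));
            int off = i * BLOCK;
            int n = Math.min(BLOCK, data.length - off);
            for (int j = 0; j < n; j++) {
                out[off + j] = (byte) (data[off + j] ^ keystream[j]);
            }
        }
        return out;
    }

    // Adds a non-negative int to a big-endian counter block, with carry.
    static byte[] addToCounter(byte[] iv, int n) {
        byte[] c = iv.clone();
        int carry = n;
        for (int i = c.length - 1; i >= 0 && carry != 0; i--) {
            int sum = (c[i] & 0xff) + (carry & 0xff);
            c[i] = (byte) sum;
            carry = (carry >>> 8) + (sum >>> 8);
        }
        return c;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];  // all-zero demo key, never use in production
        byte[] iv = new byte[16];
        byte[] data = new byte[100];
        Arrays.fill(data, (byte) 42);
        System.out.println("match: "
                + Arrays.equals(ctrViaJce(key, iv, data), ctrManual(key, iv, data)));
    }
}
```

In contrast, a chained mode such as CBC needs the previous ciphertext block before it can encrypt the next one, so the same trick would not apply to encryption there.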
8177784: Use CounterMode intrinsic for AES/GCM http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f
This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.
Later 8143925 was further improved for AVX512 and the Vector AES instructions:
8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions
https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe
From the issue summary:
As per the Intel Architecture Instruction Set Reference, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE to confirm that encoding and semantics are correctly implemented.
The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().
As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt() to track this in OpenJDK upstream.
Arm® Neoverse™ N1 Software Optimization Guide, p.58 mentions the following:
4.6 AES encryption/decryption Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least four data chunks should be interleaved for maximum performance:
    AESE data0, key0
    AESMC data0, data0
    AESE data1, key0
    AESMC data1, data1
    AESE data2, key0
    AESMC data2, data2
    AESE data3, key1
    AESMC data3, data3
    AESE data0, key0
    AESMC data0, data0
    ...
Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when they are adjacent in the program code and both instructions use the same destination register.
Does it make sense to process more than 4 blocks in parallel?
I also read about the ARM SVE2-AES extension, which can probably be used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions, but it looks like SVE2-AES will only become available in ARMv9.
That’s an astute analysis
To your question: I think 4 parallel AES should suffice on Graviton2 but I will double check with AWS internal folks and come back with recommendation to make it more generic for future cores
At this point, NEON implementation should suffice and no need for SVE/SVE2 versions as they won’t change the perf outcome
hi, @h3nd24 Are you using Corretto-11 or Corretto-8?
Are you using Corretto-11 or Corretto-8? @navyxliu This was with Corretto-11
Here is the PR of Interleave GCM. https://github.com/openjdk/jdk11u-dev/pull/410
Please note it's off by default; the user needs to explicitly enable it with -XX:+UseAESCTRIntrinsics.
This patch has been backported to OpenJDK 11.0.14. We expect to see an 8x speedup of GCM encrypt/decrypt on Graviton2. As a next step, we will evaluate the performance gain on Kafka.
Here is the result of corretto-11 nightly build on r6g instance.
java -jar ./target/benchmarks.jar -jvm ../../amazon-corretto-11.0.14.1.0-linux-aarch64/bin/java -jvmArgs '-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics' org.openjdk.bench.javax.crypto.small.AESGCMBench.*
For dataSize/keyLength = 1024/128, we can see 3.5x to 5x more throughput.

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt        Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  727,936.577 ±  803.232  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  580,544.738 ± 2154.782  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  929,792.800 ± 2149.501  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  893,095.915 ± 1069.300  ops/s

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt        Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  173,051.939 ±  701.488  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  162,495.531 ± 1512.809  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  180,853.164 ± 1102.398  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  175,422.285 ± 2028.631  ops/s
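For anyone who wants a feel for what these numbers measure without building the JMH suite, below is a rough, hedged sketch of the same kind of operation: AES/GCM over a 1024-byte buffer with a 128-bit key. It is not the AESGCMBench source and a plain timing loop like this is far less reliable than JMH. The class GcmRoundTrip, the helper ivFor, and the iteration count are all assumptions for illustration. A fresh IV is derived per iteration because GCM IVs must not be reused with the same key.

```java
import java.nio.ByteBuffer;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch; not the JMH benchmark used above.
public class GcmRoundTrip {
    static final int TAG_BITS = 128; // GCM authentication tag length

    static byte[] encrypt(SecretKeySpec key, byte[] iv, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return c.doFinal(plain); // ciphertext followed by the 16-byte tag
    }

    static byte[] decrypt(SecretKeySpec key, byte[] iv, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        return c.doFinal(cipherText);
    }

    // Builds a unique 12-byte IV from a counter.
    static byte[] ivFor(long n) {
        return ByteBuffer.allocate(12).putInt(0).putLong(n).array();
    }

    public static void main(String[] args) throws Exception {
        SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // all-zero demo key
        byte[] plain = new byte[1024];
        int iterations = 200_000; // arbitrary; also serves as crude warm-up
        long checksum = 0;        // keeps the JIT from discarding the work
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            checksum += encrypt(key, ivFor(i), plain)[0];
        }
        long elapsed = System.nanoTime() - start;
        System.out.printf("%.0f encrypt ops/s (checksum %d)%n",
                iterations * 1e9 / elapsed, checksum);
    }
}
```

On aarch64 it can be run once with and once without `-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics` (the flags from the command above) to see whether the intrinsic makes a visible difference on a given build.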
Resolving as the JDK has the necessary backports, and commits to other projects to use the flag have been committed.
Hi, we were trying to do a fixed-load performance test of Kafka on Graviton (r6g.large) vs non-Graviton (r5.large), and it seems that Kafka on Graviton is doing way worse than its Non-Graviton counterpart (only around half the throughput). The setup:
Here are the flame graphs of the two runs, sampled at 1000 Hz for 1 minute during load, in the zip file. flame-G is for the Graviton node and flame-NG is for the non-Graviton node. FlameGraphs.zip
Seems to me that on Graviton we spend much more time in encoding the encrypted message. Is there a known issue and workaround for this? For additional information, I did the same benchmarking setup except I turned off the client encryption. The result is that Graviton performs better than its Non-Graviton counterpart. Find attached the benchmarking result BenchmarkingResult.tar.gz
Thanks for your help.