aws / aws-graviton-getting-started

Helping developers to use AWS Graviton2, Graviton3, and Graviton4 processors which power the 6th, 7th, and 8th generation of Amazon EC2 instances (C6g[d], M6g[d], R6g[d], T4g, X2gd, C6gn, I4g, Im4gn, Is4gen, G5g, C7g[d][n], M7g[d], R7g[d], R8g).
https://aws.amazon.com/ec2/graviton/
Other
878 stars 198 forks source link

Performance Benchmarking Result of Kafka with CLient Encryption in Graviton is Worse compared to Non-Graviton #110

Closed h3nd24 closed 2 years ago

h3nd24 commented 3 years ago

Hi, we were trying to do a fixed load performance test Kafka on Graviton (r6g.large) vs non-Graviton (r5.large) and it seems that Kafka on Graviton is doing way worse than it's Non-Graviton counterpart (only around half the throughput). The setup:

Here are the flame graph of the two runs, sampled at 1000hz for 1 minute during load in the zip file. flame-G is for Graviton node and flame-NG is for non-Graviton node. FlameGraphs.zip

Seems to me that in Graviton we spend much more time in encoding the encrypted message. Is there a known issue and workaround for this? For additional information, I did the same benchmarking setup except I turned off the client encryption. The result is that Graviton performs better than Non-Graviton counterpart. Find attached the benchmarking result BenchmarkingResult.tar.gz

Thanks for your help.

sebpop commented 3 years ago

Looking through the flamegraphs, the issue seems to be in bufferCrypt() https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/javax/crypto/CipherSpi.java#L749 The function takes 15.3% of all samples on non-Graviton and 28.59% of samples on Graviton. bufferCrypt() calls more native functions in non-Graviton: jbyte_disjoint_array, ghash_processBlocks, counterMode_AESCrypt than on Graviton: aescrypt_encrypt We need to dive in and see which of the native functions need to be ported to arm64.

h3nd24 commented 3 years ago

Stumbled across this article and decided to give it a go https://aws.amazon.com/blogs/opensource/introducing-amazon-corretto-crypto-provider-accp/ . The result looks much better than before, whereby there are improvements for both Graviton and Non-Graviton, but the improvement on Graviton is of much higher ratio that it's raw performance is better than Non-Graviton now. Find attached the result BenchmarkingResultWithACCP.tar.gz

In a sense it seems to me ACCP is kinda doing some native instructions already (via openSSL), and that definitely brings a lot of advantages.

simonis commented 3 years ago

History

OpenJDK has intrinsic support on both, x86 and aarch64, for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs which are used in LibraryCallKit::inline_aescrypt_Block() as substitures for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.

However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:

8143925: Enhancing CounterMode.crypt() for AES Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode. http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1 http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772 From the issue summary:

The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions. Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline. We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.

The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().

8177784: Use CounterMode intrinsic for AES/GCM http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f

This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.

Later 8143925 was further improved for AVX512 and the Vector AES instructions

8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe From the issue summary:

As per the Intel Architecture Instruction Set Reference, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDE to confirm that encoding and semantics are correctly implemented.

The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().

ToDo

As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt() to track this in OpenJDK upstream.

Arm® NeoverseTM N1 Software Optimization Guide, p.58 mentions the following:

4.6 AES encryption/decryption Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least four data chunks should be interleaved for maximum performance:

AESE  data0, key0
AESMC data0, data0
AESE  data1, key0
AESMC data1, data1
AESE  data2, key0
AESMC data2, data2
AESE  data3, key1
AESMC data3, data3
AESE  data0, key0
AESMC data0, data0
...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when they are adjacent in the program code and both instructions use the same destination register.

Does it make sense to process more than 4 blocks in parallel?

I also read about the ARM SVE2-AES extension which can probably used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructions but it looks like SVE2-AES will only become available in ARMv9.

AWSNB commented 3 years ago

That’s an astute analysis

To your question: I think 4 parallel AES should suffice on Graviton2 but I will double check with AWS internal folks and come back with recommendation to make it more generic for future cores

At this point, NEON implementation should suffice and no need for SVE/SVE2 versions as they won’t change the perf outcome

Sent from my iPhone

On May 31, 2021, at 10:23 AM, Volker Simonis @.***> wrote:



History

OpenJDK has intrinsic support on both, x86 and aarch64, for the basic AES block encryption/decryption operations. It's implemented in the corresponding generate_aescrypt_encryptBlock()/generate_aescrypt_decryptBlock() stubs which are used in LibraryCallKit::inline_aescrypt_Block() as substitures for implEncryptBlock()/implDecryptBlock() in the class com.sun.crypto.provider.AESCrypt.

However, x86 has some additional optimizations/intrinsics for the AES "Counter" mode (both "AES/CTR" and "AES/GCM", i.e. "Galois/Counter Mode") which are missing on aarch64 and led to the observed performance degradation on aarch64. They were introduced by the following changes:

8143925: Enhancing CounterMode.crypt() for AEShttps://bugs.openjdk.java.net/browse/JDK-8143925 Add intrinsic for CounterMode.crypt() to leverage the parallel nature of AES in Counter(CTR) Mode. http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/cb31a76eecd1 http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/72f54de44772 From the issue summary:

The request is to leverage the parallel nature of AES in Counter (CTR) Mode. In a single threaded implementation, this can be achieved by issuing independent x86 AES-NI instructions. Presently, there is an intrinsic for AESCrypt.implEncryptBlock(), which is called by CounterMode.crypt() method. However, the intrinsic works on one block at a time. The x86 AES-NI instructions have a latency of 6 or 7 clocks depending on the architecture. Since every AESENC instructions issued by this intrinsic is dependent on the earlier one, it does not take advantage of the CPU pipeline. We can optimize the performance of CounterMode.crypt() method by 4x-6x by issuing independent instructions from up to 6 blocks in parallel.

The change intrinsifies com.sun.crypto.provider.CounterMode::implCrypt() if and only if CounterMode's embedded cipher is of type com.sun.crypto.provider.AESCrypt. The stub for x86_64 is implemented in generate_counterMode_AESCrypt_Parallel().

8177784: Use CounterMode intrinsic for AES/GCMhttps://bugs.openjdk.java.net/browse/JDK-8177784 http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/0c8f43317c1f

This is a platform independent change to extend 8143925, which initially only applied to AES/CTR, to also work for AES/GCM. Mentioned here only for completeness.

Later 8143925 was further improved for AVX512 and the Vector AES instructions

8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructionshttps://bugs.openjdk.java.net/browse/JDK-8233741 https://hg.openjdk.java.net/jdk/jdk/rev/c6a789f495fe From the issue summary:

As per the Intel Architecture Instruction Set Referencehttps://software.intel.com/sites/default/files/managed/ad/01/253666-sdm-vol-2a.pdf, p.156-159 Vector AES (VAES) Operations will be supported in future Intel ISA. I would like to contribute an optimization for AES-CTR algorithm using AVX512+VAES instructions. This optimization is for x86_64 architecture that have AVX512-VAES enabled. I ran jtreg test suite with the algorithm on Intel SDEhttps://software.intel.com/en-us/articles/intel-software-development-emulator to confirm that encoding and semantics are correctly implemented.

The new, vectorized intrinsic for CounterMode::implCrypt() on x86_64 is implemented in generate_counterMode_VectorAESCrypt().

ToDo

As Intel mentioned in their change for 8143925, their AES instructions have a latency of 6/7 clock cycles, so they process up to 6 blocks in parallel in 8177784 to completely fill the pipeline. Depending on the latency of the AES instructions on Graviton 2, we should implement a similar intrinsic for aarch64 as well. I've created 8267993: [aarch64] Implement intrinsic for CounterMode::implCrypt()https://bugs.openjdk.java.net/browse/JDK-8267993 to track this in OpenJDK upstream.

Arm® NeoverseTM N1 Software Optimization Guidehttps://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd?token=, p.58 mentions the following:

4.6 AES encryption/decryption Neoverse N1 can issue two AESE/AESMC/AESD/AESIMC instruction every cycle (fully pipelined) with an execution latency of two cycles. This means encryption or decryption for at least four data chunks should be interleaved for maximum performance:

AESE data0, key0 AESMC data0, data0 AESE data1, key0 AESMC data1, data1 AESE data2, key0 AESMC data2, data2 AESE data3, key1 AESMC data3, data3 AESE data0, key0 AESMC data0, data0 ...

Pairs of dependent AESE/AESMC and AESD/AESIMC instructions are higher performance when they are adjacent in the program code and both instructions use the same destination register.

Does it make sense to process more than 4 blocks in parallel?

I also read about the ARM SVE2-AES extensionhttps://developer.arm.com/documentation/ddi0602/latest/SVE-Instructions/AESE--AES-single-round-encryption- which can probably used to implement something similar to 8233741: AES Countermode (AES-CTR) optimization using AVX512 + VAES instructionshttps://bugs.openjdk.java.net/browse/JDK-8233741 but it looks like SVE2-AES will only become available in ARMv9.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/aws/aws-graviton-getting-started/issues/110#issuecomment-851606569, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFTRWCLWNMLJCVZGWMM4JRTTQPAV7ANCNFSM45MJG3FA.

navyxliu commented 3 years ago

hi, @h3nd24 Are you using Corretto-11 or Corretto-8?

charlese-instaclustr commented 3 years ago

Are you using Corretto-11 or Corretto-8? @navyxliu This was with Corretto-11

navyxliu commented 2 years ago

Here is the PR of Interleave GCM. https://github.com/openjdk/jdk11u-dev/pull/410

please note it's off by default, user needs to explicitly enable it by -XX:+UseAESCTRIntrinsics.

navyxliu commented 2 years ago

This patch has been backported to openjdk 11.0.14. We expect to see 8x speedup of GCM encrypt/decrypt on Gravition2. Next move, we will evaluate the performance gain on kafka.

navyxliu commented 2 years ago

Here is the result of corretto-11 nightly build on r6g instance.

java -jar ./target/benchmarks.jar -jvm ../../amazon-corretto-11.0.14.1.0-linux-aarch64/bin/java -jvmArgs '-XX:+UnlockDiagnosticVMOptions -XX:+UseAESCTRIntrinsics' org.openjdk.bench.javax.crypto.small.AESGCMBench.*

for dataSize/keyLength = (1024/128), we can see 3.5~5x more thoughput.

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  727,936.577 ±  803.232  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  580,544.738 ± 2154.782  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  929,792.800 ± 2149.501  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  893,095.915 ± 1069.300  ops/s

Benchmark                     (dataSize)  (keyLength)  (provider)   Mode  Cnt       Score      Error  Units
AESGCMBench.decrypt                 1024          128              thrpt   40  173,051.939 ±  701.488  ops/s
AESGCMBench.decryptMultiPart        1024          128              thrpt   40  162,495.531 ± 1512.809  ops/s
AESGCMBench.encrypt                 1024          128              thrpt   40  180,853.164 ± 1102.398  ops/s
AESGCMBench.encryptMultiPart        1024          128              thrpt   40  175,422.285 ± 2028.631  ops/s
navyxliu commented 2 years ago

the preview link, expire in 7 days

geoffreyblake commented 2 years ago

Resolving as the JDK has the necessary backports, and commits to other projects to use the flag have been committed.