AVX512: Code is slower than C

smuellerDD commented 7 months ago

The code for AVX512 support is slower on my system (11th Gen Intel(R) Core(TM) i7-1165G7) by some 10% compared to a pure C implementation. The culprit are the definitions of the constants for each round operation such as https://github.com/ascon/ascon-c/blob/f1601cb5ff52e65baa475fcc6959e7d6e0be8d77/crypto_aead/ascon128v12/avx512/round.h#L9. When moving the constant definitions out of the loop as performed in [1], the AVX512 code runs faster by some 10% compared to C.

PS: The AVX512 code also works nicely for hashes/XOFs - is there any reason why it is not added for those?

[1] https://github.com/smuellerDD/leancrypto/blob/master/hash/src/ascon_avx512.c#L29

mschlaeffer commented 7 months ago

Thanks for observing this! Does changing the constant definition to "const" also solve the issue?

Hash/XOF is something which just has not been done yet. But we'll add that as well.

smuellerDD commented 7 months ago

Am Dienstag, 9. April 2024, 13:19:51 CEST schrieb Martin Schläffer:

Hi Martin,

Thanks for observing this! Does changing the constant definition to "const" also solve the issue?

Hash/XOF is something which just has not been done yet. But we'll add that as well.

I will test it later, but I doubt it because I think that the code has to load the zmm register with the data every time the round is called.

If it would be possible to define it as static const (i.e. a fully compile- time constant), it should work. But the compiler (and perhaps it is not possible at all) cannot create the zmm register constant that can be used without another mov operation.

Ciao Stephan

mschlaeffer commented 7 months ago

I wonder why the compiler does not move the mov to zmm outside the loop, as it's done for normal C code? Maybe it's not able to look inside _mm512_set_epi64. Or the intrinsic results in a zmm mov which has to be implemented and is not allowed to be optimized.

smuellerDD commented 7 months ago

Am Dienstag, 9. April 2024, 13:41:23 CEST schrieb Martin Schläffer:

Hi Martin,

I wonder why the compiler does not move the mov to zmm outside the loop, as it's done for normal C code? Maybe it's not able to look inside _mm512_set_epi64. Or the intrinsic results in a zmm mov which has to be implemented and is not allowed to be optimized.

Let me check the changes in the disassembled compiled code when the definitions are inside / outside the inline functions.

Ciao Stephan

mschlaeffer commented 7 months ago

I just compiled using gcc 11.4.0 and disassembled (command attached). I only get the 10 expected avx512 instructions in each unrolled loop iteration. No mov in between. Note that a single additional instruction per round would already result in 10% overhead.

gcc -march=icelake-client -O2 -fomit-frame-pointer -Icrypto_aead/ascon128v12/avx512 crypto_aead/ascon128v12/avx512/aead.c -Itests -c && objdump -d aead.o

smuellerDD commented 7 months ago

Am Dienstag, 9. April 2024, 13:19:51 CEST schrieb Martin Schläffer:

Hi Martin,

Thanks for observing this! Does changing the constant definition to "const" also solve the issue?

I now came around to test it: when moving the definitions back inside the round inline function, but marked with const, the performance of AVX512 is now still higher than C.

Hash/XOF is something which just has not been done yet. But we'll add that as well.

Ciao Stephan

smuellerDD commented 7 months ago

Am Dienstag, 9. April 2024, 19:17:50 CEST schrieb Martin Schläffer:

Hi Martin,

I just compiled using gcc 11.4.0 and disassembled (command attached). I only get the 10 avx512 instructions in each unrolled loop iteration. No mov in between. Note that a single additional instruction per round would already result in 10% overhead.

gcc -march=icelake-client -O2 -fomit-frame-pointer -Icrypto_aead/ascon128v12/avx512 crypto_aead/ascon128v12/avx512/aead.c -Itests -c && objdump -d aead.o

After disassembling my I code, I can confirm that I see the several vmovdqa64 instructions to fill the different zmm registers, follwed by the 10 AVX512 instructions belonging to one round.

Ciao Stephan

mschlaeffer commented 7 months ago

In the disassemble, where exactly do you see the vmovdqa64 instructions? Before every 10 AVX512 belonging to one round? Before all rounds? Or even outside the while loop to process the data? I see them outside the while loop to process the data. So the compiler actually moves the vmovdqa64 to the right place. Which commands did you use to compile and what is the difference in the disassembly then?

smuellerDD commented 7 months ago

Am Mittwoch, 10. April 2024, 17:06:19 CEST schrieb Martin Schläffer:

Hi Martin,

In the disassemble, where exactly do you see the vmovdqa64 instructions? Before every 10 AVX512 belonging to one round? Before all rounds? Or even outside the while loop to process the data? I see them outside the while loop to process the data.

My entire code is structured a bit different, because I implemented also support for an init/update/final intermediate stepping as well as that I isolated the Ascon AEAD/hashing/XOF logic from the permutation logic to allow Keccak to be plugged into Ascon (paper on that is pending).

Anyhow, I only see the vmovdqa64 only once in my entire code base and after looking through the code, I see no jumps to the mov code. I.e. in my code, the vmovdqa64 operations are done only once for a given squeeze or permutation. When I translate this to your code, it implies that only one set of vmovdqa64 are done outside the while loop consuming the input data.

That only happens once I added the const to it.

So the compiler actually moves the vmovdqa64 to the right place. Which commands did you use to compile and what is the difference in the disassembly then?

Considering that I have a slightly different code consuming the permutation logic, my code seemingly requires the const to convince the compiler to move things to the right place.

Ciao Stephan

mschlaeffer commented 7 months ago

Thanks for providing the details! I'll probably also add the const. It might make the code more robust. Moving the vmovdqa64 is more tricky in our code base but also not needed. I'll close the issue then.

ascon / ascon-c

AVX512: Code is slower than C #16