dot-asm / cryptogams

CRYPTOGAMS distribution repository
Other
56 stars 16 forks source link

Questions about powerpc implementation of Poly1305 #11

Closed Mr-Draco closed 3 years ago

Mr-Draco commented 3 years ago

Hi Andy (@dot-asm),

I have been looking at your implementation of Poly1305 for PowerPC [0] and have a few questions.

  1. How does your general purpose facility implementation of Poly1305 add the secret key to the accumulator as suggested by the pseudocode “a += s” in the IETF protocols [1]?
  2. Is the s1 register variable used to perform the modulo operation as suggested by the pseudocode “(r * a) % p” in the IETF protocols [1]?
  3. What is the purpose of the poly1305_emit label? a) The name suggests that it is to produce the output of the poly1305 algorithm. However, it is unclear to me the purpose of the base2_26 notations in the poly1305_emit label.
  4. How do the operations performed by the __poly1305_mul label that is used by the __poly1305_blocks_vsx label differ from the operations performed by the Loop label that is used by the Lpoly1305_blocks label? a) It seems to me that the operations are very similar and that there is little use of the vector facilities.
  5. Can Poly1305 algorithm be accelerated by using the vector facility? a) The loop part of the algorithm is dependent on the accumulator value of the previous loop, so I am not sure if Poly1305 can be accelerated by using the vector facility. This would suggest why the operations for the Loop label and the __poly1305_mul label are similar.

References: 0: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl 1: https://datatracker.ietf.org/doc/html/rfc8439#section-2.5.1

noloader commented 3 years ago

Hi @MrDracoG,

I believe the base2_26 is due to a constant time algorithm, and using 5 each chunks 26-bit blocks = 130-bits. That's the ceiling of the 128-bit mac calculation.

Sorry I can't answer your other questions.

dot-asm commented 3 years ago

I believe the base2_26 is due to a constant time algorithm, and using 5 each chunks 26-bit blocks = 130-bits. That's the ceiling of the 128-bit mac calculation.

Quoting commentary:

On side note, Power ISA 2.07 enables vector base 2^26 implementation, ... Improvement is ~40% over -m64 result.

In other words, 2^26 radix/base is to be used in vector implementation, and the reference to the measured result means that the vector code path was in fact implemented. All code paths are constant-time, radix/base is not about it. Radix/base choice is governed by target ISA capabilities. For example on x86_64 you can spot 2^64, 2^26 and 2^44 radices...

This naturally answers the "Can Poly1305 algorithm be accelerated by using the vector facility?" question. It was accelerated.

Just in case. One naturally can't be more constant-time than the underlying platform, more specifically multiplication instructions. In other words, constant-time-ness is really the platform's problem, but it's not this implementation's.

dot-asm commented 3 years ago

How does your general purpose facility implementation of Poly1305 add the secret key to the accumulator as suggested by the pseudocode “a += s” in the IETF protocols [1]?

It's performed in poly1305_emit, search for "accumulate nonce" commentary line. Caller is expected to convert s to 32-bit words.

dot-asm commented 3 years ago

Is the s1 register variable used to perform the modulo operation as suggested by the pseudocode “(r * a) % p” in the IETF protocols [1]?

I wouldn't describe it this way. I mean the "used to perform" wording suggests that it's the only thing it takes to perform the operation. s1 is just a pre-computed value, but it's not like it has to be pre-computed. In other words it's an optimization of a thing, not the thing itself...

dot-asm commented 3 years ago

What is the purpose of the poly1305_emit label?

Ultimately to convert the result from intermediate presentation to platform-neutral form. Yes, as already mentioned, it also accumulates nonce, but it's just a convenient spot to do it.

a) The name suggests that it is to produce the output of the poly1305 algorithm. However, it is unclear to me the purpose of the base2_26 notations in the poly1305_emit label.

The radix can change, and the emit procedure reconciles the differences.

dot-asm commented 3 years ago

How do the operations performed by the __poly1305_mul label that is used by the __poly1305_blocks_vsx label differ from the operations performed by the Loop label that is used by the Lpoly1305_blocks label? a) It seems to me that the operations are very similar and that there is little use of the vector facilities.

If you have just one (or few) 16-byte block(s) to process, it makes no sense to process it with vector instructions. Because it will be slower. In other words, the vectorized loop is used only when there is sufficient amount of data to process. So the correct question is not "what it does" per se, but "when it does." :-)

Mr-Draco commented 3 years ago

@noloader and @dot-asm, thank you for your responses.

@dot-asm, I understand that it wouldn't be beneficial to switch to a vectorized algorithm on a small message. However, from what I can tell in the 64 bit implementation, the poly1305_splat label [0] and the __poly1305_mul label [1] do not use vector registers or operations anyways, so, in the 64 bit implementation, what's the point of using the poly1305_blocks_vsx label [2] at all? What is being accelerated in the algorithm as a result of the __poly1305_blocks_vsx label in the 64 bit implementation?

0: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L874 1: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L836 2: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L910

dot-asm commented 3 years ago

I understand that it wouldn't be beneficial to switch to a vectorized algorithm on a small message.

How do you go from this to "what's the point"? Above is the point. Well, its logical negative, it's beneficial to use vectorized implementation with long messages... As already said...

Mr-Draco commented 3 years ago

However, from what I can tell in the 64 bit implementation, the poly1305_splat label [0] and the __poly1305_mul label [1] do not use vector registers or operations anyways, so, in the 64 bit implementation, what's the point of using the poly1305_blocks_vsx label [2] at all? What is being accelerated in the algorithm as a result of the __poly1305_blocks_vsx label in the 64 bit implementation?

Currently, I do not see how the 64 bit implementation of the poly1305_splat label and the __poly1305_mul label use any vectorization. Thus, if I think there is no vectorization happening in the poly1305_splat and the poly1305_mul labels, then why use the __poly1305_splat and the poly1305_mul labels.

The question isn't, "What's the point of vectorization?" nor is the question, "Why switch to using a vectorized implementation?" It is, "What's the point of using the poly1305_blocks_vsx label in the 64 bit implementation, if the two labels it calls, poly1305_splat and poly1305_mul, don't use vectorization?" If the __poly1305_splat and poly1305_mul do use vectorization, then the question is, "How are they vectorizing the algorithm?"

noloader commented 3 years ago

@MrDracoG,

What's the point of using the __poly1305_blocks_vsx label in the 64 bit implementation

VSX provides 64-bit loads and stores. VSX is an add-on to POWER7, and it is part of core POWER8. Prior to VSX, you needed to perform 32-bit loads or 16-byte aligned loads for Altivec. And I've never come across VSX+POWER7 in the field. Maybe Andy has come across it.

The odd thing (to me) is, VSX provides the 64-bit loads and stores, but there are no useful 64-bit vector operations like addm and subm. You need POWER8 for the useful 64-bit vector operations. In my mind's eye, there's little reason to partition around VSX since you need POWER8 for the good stuff anyways.

I believe what Andy does after the 128-bit load is, break the value into the 5 base-26 values - which fit into a 32-bit element - and operate on the 32-bit element. 32-bit Altivec units have been around since the early days.

dot-asm commented 3 years ago

The odd thing (to me) is, VSX provides the 64-bit loads and stores, but there are no useful 64-bit vector operations like addm and subm

Well, VSX is a little bit ambiguous term, because it was extended but changes were not reflected in nomenclature. More specifically VSX itself was defined in 2.06 and extended in 2.07, among other things with 64-bit additions and subtractions, with the former being actually instrumental for this implementation. I mean it takes a 2.07-compliant processor such as POWER8 to execute it. Formally speaking one can argue that _vsx207 would be a more appropriate suffix...

dot-asm commented 3 years ago

"What's the point of using the poly1305_blocks_vsx label in the 64 bit implementation, if the two labels it calls, poly1305_splat and __poly1305_mul, don't use vectorization?"

If you see no point, then just don't use it:-) Nobody is twisting your arm... On more serious note, come on, for the 3rd time, when are they called? You also make it sound like calling these subroutines is all there is to it, which is as far from truth as it can get.

Mr-Draco commented 3 years ago

thank you @noloader