Hi @MrDracoG,
I believe the base2_26 is due to a constant-time algorithm, and using 5 chunks of 26 bits each = 130 bits. That's the ceiling of the 128-bit MAC calculation.
Sorry I can't answer your other questions.
I believe the base2_26 is due to a constant-time algorithm, and using 5 chunks of 26 bits each = 130 bits. That's the ceiling of the 128-bit MAC calculation.
Quoting commentary:
On side note, Power ISA 2.07 enables vector base 2^26 implementation, ... Improvement is ~40% over -m64 result.
In other words, the 2^26 radix/base is used in the vector implementation, and the reference to the measured result means that the vector code path was in fact implemented. All code paths are constant-time; radix/base has nothing to do with that. The radix/base choice is governed by the target ISA's capabilities. For example, on x86_64 you can spot 2^64, 2^26 and 2^44 radices...
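For illustration only (this is not the perlasm code, and the limb names are mine), here is what a radix-2^26 representation of the 130-bit accumulator looks like in C: five 26-bit limbs cover exactly 130 bits, each limb fits a 32-bit lane, and products of two limbs stay well under 64 bits.

```c
#include <stdint.h>

/* Sketch only, not poly1305-ppc.pl: a 130-bit accumulator held as five
 * 26-bit limbs ("radix 2^26"), i.e.
 *   value = h0 + h1*2^26 + h2*2^52 + h3*2^78 + h4*2^104.
 * Each limb fits a 32-bit lane, and a product of two limbs stays well
 * under 64 bits, which is what makes this radix attractive for 32-bit
 * multipliers and vector units. */
typedef struct {
    uint32_t h[5];              /* each holds 26 significant bits (plus carry slack) */
} poly1305_acc26;

/* Split a value given as lo (bits 0..63), hi (bits 64..127) and
 * top (bits 128..129) into the five limbs. */
static void to_radix26(poly1305_acc26 *a, uint64_t lo, uint64_t hi, uint32_t top)
{
    a->h[0] = (uint32_t)(lo & 0x3ffffff);
    a->h[1] = (uint32_t)((lo >> 26) & 0x3ffffff);
    a->h[2] = (uint32_t)(((lo >> 52) | (hi << 12)) & 0x3ffffff);
    a->h[3] = (uint32_t)((hi >> 14) & 0x3ffffff);
    a->h[4] = (uint32_t)(hi >> 40) | (top << 24);
}
```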
This naturally answers the "Can Poly1305 algorithm be accelerated by using the vector facility?" question. It was accelerated.
Just in case: one naturally can't be more constant-time than the underlying platform, more specifically its multiplication instructions. In other words, constant-time-ness is really the platform's problem, not this implementation's.
How does your general purpose facility implementation of Poly1305 add the secret key to the accumulator as suggested by the pseudocode “a += s” in the IETF protocols [1]?
It's performed in poly1305_emit; search for the "accumulate nonce" commentary line. The caller is expected to convert s to 32-bit words.
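For what it's worth, a hedged sketch of what that step amounts to (names are mine, not the perlasm's): once the accumulator is fully reduced, the 128-bit s is added in as four little-endian 32-bit words with carry propagation, and anything beyond bit 127 is discarded, per RFC 8439's "a += s".

```c
#include <stdint.h>

/* Illustrative sketch of the final "a += s" step, not the actual
 * poly1305_emit code: accumulator and s are both viewed as four
 * little-endian 32-bit words; the sum is taken modulo 2^128. */
static void add_nonce(uint32_t tag[4], const uint32_t acc[4], const uint32_t s[4])
{
    uint64_t t = 0;
    for (int i = 0; i < 4; i++) {
        t += (uint64_t)acc[i] + s[i];
        tag[i] = (uint32_t)t;   /* low 32 bits of the running sum */
        t >>= 32;               /* carry into the next word */
    }
    /* a carry out of the top word is simply dropped: result mod 2^128 */
}
```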
Is the s1 register variable used to perform the modulo operation as suggested by the pseudocode “(r * a) % p” in the IETF protocols [1]?
I wouldn't describe it this way. I mean the "used to perform" wording suggests that it's the only thing it takes to perform the operation. s1 is just a pre-computed value, but it's not like it has to be pre-computed. In other words it's an optimization of a thing, not the thing itself...
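To illustrate the "pre-computed value" point with a generic radix-2^26 sketch (poly1305-donna style; the exact form in poly1305-ppc.pl may differ, this is just the idea): since p = 2^130 - 5, anything that lands at limb position 5 or above can be folded back by multiplying by 5, and precomputing 5*r once per key bakes that fold into the per-block multiplication.

```c
#include <stdint.h>

/* Sketch of the optimization, not the reduction itself: with
 * p = 2^130 - 5 we have 2^130 ≡ 5 (mod p), so a partial product that
 * would land in limb position i+5 can instead be folded into limb i
 * after a multiply by 5.  Precomputing s[i] = 5*r[i] once per key lets
 * the inner loop use s[i] directly rather than multiplying by 5 for
 * every block. */
static void precompute_s(uint32_t s[5], const uint32_t r[5])
{
    for (int i = 1; i < 5; i++)         /* s[0] is never needed */
        s[i] = r[i] * 5;
}
/* Example: in h*r the term h4*r1 belongs at limb position 5 (= 2^130),
 * so the schoolbook product adds h4*s1 into limb position 0 instead. */
```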
What is the purpose of the poly1305_emit label?
Ultimately, to convert the result from its intermediate representation to a platform-neutral form. Yes, as already mentioned, it also accumulates the nonce, but it's just a convenient spot to do it.
a) The name suggests that it is to produce the output of the poly1305 algorithm. However, the purpose of the base2_26 notations in the poly1305_emit label is unclear to me.
The radix can change, and the emit procedure reconciles the differences.
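A minimal sketch of that reconciliation, under the assumption that the vector path leaves the accumulator in radix 2^26 while the scalar emit code wants radix 2^64 (variable names are mine, and the real code also handles carries and the final reduction):

```c
#include <stdint.h>

/* Illustration only: recombine five (fully carried) 26-bit limbs into
 * 64-bit words so that the same final-reduction and nonce-addition code
 * can run regardless of which code path produced the accumulator. */
static void radix26_to_radix64(uint64_t out[3], const uint32_t h[5])
{
    out[0] = (uint64_t)h[0] | ((uint64_t)h[1] << 26) | ((uint64_t)h[2] << 52);
    out[1] = ((uint64_t)h[2] >> 12) | ((uint64_t)h[3] << 14) | ((uint64_t)h[4] << 40);
    out[2] = (uint64_t)h[4] >> 24;      /* top bits of the 130-bit value */
}
```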
How do the operations performed by the __poly1305_mul label that is used by the __poly1305_blocks_vsx label differ from the operations performed by the Loop label that is used by the Lpoly1305_blocks label? a) It seems to me that the operations are very similar and that there is little use of the vector facilities.
If you have just one (or a few) 16-byte block(s) to process, it makes no sense to process them with vector instructions, because it will be slower. In other words, the vectorized loop is used only when there is a sufficient amount of data to process. So the correct question is not "what it does" per se, but "when it does." :-)
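To make the "when" concrete, a hypothetical dispatch sketch (the threshold and function names here are made up for illustration; the real code keys off the input length and CPU capability flags):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not OpenSSL's actual dispatch: the vector (VSX)
 * block routine is only worth entering once there are enough 16-byte
 * blocks to amortize setting up the vector state and the precomputed
 * key powers; short inputs stay on the scalar path. */
#define POLY1305_BLOCK  16
#define VSX_THRESHOLD   (4 * POLY1305_BLOCK)    /* assumed cut-off, purely illustrative */

void poly1305_blocks_scalar(void *ctx, const uint8_t *inp, size_t len);
void poly1305_blocks_vector(void *ctx, const uint8_t *inp, size_t len);

void poly1305_blocks_dispatch(void *ctx, const uint8_t *inp, size_t len)
{
    if (len >= VSX_THRESHOLD)
        poly1305_blocks_vector(ctx, inp, len);  /* long input: vectorized loop wins */
    else
        poly1305_blocks_scalar(ctx, inp, len);  /* short input: scalar loop is faster */
}
```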
@noloader and @dot-asm, thank you for your responses.
@dot-asm, I understand that it wouldn't be beneficial to switch to a vectorized algorithm on a small message. However, from what I can tell in the 64 bit implementation, the poly1305_splat label [0] and the __poly1305_mul label [1] do not use vector registers or operations anyways, so, in the 64 bit implementation, what's the point of using the poly1305_blocks_vsx label [2] at all? What is being accelerated in the algorithm as a result of the __poly1305_blocks_vsx label in the 64 bit implementation?
0: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L874
1: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L836
2: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl#L910
I understand that it wouldn't be beneficial to switch to a vectorized algorithm on a small message.
How do you go from this to "what's the point"? Above is the point. Well, its logical negative: it's beneficial to use the vectorized implementation with long messages... As already said...
However, from what I can tell in the 64 bit implementation, the poly1305_splat label [0] and the __poly1305_mul label [1] do not use vector registers or operations anyways, so, in the 64 bit implementation, what's the point of using the poly1305_blocks_vsx label [2] at all? What is being accelerated in the algorithm as a result of the __poly1305_blocks_vsx label in the 64 bit implementation?
Currently, I do not see how the 64-bit implementation of the poly1305_splat label and the __poly1305_mul label uses any vectorization. Thus, if there is no vectorization happening in those labels, then why use them at all?
The question isn't, "What's the point of vectorization?" nor is the question, "Why switch to using a vectorized implementation?" It is, "What's the point of using the poly1305_blocks_vsx label in the 64 bit implementation, if the two labels it calls, poly1305_splat and poly1305_mul, don't use vectorization?" If the __poly1305_splat and poly1305_mul do use vectorization, then the question is, "How are they vectorizing the algorithm?"
@MrDracoG,
What's the point of using the __poly1305_blocks_vsx label in the 64 bit implementation
VSX provides 64-bit loads and stores. VSX is an add-on to POWER7, and it is part of core POWER8. Prior to VSX, you needed to perform 32-bit loads or 16-byte aligned loads for Altivec. And I've never come across VSX+POWER7 in the field. Maybe Andy has come across it.
The odd thing (to me) is, VSX provides the 64-bit loads and stores, but there are no useful 64-bit vector operations like addm and subm. You need POWER8 for the useful 64-bit vector operations. In my mind's eye, there's little reason to partition around VSX since you need POWER8 for the good stuff anyways.
I believe what Andy does after the 128-bit load is break the value into the five base-26 values - which fit into a 32-bit element - and operate on the 32-bit elements. 32-bit Altivec units have been around since the early days.
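For what it's worth, a scalar C sketch of why 26-bit limbs in 32-bit elements suit a vector unit (a poly1305-donna-style schoolbook multiply with the 5-fold; the vector code performs the same arithmetic lane-wise across several blocks, and this is not the perlasm itself):

```c
#include <stdint.h>

/* Sketch only: with 26-bit limbs, each 32x32->64 partial product stays
 * well under 64 bits, so five of them can be summed per output limb
 * before any carry handling.  s[i] = 5*r[i] is the precomputed fold of
 * the "mod 2^130 - 5" reduction mentioned earlier. */
static void mul_radix26(uint64_t d[5], const uint32_t h[5],
                        const uint32_t r[5], const uint32_t s[5])
{
    d[0] = (uint64_t)h[0]*r[0] + (uint64_t)h[1]*s[4] + (uint64_t)h[2]*s[3]
         + (uint64_t)h[3]*s[2] + (uint64_t)h[4]*s[1];
    d[1] = (uint64_t)h[0]*r[1] + (uint64_t)h[1]*r[0] + (uint64_t)h[2]*s[4]
         + (uint64_t)h[3]*s[3] + (uint64_t)h[4]*s[2];
    d[2] = (uint64_t)h[0]*r[2] + (uint64_t)h[1]*r[1] + (uint64_t)h[2]*r[0]
         + (uint64_t)h[3]*s[4] + (uint64_t)h[4]*s[3];
    d[3] = (uint64_t)h[0]*r[3] + (uint64_t)h[1]*r[2] + (uint64_t)h[2]*r[1]
         + (uint64_t)h[3]*r[0] + (uint64_t)h[4]*s[4];
    d[4] = (uint64_t)h[0]*r[4] + (uint64_t)h[1]*r[3] + (uint64_t)h[2]*r[2]
         + (uint64_t)h[3]*r[1] + (uint64_t)h[4]*r[0];
    /* carries between d[0]..d[4] are propagated afterwards */
}
```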
The odd thing (to me) is, VSX provides the 64-bit loads and stores, but there are no useful 64-bit vector operations like addm and subm
Well, "VSX" is a somewhat ambiguous term, because it was extended but the changes were not reflected in the nomenclature. More specifically, VSX itself was defined in ISA 2.06 and extended in 2.07, among other things with 64-bit additions and subtractions, the former being actually instrumental for this implementation. I mean, it takes a 2.07-compliant processor such as POWER8 to execute it. Formally speaking, one can argue that _vsx207 would be a more appropriate suffix...
"What's the point of using the poly1305_blocks_vsx label in the 64 bit implementation, if the two labels it calls, poly1305_splat and __poly1305_mul, don't use vectorization?"
If you see no point, then just don't use it :-) Nobody is twisting your arm... On a more serious note, come on, for the 3rd time: when are they called? You also make it sound like calling these subroutines is all there is to it, which is as far from the truth as it can get.
thank you @noloader
Hi Andy (@dot-asm),
I have been looking at your implementation of Poly1305 for PowerPC [0] and have a few questions.
References:
0: https://github.com/openssl/openssl/blob/master/crypto/poly1305/asm/poly1305-ppc.pl
1: https://datatracker.ietf.org/doc/html/rfc8439#section-2.5.1