Sadoon-AlBader commented 5 years ago

Very simple change in the way the poly1305 functions handle 4-bytes to 32-bit integer conversions. We de-reference the byte while casted as an unsigned integer and that automatically stores or loads the data as it should be. No need for any operations to be done on the data (shifting, ANDing, ORing, etc..) The reason why I did not optimize the other load and store functions instead of just optimizing Poly1305 was because I used "make speed" and the only code that gets speed benefits is Poly1305, everything else is either the same or regresses.

Results using "make speed" 1490 MB/s --> 1632 MB/s (AMD Ryzen Threadripper 1920X 2x16GB ECC RAM, CPU 4.1GHz, MEM 3GHz) 1334 MB/s --> 1470 MB/s (AMD Ryzen 5 2500U 2x8GB RAM, CPU 2.2GHz Turbo 3.6GHz, MEM 2.4GHz) 1281 MB/s --> 1244 MB/s (Intel i7 4500U 2x4GB RAM, CPU 1.7GHz Turbo 3.0 GHz, MEM 2.133GHz) 1461 MB/s --> 1461 MB/s (iMac 5K Late 2015, i5 6600 3.3GHz Turbo 3.9GHz, 4x16GB RAM 1.866GHz)

It seems the change favours AMD's Zen architecture more than Intel's, I will investigate further once I have more hardware on hand, but you can test this yourself by cloning and using "make speed"

Here is code that compares the two ways of converting, and you can clearly see that my version only needs 3 "mov" operations while the older code needs about 30 when you use https://godbolt.org

examplecode.c.zip

mikejsavage commented 5 years ago

clang -Wcast-align is not happy:

examplecode.c:23:13: warning: cast from 'ubyte *' (aka 'unsigned char *') to 'uint *' (aka 'unsigned int *') increases required
      alignment from 1 to 4 [-Wcast-align]
  uint b = *(uint*)a;
            ^~~~~~~~

but you can do like

static u32 load_u32( const u8 * buf ) {
    u32 res;
    memcpy( &res, buf, sizeof( res ) );
    return res;
}

and compilers are smart enough to optimise that.

I believe this also needs some extra code to handle big endian platforms

Also if that second block is a 16 byte load, memcpy( 16 ) should optimise down to movdqu

Sadoon-AlBader commented 5 years ago

but you can do like
static u32 load_u32( const u8 * buf ) {
    u32 res;
    memcpy( &res, buf, sizeof( res ) );
    return res;
}
and compilers are smart enough to optimise that.

Sorry, close was by mistake. I will try your function and see if it gives similar results. Thank you. EDIT: It is just as fast as my solution on my Threadripper machine, thank you! We do have to #include to get memcpy though, and it is 4 lines of assembly calling a function with 9 lines.

I believe this also needs some extra code to handle big endian platforms

I think a static if during compile time would be sufficient, if we cannot find an easy way to optimize on big endian machines, it would point back to the original function using #define

tankf33der commented 5 years ago

We all see you know tricks and care about CPU ticks, this is good, but you broke HPUX (PA-RISC) platform. Expected, right?
Create special form for poly and ignore chacha and EC where load32_le() also used is strange too.
Since I see D programming language in your profile much better create and maintain DUB bindings for fun and profit.

👎

Sadoon-AlBader commented 5 years ago

1. We all see you know tricks and care about CPU ticks, this is good, but you **broke** HPUX (PA-RISC) platform. Expected, right?

Like I said in the previous comment, I will use a static if to select based on endianness during compile-time.

2. Create special form for poly and ignore chacha and EC where load32_le() also used is strange too.

Again, like I said in my pull request comment, in my testing I did not find any speedups in other code.

3. Since I see D programming language in your profile much better create and **maintain** DUB bindings for fun and profit.

I assure you I will do that (nor for profit though) soon for the Poly1305 component and after that the full MonoCypher library :) It's part of my project for my independent study course. Thank you for your concerns regarding endianness and other code but please read the previous comments before repeating what @mikejsavage already excellently pointed out :)

mliberty1 commented 5 years ago

Thank you to Sadoon-AlBader for this work. Unfortunately, I think that your work on optimizing this library misunderstands the goal of Monocypher. The advantage of Monocypher is that is safe, readable, and completely portable by default. If you really want speed, use libsodium or any number of other optimized crypto libraries.

The ARM Cortex-M architectures will bus fault if you attempt an unaligned (u32) access. Now, in this case, the key should be 32-bit aligned, and all accesses appear to be at 4-byte offsets, so remain aligned. However, this change is potentially dangerous without thorough testing on all target platforms, which is untestable since you don't know! By keeping the code within the realm of defined C behavior, these operations remain safe.

Personally, I am not interested in Monocypher speed improvements that sacrifice portability or safety.

Sadoon-AlBader commented 5 years ago

Thank you @mliberty1 for your explanation, I now understand better why this optimization is not suitable for Monocypher. Have a nice weekend :)

LoupVaillant commented 5 years ago

Hi, @Sadoon-AlBader,

Don't feel too bad about this PR, I was delighted to even see such an attempt. Yes, this PR is not portable (fails on big endian architectures), and unsafe (your load is actually undefined behaviour). Still, it would seem my version does not generate optimal code, which is a bit of a problem. I'll try and see if there's a way to make it better…

What do you know, I managed to get a 5% increase just by rolling the loading loop:

diff --git a/src/monocypher.c b/src/monocypher.c
index 55a0d4c..efa04f3 100644
--- a/src/monocypher.c
+++ b/src/monocypher.c
@@ -388,10 +388,9 @@ void crypto_poly1305_update(crypto_poly1305_ctx *ctx,
     // Process the message block by block
     size_t nb_blocks = message_size >> 4;
     FOR (i, 0, nb_blocks) {
-        ctx->c[0] = load32_le(message +  0);
-        ctx->c[1] = load32_le(message +  4);
-        ctx->c[2] = load32_le(message +  8);
-        ctx->c[3] = load32_le(message + 12);
+        FOR (i, 0, 4) {
+            ctx->c[i] = load32_le(message +  i*4);
+        }
         poly_block(ctx);
         message += 16;
     }

There's a chance the loop helps with vectorisation. Anyway, thank you for pointing out this low hanging fruit. I wouldn't have seen it without your attempt. I'll push the patch right away.

mliberty1 commented 5 years ago

@Sadoon-AlBader, Optimizing a library like this for a specific platform is a great learning experience! You may have already learned something about the nasty underbelly of C - undefined behavior. If you continue to investigate, you will likely find a lot of platform-specific optimizations that cannot be upstreamed. However, you will also likely find some safe & portable optimizations that can be upstreamed. I am sure we would all welcome these optimizations!

Sadoon-AlBader commented 5 years ago

There's a chance the loop helps with vectorisation. Anyway, thank you for pointing out this low hanging fruit. I wouldn't have seen it without your attempt. I'll push the patch right away.

I'm glad I helped in some way :) Given that this is my first PR I think it could have gone much worse xD I am actually implementing Poly1305 on hardware (using VHDL so far) and your library helped me a lot in understanding the algorithm and how to optimize it (especially the remainder part), thank you!

@Sadoon-AlBader, Optimizing a library like this for a specific platform is a great learning experience! You may have already learned something about the nasty underbelly of C - undefined behavior. If you continue to investigate, you will likely find a lot of platform-specific optimizations that cannot be upstreamed. However, you will also likely find some safe & portable optimizations that can be upstreamed. I am sure we would all welcome these optimizations!

Thank you, indeed I am learning a lot, and thank you all for the welcoming comments, much appreciated :)

mliberty1 commented 5 years ago

@Sadoon-AlBader I am very interested in your VHDL implementation. I thought about creating an open source Poly1305 and Blake2b FPGA implementation for my Joulescope project, but decided to leave the authentication/decryption to the microcontroller due to time. If your work is open source, please let me know where I can follow along.

Sadoon-AlBader commented 5 years ago

@mliberty1 I will check out your project once I'm free, and will definitely let you know of anything to do with my project. I'm planning to publish it on GitLab with a mirror on GitHub, but it is in very early stages right now :)

LoupVaillant commented 5 years ago

If anyone is interested, @secworks has a Verilog implementation of Poly1305. Not VHDL, but perhaps you could find some inspiration there?

Sadoon-AlBader commented 5 years ago

If anyone is interested, @secworks has a Verilog implementation of Poly1305. Not VHDL, but perhaps you could find some inspiration there?

Thank you, will definitely check it out!

secworks commented 5 years ago

Aloha!

Note the status info for the Poly1305 work. It is not complete and does NOT work. I hope to be able to spend time later this year to complete the core.

Regards, JoachimS

On 2019-03-23 12:20, Loup Vaillant wrote:

If anyone is interested, @secworks https://github.com/secworks has a Verilog implementation of Poly1305 https://github.com/secworks/poly1305. Not VHDL, but perhaps you could find some inspiration there?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/LoupVaillant/Monocypher/pull/118#issuecomment-475861442, or mute the thread https://github.com/notifications/unsubscribe-auth/ABv4ZZ3i-orT_h8L8ijCXZQYL0AOEs72ks5vZg3ugaJpZM4cCoze.

-- Med vänlig hälsning, Yours

Joachim Strömbergson

                           Assured AB

========================================================================

Sadoon-AlBader commented 5 years ago

Aloha! Note the status info for the Poly1305 work. It is not complete and does NOT work. I hope to be able to spend time later this year to complete the core. Regards, JoachimS … On 2019-03-23 12:20, Loup Vaillant wrote: If anyone is interested, @secworks https://github.com/secworks has a Verilog implementation of Poly1305 https://github.com/secworks/poly1305. Not VHDL, but perhaps you could find some inspiration there? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#118 (comment)>, or mute the thread https://github.com/notifications/unsubscribe-auth/ABv4ZZ3i-orT_h8L8ijCXZQYL0AOEs72ks5vZg3ugaJpZM4cCoze. -- Med vänlig hälsning, Yours Joachim Strömbergson ======================================================================== Assured AB ========================================================================

Thanks for letting us know, I did fork this project https://github.com/Sadoon-AlBader/NaCl-Hardware and am planning to migrate it from Xilinx to Altera. @mliberty1 you said you would be interested in something like this :)

LoupVaillant / Monocypher

Made Poly1305 10% faster #118

Joachim Strömbergson