betrusted-io / gateware

IP submodules, formatted for easier CI integration

Interest for other crypto algs in the EC25519 Engine ? #6

Open rdolbeau opened 3 years ago

rdolbeau commented 3 years ago

Hello,

For my use case of the Engine (an FPGA-based crypto accelerator device for an existing vintage host), EC25519 is nice, but I figured I could also have the other useful algorithms for SSH/SCP in the same accelerator (rather than needing Yet Another Device). So I did a draft implementation of some instructions to support AES/GCM. It includes some pclmulqdq-like instructions (multi-cycle eng_clk unit), the shifts and permutations required to support reduction and data ordering (single-cycle eng_clk unit), an AES round instruction (multi-cycle mul_clk unit) and a load-store unit for faster access to the streamed data (variable multi-cycle mul_clk unit). It's all very experimental and not very clean. It's also not optimized for area (or speed); it's just a functional implementation, presumably with room for improvement in every aspect. Endianness handling is a bit of a mess (the host is BE, but the DMA engine through which the L/S goes byte-reverses the 32-bit words to support the OHCI USB controller with an unpatched NetBSD driver...), but the draft code currently works for implementing a stand-alone version of Supercop's aes256gcmv1 test bed.
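To illustrate the endianness wrinkle described above, here is a minimal Python sketch of what byte-reversing each 32-bit word does to a buffer. It assumes a 128-bit block viewed as four 32-bit words; the function name `swap32_words` is hypothetical and is not taken from the engine code.

```python
def swap32_words(buf: bytes) -> bytes:
    """Reverse the byte order inside each 32-bit word, keeping the word order."""
    if len(buf) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    return b"".join(buf[i:i + 4][::-1] for i in range(0, len(buf), 4))


block = bytes(range(16))          # bytes 00 01 02 ... 0f
swapped = swap32_words(block)     # bytes 03 02 01 00 07 06 05 04 ...
print(swapped.hex())
```

Applying the swap twice returns the original buffer, so software on either side can undo the DMA reordering with the same operation.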

Not sure if this will be interesting to anyone but me, but I thought I'd mention it just in case. FYI, the current engine code is here, while the current programs for the Engine are here.

Cordially,

bunnie commented 3 years ago

Wow! These are really cool. Do you have any rough guesses on how many extra LUTs each of the different execution engines adds to the FPGA?

Also, I don't know if you saw this, but a bug was found in the curve25519 engine. You'll want to apply this patch:

@@ -1413,7 +1413,7 @@ carries that have already been propagated. If we fail to do this, then we re-pro
                 self.dsp_match3 &
                 self.dsp_match2 &
                 self.dsp_match1 &
-                (self.dsp_p0 >= 0x1_ffed)
+                (self.dsp_p0[:17] >= 0x1_ffed)
             )
         ]

Check out https://github.com/betrusted-io/xous-core/issues/76 for details and tracking.
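A rough illustration of what the one-line change does, in plain Python rather than Migen. In Migen, `sig[:17]` selects bits 0 through 16 of a signal, so the patched comparison only looks at the low 17 bits. The sketch assumes `dsp_p0` is wider than 17 bits and that its upper bits can be non-zero (the hunk header mentions carries that have already been propagated); the example value is made up for illustration.

```python
LOW17_MASK = (1 << 17) - 1   # equivalent of Migen's dsp_p0[:17], bits 0..16
THRESHOLD = 0x1_FFED


def match_unpatched(dsp_p0: int) -> bool:
    # Compares the whole value, including any bits above bit 16.
    return dsp_p0 >= THRESHOLD


def match_patched(dsp_p0: int) -> bool:
    # Compares only the low 17 bits, as in the patched line.
    return (dsp_p0 & LOW17_MASK) >= THRESHOLD


# Hypothetical value: a stray bit 17 set on top of a small low part.
value = (1 << 17) | 0x5
print(match_unpatched(value), match_patched(value))
```

With the full-width comparison, any set bit above bit 16 makes the test succeed regardless of the low 17 bits; the sliced comparison ignores those upper bits.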

rdolbeau commented 3 years ago

@bunnie They are huge, unfortunately. From my notes, I have a synthesis at 13263/9099 Slice LUTs/Slice Registers before, and the latest is 18277/11003 (the rest of the design should be almost identical), so about 5014 extra LUTs and 1904 extra registers in total. The code is very naive and does not try to optimize area, except for implementing things for 128 bits and doing two passes (currently the code doesn't use the upper 128-bit lane in any way).

I originally planned to use a single-cycle GHASH add-mul 128-bit operator, but didn't even try once I remembered that in my old VHDL-based design it was taking something like 40-45% of my A7-35T all on its own (and was only tested at 25 MHz).

The current implementation for GHASH still has two full 64x64 polynomial multipliers (low and high) @ 50 MHz, which are probably huge. I didn't even try to put them in the mul_clk (I probably should try rather than rely on my very, very limited HW design experience... from that one project only). They and the other instructions are basically designed to run the algorithms from Intel's book on the subject (the clmul instruction is pretty much pclmulqdq...), so there are also some pretty big shifters (which also shift in bits from the second operand, as it saves a lot of hassle). I have no idea how to reduce those (well, shifts could be done bit-by-bit).
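For readers unfamiliar with pclmulqdq-style instructions, here is a minimal Python sketch of a carry-less (polynomial over GF(2)) multiply, and of how a 128x128 product can be assembled from 64x64 products. It is a software model only, not the engine's implementation; the function names are hypothetical, and the GHASH reduction step and GCM's bit ordering are deliberately left out.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product: like schoolbook multiply, but partial products are XORed."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def clmul128(a: int, b: int) -> int:
    """256-bit carry-less product of two 128-bit values from four 64x64 products."""
    a_lo, a_hi = a & MASK64, a >> 64
    b_lo, b_hi = b & MASK64, b >> 64
    lo = clmul(a_lo, b_lo)
    hi = clmul(a_hi, b_hi)
    mid = clmul(a_lo, b_hi) ^ clmul(a_hi, b_lo)
    return lo ^ (mid << 64) ^ (hi << 128)


# (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 0b11 * 0b11 = 0b101
print(bin(clmul(0b11, 0b11)))
```

The bit-serial loop in `clmul` is also the shape of the area/latency trade-off in hardware: one partial product per cycle needs far less logic than a fully parallel 64x64 array, at the cost of 64 cycles.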

AES is now basically a 'do one round' instruction. That's easier for coding, and for 'speed' it has 4 full look-up tables (this could be cut to 1/2 or 1/4 of that reasonably easily by using 2x or 4x the cycles). So far key generation is not offloaded; it's done on the host and pushed into 15 registers (which is probably not very fast; key generation should probably be done in the Engine as well).
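A sketch of the table trade-off mentioned above, in Python. The post does not say how the four tables are organized, so this assumes the classic "T-table" formulation of an AES round, where the four tables are byte rotations of one another; under that assumption a single table plus a rotate, used four times, gives the same result. All names are hypothetical, and ShiftRows and AddRoundKey are omitted (the inputs are the four bytes feeding one output column).

```python
def xtime(a: int) -> int:
    """Multiply by x in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def sbox_entry(a: int) -> int:
    inv = 1
    for _ in range(254):            # a^254 is the inverse; 0 maps to 0
        inv = gf_mul(inv, a)
    res = inv
    for n in range(1, 5):           # affine transform: XOR of four rotations
        res ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return res ^ 0x63


def ror32(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


SBOX = [sbox_entry(a) for a in range(256)]
# T0[a] packs (2*s, s, s, 3*s) with s = SBOX[a], most significant byte first.
T0 = [(xtime(s) << 24) | (s << 16) | (s << 8) | (xtime(s) ^ s) for s in SBOX]
T1 = [ror32(t, 8) for t in T0]
T2 = [ror32(t, 16) for t in T0]
T3 = [ror32(t, 24) for t in T0]


def column_four_tables(a0: int, a1: int, a2: int, a3: int) -> int:
    """One output column using four tables (one lookup each, in parallel)."""
    return T0[a0] ^ T1[a1] ^ T2[a2] ^ T3[a3]


def column_one_table(a0: int, a1: int, a2: int, a3: int) -> int:
    """Same column using only T0, reused four times with a rotation."""
    acc = 0
    for i, a in enumerate((a0, a1, a2, a3)):
        acc ^= ror32(T0[a], 8 * i)
    return acc


def column_reference(a0: int, a1: int, a2: int, a3: int) -> int:
    """SubBytes followed by MixColumns, computed directly as a cross-check."""
    s = [SBOX[a] for a in (a0, a1, a2, a3)]
    out = [
        gf_mul(s[0], 2) ^ gf_mul(s[1], 3) ^ s[2] ^ s[3],
        s[0] ^ gf_mul(s[1], 2) ^ gf_mul(s[2], 3) ^ s[3],
        s[0] ^ s[1] ^ gf_mul(s[2], 2) ^ gf_mul(s[3], 3),
        gf_mul(s[0], 3) ^ s[1] ^ s[2] ^ gf_mul(s[3], 2),
    ]
    return (out[0] << 24) | (out[1] << 16) | (out[2] << 8) | out[3]
```

In hardware terms, `column_one_table` corresponds to one table memory accessed over four cycles, while `column_four_tables` corresponds to four table memories accessed in one cycle.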

Neither clmul nor AES avoids computing on the upper halves when not needed; they just throw away the result, which is probably not good power-wise. That feature is mostly needed for L/S, where the upper half is bypassed.

Thanks for the notice on the patch. Still have to do the integration in OpenSSL, which is going to be the annoying part...