furkanturan / NaCl-Hardware

NaCl Hardware Implementation as a cryptographic coprocessor with example use on a Linux user space application

benchmarks? #1

Open cjdelisle opened 7 years ago

cjdelisle commented 7 years ago

This looks like a very interesting project (I develop a VPNish routing software called cjdns which uses salsa20/poly1305) but I was wondering if you had carried out any benchmarks and whether you have a theory about how the principle should scale with different size FPGAs. Thanks.

furkanturan commented 7 years ago

This was my master's thesis project at KU Leuven. The benchmarks compare the hardware-accelerated operation against the original software implementation running on the Zynq's ARM core. Two things were measured: first, the effect on execution time when encrypting/decrypting arbitrary messages, and second, the bandwidth increase when the coprocessor is integrated into SigmaVPN. The links below lead to the paper, where you can find the comparison tables. (Maybe I will just put the links in the README file as well.) Just copying the conclusion here:

In this paper we designed a hardware-based coprocessor to accelerate the cryptographic operations of SigmaVPN, an open-source software VPN solution. The design was implemented on a Xilinx Zynq-7010 SoC, with SigmaVPN running on top of Linux on the ARM processing system, and the coprocessor programmed into the programmable logic. Our evaluation shows that the coprocessor improves the performance of cryptographic operations with increasing gains for larger Ethernet frames. Our coprocessor encrypts a 1024-byte frame in 93% less time when compared to the software-only solution, even though it suffers from overheads in hardware-software communication. Its integration with SigmaVPN offers the TCP and UDP bandwidth increase by a factor of 4.36 and 5.36 respectively for 1024-byte Ethernet frames.

https://www.esat.kuleuven.be/cosic/publications/article-2693.pdf https://www.esat.kuleuven.be/cosic/publications/thesis-271.pdf
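To put the "93% less time" figure from the quoted conclusion on the same scale as the bandwidth factors, it can be converted into a speedup factor. The snippet below is only a small arithmetic sketch; the function name is made up for illustration and is not part of the project.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.93 means 93% less time)
    into a speedup factor (old time divided by new time)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 93% less time for a 1024-byte frame, as reported in the conclusion above.
crypto_speedup = speedup_from_time_reduction(0.93)
print(f"crypto operation speedup: {crypto_speedup:.1f}x")
```

This works out to roughly 14x for the cryptographic operation alone, which is larger than the 4.36x (TCP) and 5.36x (UDP) bandwidth gains reported for the integrated system. Presumably the rest of the VPN data path, which the coprocessor does not accelerate, accounts for the difference.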

On scaling with different size FPGAs: I did not run it on a different size FPGA; I used the smallest Zynq device available. At first I was running a single Salsa20 unit and a single Poly1305 unit, but parallelization was possible, so I doubled them. However, I couldn't get double the performance: because my FPGA was small, I hit an excessive resource utilization problem and got a longer critical path. Eventually I had to reduce the clock frequency, and could only get something like a 20% improvement for 1024-byte Ethernet frames. If a bigger FPGA is available, two, three, or more processing units can be instantiated to improve the execution time. However, the execution time will not improve in direct proportion to the number of units, since there are overheads before a frame is processed: transferring it and calculating the message-specific encryption key.
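The scaling argument above can be sketched as a simple Amdahl-style model: a fixed per-frame overhead (transfer plus key setup) that does not shrink, and a processing part that is divided across the units and stretched if the clock has to be lowered. The split between overhead and processing and the clock ratio used below are illustrative assumptions, not measurements from the thesis or the paper.

```python
def frame_time(units: int, clock_ratio: float = 1.0,
               overhead: float = 0.4, processing: float = 0.6) -> float:
    """Normalized per-frame time for `units` parallel processing units.

    overhead    : share of time that does not parallelize (assumed value)
    processing  : share of time spent in the Salsa20/Poly1305 units (assumed value)
    clock_ratio : achieved clock relative to the single-unit design
                  (below 1.0 if the critical path got longer)
    """
    if units < 1 or clock_ratio <= 0:
        raise ValueError("units must be >= 1 and clock_ratio must be > 0")
    return overhead + processing / (units * clock_ratio)


def improvement(units: int, clock_ratio: float = 1.0, **shares: float) -> float:
    """Fractional reduction in frame time relative to one unit at full clock."""
    baseline = frame_time(1, 1.0, **shares)
    return 1.0 - frame_time(units, clock_ratio, **shares) / baseline


# Two units at full clock versus two units at a reduced clock (hypothetical 0.8).
for ratio in (1.0, 0.8):
    print(f"2 units, clock x{ratio}: {improvement(2, ratio):.1%} less time")
```

In this model the improvement can never exceed the processing share, however many units are instantiated, so a bigger FPGA helps most when the transfer and key-setup overheads are small compared with the time spent in the processing units.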