Mbed-TLS / mbedtls

An open source, portable, easy to use, readable and flexible TLS library, and reference implementation of the PSA Cryptography API. Releases are on a varying cadence, typically around 3 - 6 months between releases.
https://www.trustedfirmware.org/projects/mbed-tls/

Support for asynchronous bulk cryptography acceleration #6669

Open DemiMarie opened 1 year ago

DemiMarie commented 1 year ago

Suggested enhancement

Support asynchronous implementations for bulk encryption in TLS.

Justification

Mbed TLS needs this because many embedded systems have hardware accelerators for bulk encryption and decryption. These always have asynchronous interfaces as they run in parallel with the CPU. Busy-waiting for results will ruin performance.

gilles-peskine-arm commented 1 year ago

Asynchronous behavior is tricky when you aren't in control of the whole system. Mbed TLS is just a library, which needs to work on all reasonable operating systems (or lack thereof).

The most natural way to handle parallel accelerators is to have the thread calling the accelerator blocked waiting for the result. While that thread is blocked, the operating system schedules other threads. When the accelerator finishes, it triggers an interrupt which unblocks the calling thread.

Mbed TLS does have features for highly constrained systems that don't have a thread scheduler, or high-performance servers that have an inefficient thread scheduler. The TLS stack supports doing asymmetric cryptography asynchronously. High-performance servers don't have asynchronous accelerators for symmetric cryptography nowadays, however. I could see the interest in having support for asynchronous symmetric cryptography if the symmetric cryptography happens in an HSM — not really useful for session keys, but useful for the PSK-to-MS derivation. (However, my recommendation in that case remains to use a language with better support for multithreading than Java.)

So what's the use case here? A system that has too little RAM or too little code size for a thread scheduler, but has a cipher accelerator, and needs to handle interrupts while the accelerator is running?

DemiMarie commented 1 year ago

> So what's the use case here? A system that has too little RAM or too little code size for a thread scheduler, but has a cipher accelerator, and needs to handle interrupts while the accelerator is running?

The use case I am thinking of right now is an embedded system with a cipher accelerator that needs to transfer large amounts of data along a TLS-encrypted connection. The accelerator has very high throughput (much higher than the entire CPU cluster) but has a deep pipeline with substantial latency. To keep the accelerator’s pipeline full, it is necessary to submit many requests without waiting for results to be available. Achieving this with a blocking model would require using a large number of threads for a single connection, which would be silly.
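To put rough numbers on the pipelining argument, here is a minimal sketch of a simulated accelerator. It is an illustration only, not Mbed TLS code: `accel_t`, `accel_submit`, `accel_poll`, `run_blocking`, `run_pipelined`, and the latency and depth constants are all hypothetical, and the model assumes the hardware accepts one request per tick and finishes each one a fixed number of ticks later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accelerator model (assumption, not real hardware):
 * a request completes ACCEL_LATENCY ticks after submission and at most
 * ACCEL_DEPTH requests may be in flight at once. */
#define ACCEL_LATENCY 8
#define ACCEL_DEPTH   8

typedef struct {
    uint64_t done_at[ACCEL_DEPTH]; /* completion tick per in-flight request (FIFO ring) */
    size_t head;                   /* index of the oldest in-flight request */
    size_t count;                  /* number of in-flight requests */
    uint64_t now;                  /* simulated clock, in ticks */
} accel_t;

/* Queue one request; returns 0 on success, -1 if the pipeline is full. */
static int accel_submit(accel_t *a)
{
    if (a->count == ACCEL_DEPTH)
        return -1;
    a->done_at[(a->head + a->count) % ACCEL_DEPTH] = a->now + ACCEL_LATENCY;
    a->count++;
    return 0;
}

/* Retire the oldest request if it has finished; returns 1 if one was retired. */
static int accel_poll(accel_t *a)
{
    if (a->count == 0 || a->done_at[a->head] > a->now)
        return 0;
    a->head = (a->head + 1) % ACCEL_DEPTH;
    a->count--;
    return 1;
}

/* Blocking model: submit one record, wait for it, then submit the next. */
uint64_t run_blocking(size_t n)
{
    accel_t a = {0};
    for (size_t i = 0; i < n; i++) {
        int rc = accel_submit(&a);
        assert(rc == 0);
        (void) rc;
        while (!accel_poll(&a))
            a.now++;               /* nothing else happens while we wait */
    }
    return a.now;
}

/* Pipelined model: submit one record per tick without waiting for results. */
uint64_t run_pipelined(size_t n)
{
    accel_t a = {0};
    size_t submitted = 0, completed = 0;
    while (completed < n) {
        if (submitted < n && accel_submit(&a) == 0)
            submitted++;
        a.now++;
        while (accel_poll(&a))
            completed++;
    }
    return a.now;
}
```

Under these assumptions, 100 records take `run_blocking(100) == 800` ticks but only `run_pipelined(100) == 107` ticks. Matching the pipelined figure with blocking calls would need about eight threads for the one connection.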

gilles-peskine-arm commented 1 year ago

I still don't understand how this works: independent encryption or decryption of successive TLS records?

The support we have for asynchronicity so far is having functions return after submitting a request, and resuming work when the response is available. Even those were medium-sized projects. If I understand correctly, you're asking for some reordering, where the TLS stack would submit a request, then continue some processing. This sounds considerably harder.
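The submit-then-resume pattern described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the Mbed TLS API: `op_ctx_t`, `op_init`, `op_step`, `drive_to_completion`, and the `OP_*` codes are hypothetical names, the "hardware" is a countdown of polls, and the XOR stands in for real cryptography.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes: done, or "call me again later". */
enum { OP_DONE = 0, OP_IN_PROGRESS = 1 };

typedef enum { ST_IDLE, ST_WAITING, ST_FINISHED } op_state_t;

typedef struct {
    op_state_t state;
    unsigned polls_left;   /* stand-in for hardware latency (assumption) */
    uint32_t input;
    uint32_t result;
} op_ctx_t;

void op_init(op_ctx_t *op, uint32_t input, unsigned latency_polls)
{
    assert(op != NULL);
    op->state = ST_IDLE;
    op->polls_left = latency_polls;
    op->input = input;
    op->result = 0;
}

/* Advance the operation by one step without ever blocking. */
int op_step(op_ctx_t *op)
{
    assert(op != NULL);
    switch (op->state) {
    case ST_IDLE:
        /* Submit the request and return immediately. */
        op->state = ST_WAITING;
        return OP_IN_PROGRESS;
    case ST_WAITING:
        if (op->polls_left > 0) {
            op->polls_left--;          /* result not ready yet */
            return OP_IN_PROGRESS;
        }
        op->result = op->input ^ 0xA5A5A5A5u; /* placeholder, not real crypto */
        op->state = ST_FINISHED;
        return OP_DONE;
    case ST_FINISHED:
    default:
        return OP_DONE;                /* repeated calls are harmless */
    }
}

/* Caller-side loop: keep resuming until the operation completes.
 * Returns the number of calls that were needed. */
unsigned drive_to_completion(op_ctx_t *op)
{
    unsigned calls = 0;
    int rc;
    do {
        rc = op_step(op);
        calls++;
        /* A real caller would service other work or sleep here. */
    } while (rc == OP_IN_PROGRESS);
    return calls;
}
```

In this model only one request is ever outstanding and the caller resumes in the same order it submitted. Keeping several requests in flight and continuing other processing in between would require tracking many such contexts at once, which is where the extra complexity comes from.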

Are you aware of TLS stacks with this capability? Were they designed with it from scratch or was it added later? Are they in C or in a high-level language with better concurrency support such as Rust or Erlang?

DemiMarie commented 1 year ago

> I still don't understand how this works: independent encryption or decryption of successive TLS records?

Pretty much. That would mean that errors could be discovered while there are still other requests in flight.

> The support we have for asynchronicity so far is having functions return after submitting a request, and resuming work when the response is available. Even those were medium-sized projects. If I understand correctly, you're asking for some reordering, where the TLS stack would submit a request, then continue some processing. This sounds considerably harder.

Yeah, it does.

> Are you aware of TLS stacks with this capability? Were they designed with it from scratch or was it added later? Are they in C or in a high-level language with better concurrency support such as Rust or Erlang?

OpenSSL can use asynchronous accelerators such as Intel QAT, but I am not sure if it can achieve any concurrency within a connection, or if it relies on a large number of concurrent connections to keep the pipelines full. I am also not sure about Linux kTLS.

That said, trying to add such a complex (and likely bug-prone) state machine to mbedTLS might be a bad idea, especially in the absence of a concrete need (which I do not have).