Mbed-TLS / mbedtls

An open source, portable, easy to use, readable and flexible TLS library, and reference implementation of the PSA Cryptography API. Releases follow a varying cadence, typically around 3-6 months apart.
https://www.trustedfirmware.org/projects/mbed-tls/

Degraded Performance if client and server are on the same machine. #7471

Open ishkhan42 opened 1 year ago

ishkhan42 commented 1 year ago

Summary

Handshake performance is low when the client and server run on the same machine compared to when they run on different machines; the expected behavior is much higher performance when both are on the same machine. The issue can be reproduced by following the steps below, which build and run the server and client. Comparing the performance of server.cpp (the mbedtls version) to server.py (the OpenSSL version) on the same machine highlights the difference.

System information

Mbed TLS version (number or commit id): 3.4.0
Operating system and version: Ubuntu 20.04 x86_64
Configuration (if not default, please attach mbedtls_config.h): Default
Compiler and options (if you used a pre-built binary, please indicate how you obtained it): -pedantic -fPIC -fno-exceptions -O3
Additional environment information:

Expected behavior

The performance of handshakes between the client and server when run on the same machine should be significantly higher compared to when they are on different machines.

Actual behavior

When the client and server are run on the same machine, the handshake performance is extremely low, with only 20 handshakes per second, compared to at least 10 times higher performance when run on different machines.

Steps to reproduce

  1. Download the code from this gist.
  2. Build the server with cmake . && make server.
  3. Run the server with ./server.
  4. Run the client with python3 client.py. The output should show the number of requests per second, which is roughly equivalent to the number of handshakes per second.

If you have a second machine on a network, try running the server on one machine and the client on the other. This should show a significant performance difference, provided that your machine is not the bottleneck.

In contrast, running server.py, an SSL server in Python that uses OpenSSL, on the same machine as client.py shows how much faster it is than server.cpp, which uses mbedtls.

Additional information

The main machine on which I tested this is a 64-core AMD Ryzen Threadripper.

gilles-peskine-arm commented 1 year ago

I suspect that the difference is not same machine vs different machine, but the selection of cipher suites or other TLS parameters, due to having different libraries or configuration on the two machines. Please use a tool such as Wireshark or some client logs to check what parameters are negotiated.
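
If a packet capture is inconvenient, a probe client written against the mbedtls API itself can log what was negotiated. A minimal sketch, assuming the server under test listens on 127.0.0.1:4433 (the address and port are assumptions) and skipping certificate verification for local testing only:

```c
/* Hypothetical probe client: performs one handshake against the server
 * under test and prints the negotiated protocol version and ciphersuite. */
#include <stdio.h>
#include <string.h>
#include "mbedtls/net_sockets.h"
#include "mbedtls/ssl.h"
#include "mbedtls/entropy.h"
#include "mbedtls/ctr_drbg.h"

int main(void)
{
    mbedtls_net_context server_fd;
    mbedtls_ssl_context ssl;
    mbedtls_ssl_config conf;
    mbedtls_entropy_context entropy;
    mbedtls_ctr_drbg_context ctr_drbg;
    const char *pers = "tls_probe";

    mbedtls_net_init(&server_fd);
    mbedtls_ssl_init(&ssl);
    mbedtls_ssl_config_init(&conf);
    mbedtls_entropy_init(&entropy);
    mbedtls_ctr_drbg_init(&ctr_drbg);

    mbedtls_ctr_drbg_seed(&ctr_drbg, mbedtls_entropy_func, &entropy,
                          (const unsigned char *) pers, strlen(pers));
    mbedtls_net_connect(&server_fd, "127.0.0.1", "4433",
                        MBEDTLS_NET_PROTO_TCP);   /* address is an assumption */
    mbedtls_ssl_config_defaults(&conf, MBEDTLS_SSL_IS_CLIENT,
                                MBEDTLS_SSL_TRANSPORT_STREAM,
                                MBEDTLS_SSL_PRESET_DEFAULT);
    mbedtls_ssl_conf_authmode(&conf, MBEDTLS_SSL_VERIFY_NONE); /* test only */
    mbedtls_ssl_conf_rng(&conf, mbedtls_ctr_drbg_random, &ctr_drbg);
    mbedtls_ssl_setup(&ssl, &conf);
    mbedtls_ssl_set_bio(&ssl, &server_fd,
                        mbedtls_net_send, mbedtls_net_recv, NULL);

    if (mbedtls_ssl_handshake(&ssl) == 0) {
        printf("version:     %s\n", mbedtls_ssl_get_version(&ssl));
        printf("ciphersuite: %s\n", mbedtls_ssl_get_ciphersuite(&ssl));
    }

    /* Error checking and cleanup of the TLS contexts are omitted to keep
     * the sketch short. */
    mbedtls_net_free(&server_fd);
    return 0;
}
```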

I tested locally on Ubuntu 22.04 and I can reproduce a ~x10 performance difference. I notice that openssl-to-openssl is using TLS 1.3 with TLS_AES_256_GCM_SHA384 and ECDH on x25519, whereas openssl-to-mbedtls is using TLS 1.2 with TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 and x25519. This machine has AESNI so AES is likely faster than Chacha-Poly, though given the small data size I'd expect the handshake to be the dominating factor, so the difference may just be that our default x25519 implementation is a lot slower than OpenSSL's — I haven't dug further. If that turns out to be the dominating factor, please try enabling MBEDTLS_ECDH_VARIANT_EVEREST_ENABLED in the mbedtls configuration: this is a faster implementation of x25519 (at the expense of code size, and it doesn't compile with some “exotic” compilers).
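
For reference, enabling that option is a one-line change in the build configuration; where exactly the define lives depends on how you configure mbedtls (shown here for an edit to the in-tree config header):

```c
/* In include/mbedtls/mbedtls_config.h (or injected via compiler flags),
 * in addition to the default ECDH options: */
#define MBEDTLS_ECDH_VARIANT_EVEREST_ENABLED
```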

I tried with a remote Ubuntu 22.04 machine and saw the same performance for mbedtls, but a bit slower for the remote connection for openssl (which makes sense because the second machine I used has a considerably slower CPU).

ishkhan42 commented 1 year ago

After using a profiler on the server, it's clear that the bottleneck occurs during the handshake, specifically in mbedtls_mpi_exp_mod. Despite enabling MBEDTLS_ECDH_VARIANT_EVEREST_ENABLED based on your suggestion, I didn't see any improvements. To further investigate the issue, I ran a client based on mbedtls/programs/ssl/ssl_client1.c against the mbedtls server, but unfortunately, the performance remained degraded, in fact slightly lower.

Here is the generated flamegraph (attached image: image_2023-04-25_11-13-16).

gilles-peskine-arm commented 1 year ago

Ok, thanks for digging further. So the bottleneck is the RSA signature (which, in terms of CPU consumption, is basically mpi_exp_mod). I don't see how running the client on the same machine would affect anything. The RSA signature is very much CPU-bound; it would be affected by other processes clobbering the cache, but the client is idle while the signature happens.

It's still unclear to me whether this is a generic complaint that our RSA private-key operation is too slow, that it's too slow when a TLS client is running on the same machine (which seems really weird), or some other difference. From the time profile, it really looks like the RSA private-key operation overwhelmingly dominates performance, and I don't see how the client's behavior would affect that.
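
One way to settle this is to time the raw RSA private-key operation with TLS and the network taken out entirely (mbedtls also ships programs/test/benchmark, which measures the same thing). A minimal sketch against the mbedtls 3.x API; the 2048-bit key, the 3-second window, and the all-zero digest are arbitrary assumptions:

```c
/* Hypothetical micro-benchmark: count RSA-2048 PKCS#1 v1.5 signatures per
 * second, which in CPU terms is essentially a loop around mpi_exp_mod. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "mbedtls/rsa.h"
#include "mbedtls/entropy.h"
#include "mbedtls/ctr_drbg.h"

int main(void)
{
    mbedtls_rsa_context rsa;
    mbedtls_entropy_context entropy;
    mbedtls_ctr_drbg_context drbg;
    const char *pers = "rsa_bench";
    unsigned char hash[32] = { 0 };  /* stand-in for a SHA-256 digest */
    unsigned char sig[256];          /* room for a 2048-bit signature */
    int n = 0;

    mbedtls_rsa_init(&rsa);
    mbedtls_entropy_init(&entropy);
    mbedtls_ctr_drbg_init(&drbg);
    mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy,
                          (const unsigned char *) pers, strlen(pers));
    mbedtls_rsa_gen_key(&rsa, mbedtls_ctr_drbg_random, &drbg, 2048, 65537);

    clock_t end = clock() + 3 * CLOCKS_PER_SEC;
    while (clock() < end) {
        mbedtls_rsa_pkcs1_sign(&rsa, mbedtls_ctr_drbg_random, &drbg,
                               MBEDTLS_MD_SHA256, sizeof(hash), hash, sig);
        n++;
    }
    printf("~%.1f RSA-2048 signatures/s\n", n / 3.0);

    mbedtls_rsa_free(&rsa);
    mbedtls_ctr_drbg_free(&drbg);
    mbedtls_entropy_free(&entropy);
    return 0;
}
```

If the signature really dominates the handshake, the handshakes-per-second figures reported above should track this number closely.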

ishkhan42 commented 1 year ago

Is it unreasonable to expect mbedtls to be faster than openssl?

To help identify the issue, I have run several additional tests on two machines (see the tables below). It appears that there is a problem when the client runs on machine B against the MbedTLS server; when the server is switched to OpenSSL, the same client on machine B produces a much higher request rate.

| Machine | Processor | OS | Kernel |
| --- | --- | --- | --- |
| A | AMD EPYC 7302 16-Core | Ubuntu 21.04 64bit | Linux 5.11.0-16-generic |
| B | AMD Ryzen Threadripper PRO 3995WX 64-Cores | Ubuntu 22.04 64bit | Linux 5.15.0-70-generic |

| Test | Server Configuration | Client Configuration | Requests per second |
| --- | --- | --- | --- |
| 1 | MbedTLS on A | OpenSSL on A | 115 |
| 2 | MbedTLS on B | OpenSSL on B | 20 |
| 3 | MbedTLS on A | OpenSSL on B | 19 |
| 4 | MbedTLS on B | OpenSSL on A | 215 |
| 5 | OpenSSL on B | OpenSSL on B | 406 |
| 6 | OpenSSL on A | OpenSSL on A | 314 |
| 7 | OpenSSL on B | OpenSSL on A | 415 |
| 8 | OpenSSL on A | OpenSSL on B | 288 |

gilles-peskine-arm commented 1 year ago

> Is it unreasonable to expect mbedtls to be faster than openssl?

In general, I'd say yes, actually, because the two projects have different emphases. Mbed TLS has a more stringent threat model: we aim to protect against local timing attacks whereas OpenSSL doesn't. Mbed TLS is more targeted at resource-constrained devices and so we spend more effort on code size and portability and less on platform-specific optimizations.

As for the difference between A and B, it's plausible that the OpenSSL versions (1.1.1 vs 3.0.2) lead to selecting different protocol versions or cipher suites. Though I'd expect the RSA private key operation to dominate performance in all cases, based on your measurements. (Tip: with an ECC (ECDSA) certificate, the server handshake performance is a lot better than with RSA; the client performance is worse but the total CPU cost over both sides is significantly less with ECC.)
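
To put a number on that tip, the same kind of micro-benchmark can be run with an ECDSA key through the pk layer. A sketch, assuming P-256 and a 3-second measurement window (both arbitrary choices), for comparison against the RSA figure measured the same way:

```c
/* Hypothetical counterpart to an RSA signing benchmark: ECDSA P-256
 * signatures per second via the mbedtls pk layer. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include "mbedtls/pk.h"
#include "mbedtls/ecp.h"
#include "mbedtls/entropy.h"
#include "mbedtls/ctr_drbg.h"

int main(void)
{
    mbedtls_pk_context pk;
    mbedtls_entropy_context entropy;
    mbedtls_ctr_drbg_context drbg;
    const char *pers = "ecdsa_bench";
    unsigned char hash[32] = { 0 };  /* stand-in for a SHA-256 digest */
    unsigned char sig[MBEDTLS_PK_SIGNATURE_MAX_SIZE];
    size_t sig_len;
    int n = 0;

    mbedtls_pk_init(&pk);
    mbedtls_entropy_init(&entropy);
    mbedtls_ctr_drbg_init(&drbg);
    mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy,
                          (const unsigned char *) pers, strlen(pers));

    mbedtls_pk_setup(&pk, mbedtls_pk_info_from_type(MBEDTLS_PK_ECKEY));
    mbedtls_ecp_gen_key(MBEDTLS_ECP_DP_SECP256R1, mbedtls_pk_ec(pk),
                        mbedtls_ctr_drbg_random, &drbg);

    clock_t end = clock() + 3 * CLOCKS_PER_SEC;
    while (clock() < end) {
        mbedtls_pk_sign(&pk, MBEDTLS_MD_SHA256, hash, sizeof(hash),
                        sig, sizeof(sig), &sig_len,
                        mbedtls_ctr_drbg_random, &drbg);
        n++;
    }
    printf("~%.1f ECDSA P-256 signatures/s\n", n / 3.0);

    mbedtls_pk_free(&pk);
    mbedtls_ctr_drbg_free(&drbg);
    mbedtls_entropy_free(&entropy);
    return 0;
}
```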

ishkhan42 commented 1 year ago

I've done more tests on three machines where the OpenSSL version is > 3.0.0, and on all of them the mbedtls server managed roughly 16-23 req/s: about 16 req/s on a very low-end CPU, and about 23 req/s on mid- and high-end CPUs.

However, even when the OpenSSL version is 1.1.1, mbedtls performed poorly on lower-end CPUs, with the same low request rate. On higher-end CPUs with OpenSSL 1.1.1, mbedtls can reach up to 100 req/s.

Therefore the degraded performance is affected by the OpenSSL version, but not always: on a very low-end CPU I can get about 40 times better performance with OpenSSL, even when the version is 1.1.1.

Isn't 40x slower a problem?

gilles-peskine-arm commented 1 year ago

40x slower is a problem, but it's not clear to me that it's a problem that the Mbed TLS team will try to address. We know that there's a reason we find acceptable (different security and portability policies) for part of the difference. Another part of the difference is something we don't yet know how to reproduce, and where it isn't even clear that the problem is in Mbed TLS rather than in OpenSSL.

Even if we do manage to identify a problem in Mbed TLS, performance is less important to us than code size, and RSA performance is less important to us than ECC performance, based on how and where Mbed TLS tends to be used. So I'm afraid this would be a relatively low priority concern.

ashvardanian commented 1 year ago

@gilles-peskine-arm 40x slowdown sounds horrifying. Any way this can be prioritized?

gilles-peskine-arm commented 1 year ago

@ashvardanian From our perspective, it's not at all clear that there's a 40x slowdown in our project, and we can't debug the original submitter's system. If we had a clear way to reproduce a 40x speed difference between two comparable things, then we'd try to improve it.

But here we're confronted with an investigation where we can't reproduce this 40x ratio, and the closest we can get to reproducing involves a lot of moving parts including two machines' network stack. For all we know, the main factor could be some network setting about packet timing or port reuse and not the cryptographic computation. Or (re-reading this I think we haven't ruled it out) it could be that the negotiation ends up selecting different cryptographic parameters.

We are not actively working on this issue. From our perspective, it's a major project: setting up a complex benchmarking system in the hope that we can actually reproduce the issue, then analyzing the performance differences to find the root cause. We simply don't have time for this.

If this problem affects you and you want us to prioritize this, please give us a reproducible way to trigger the problem. For example: when the client is mbedtls with such-and-such configuration and the server is openssl with such-and-such configuration, then mbedtls picks this ciphersuite, but an openssl client would pick that ciphersuite which has better performance. Or: mbedtls does N RSA signatures per second on this hardware, but openssl does 40*N for the same key on the same hardware. Then this issue would have a precise objective, and would be something we can realistically work on.