Mbed-TLS / mbedtls

An open source, portable, easy to use, readable and flexible TLS library, and reference implementation of the PSA Cryptography API. Releases follow a varying cadence, typically 3-6 months apart.
https://www.trustedfirmware.org/projects/mbed-tls/

Performance regression in mbedtls_mpi_exp_mod() (v3.6.0) #9232

Open jforissier opened 3 weeks ago

jforissier commented 3 weeks ago

We observed that the mbedtls_mpi_exp_mod() function takes much more time to complete in v3.6.0 than in v3.5.2 (about 16 times more). The platform is OP-TEE OS running in QEMU (arm32). See https://github.com/OP-TEE/optee_os/pull/6797 and in particular this commit which brings the function back to the v3.5.2 and solves the issue: https://github.com/jforissier/optee_os/commit/f147fc17b69efc9f22d122466d7b6b9678a6ad61

Any idea what's wrong?

yanesca commented 3 weeks ago

Hi @jforissier, we have switched to a different algorithm to improve security (constant-time properties). There is some performance cost to this change, but a 16x slowdown is much larger than we expected.

The new algorithm uses memory differently from the old one, and a window size that was optimal for the old algorithm on a given platform might be inefficient for the new one. Does it help if you change the window size (MBEDTLS_MPI_WINDOW_SIZE)?
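For anyone trying this, the window size is a compile-time option; a hedged sketch of how it can be overridden (the value shown is purely illustrative, and the optimum is platform-dependent):

```c
/* In your Mbed TLS configuration (e.g. mbedtls_config.h), override the
 * exponentiation window size before building. Larger windows trade memory
 * for fewer multiplications; the best value depends on the platform and
 * on the algorithm in use, so it is worth benchmarking several values. */
#define MBEDTLS_MPI_WINDOW_SIZE 3
```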

jforissier commented 3 weeks ago

Hi @yanesca, thanks for the advice. Unfortunately it has little impact: with MBEDTLS_MPI_WINDOW_SIZE set to 1, 2, 3, 6 and 7, my test takes 34, 25.6, 23.2, 22 and 21.4 seconds respectively. With v3.5.7 it takes 1.2 seconds.

mpg commented 2 weeks ago

Hi @jforissier, did you also measure performance on real hardware? I tried it on two boards, based on Cortex-M0 and Cortex-M4 cores, and the performance regression compared to 2.25 (the "old" version that was most convenient for me to test as it's the one in Mbed OS) seemed much less drastic.

STM NUCLEO F091RC (M0 48MHz)

1024-bit 2.25 exp_mod, cold r: 2150 ms
1024-bit 2.25 exp_mod,  hot r: 2130 ms
1024-bit 3.6  exp_mod, cold r: 2584 ms
1024-bit 3.6  exp_mod,  hot r: 2564 ms

Freescale FRDM-K64F (M4 120MHz)

1024-bit 2.25 exp_mod, cold r: 248 ms
1024-bit 2.25 exp_mod,  hot r: 244 ms
1024-bit 3.6  exp_mod, cold r: 319 ms
1024-bit 3.6  exp_mod,  hot r: 315 ms

Here the new code is only about 30% slower, which is very different from the factor of 16 you're observing. On my laptop (x86-64 CPU), the difference between 3.5.1 (with its default config) and 3.6.0 (with its default config) is too small to be reliably visible using the simple benchmark below.

#include "mbedtls/bignum.h"
#include "bignum_core.h"
#include <stdio.h>
#include <time.h>

#define TIMEIT(TITLE, CODE)                                         \
do {                                                                \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t start = clock();                                        \
    for (size_t i = 0; i < 1000; i++) {                             \
        CODE;                                                       \
    }                                                               \
    clock_t end = clock();                                          \
    double duration = (double)(end - start) / CLOCKS_PER_SEC;       \
    /* seconds for 1000 runs == milliseconds per run */             \
    printf("%s: %f ms\n", TITLE, duration);                         \
} while (0)

// python: hex(random.randrange(2 ** 1024))
const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

int main(void) {
    mbedtls_mpi x, a, e, n, r;

    mbedtls_mpi_init(&x); mbedtls_mpi_init(&a); mbedtls_mpi_init(&e);
    mbedtls_mpi_init(&n); mbedtls_mpi_init(&r);

    mbedtls_mpi_read_string(&a, 16, A);
    mbedtls_mpi_read_string(&e, 16, E);
    mbedtls_mpi_read_string(&n, 16, N);

    TIMEIT("1024-bit exp_mod, cold r", mbedtls_mpi_free(&r); mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
    TIMEIT("1024-bit exp_mod,  hot r", mbedtls_mpi_exp_mod(&x, &a, &e, &n, &r));
}

There must be some difference between your setup and the ones I tried that explains the large difference. Could it be:

That's all I can think of for now. Please let us know if you can test some of these ideas, and/or have others.

mpg commented 2 weeks ago

FWIW, I've also done a few measurements on A-class (v7 and v8) and in both cases it looks like the change of the default window size was enough to compensate for the new, more secure algorithm being a bit slower.

v7-A (Cortex-A7)

3.5.1
1024-bit exp_mod, cold r: 40.122913 ms
1024-bit exp_mod,  hot r: 39.805613 ms

3.6.0
1024-bit exp_mod, cold r: 38.286153 ms
1024-bit exp_mod,  hot r: 38.024482 ms

v8-A (eMag)

3.5.1
1024-bit exp_mod, cold r: 1.776005 ms
1024-bit exp_mod,  hot r: 1.763511 ms

3.6.0
1024-bit exp_mod, cold r: 1.587849 ms
1024-bit exp_mod,  hot r: 1.576246 ms

mjelintacharge commented 2 weeks ago

@mpg I'm seeing a very big performance impact on STM32F207 devices (120 MHz Cortex-M3 core). I have tried everything, even compiling with -O3, but the performance is still horrible.

The small (or rather, "bearable") performance difference only appears with certain large exponents. The problem reveals itself when doing math with some popular public RSA exponents, such as 0x10001 (65537).

Here are some benchmarks using the test code that you have provided:

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "8e51ee5226332eea3e694e62919fe8a3e2c6be653cd63d5c87027848636293dfe42025ab64df0f75cacab381400e4b734b08b4cbb1c4a28f699ae81dab87838f501f090e72e5d546f8de82f273416c204fa66176c0b48aea5df5cb081b80b56e63e3bfa545157b259baeebcdbf50b7299953440a681f414ea4cfca9d8f2c0b91";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod,  hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 488.300000 ms
1024-bit exp_mod,  hot r: 485.300000 ms

~23% difference, seems acceptable?

const char *A = "49dd4a46f532d1e3c395875d0f43747367e1fe078ac571a541e17c1e1a505cbb9df94e8fbe4b6acee05134a3ccb8780cbf19abded6dee43af23c547046ab309090eccfd8d837ce33b7489421c9a93201b7598f32bf4d97de5302604d581e31e1d1f786f826d00678c62222d0ed97910a786a9ca2a24cd768ee4241e761d71494";
const char *E = "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010001";
const char *N = "ff3f5df10b8db6aab53eb55c3c979ce9250287fb48ac031b650ecbe35a42b2f0d2edfdb2252023e0b5769574ba0d6e7a5dec6989b75c82bc0364a0f24c7e3acdd12720573b8bdbd879d65cf3d8fb7ae9324774bb910aac64dc9f233ff58472809ec260089ef66b304368e86b1d159ce330867c3c49758757305f3de52bf958a7";

V3.6 Bignum:
1024-bit exp_mod, cold r: 630.500000 ms
1024-bit exp_mod,  hot r: 627.500000 ms

V3.5.2 Bignum:
1024-bit exp_mod, cold r: 10.200000 ms
1024-bit exp_mod,  hot r: 7.200000 ms

~63 times slower?!!!

yanesca commented 2 weeks ago

@mjelintacharge Does the performance regression persist if you define E as

const char *E = "10001";

or alternatively if you call mbedtls_mpi_shrink() on E before using it?

mjelintacharge commented 2 weeks ago

> @mjelintacharge Does the performance regression persist if you define E as
>
> const char *E = "10001";
>
> or alternatively calling mbedtls_mpi_shrink() on E before using it?

Using mbedtls_mpi_shrink() does indeed help. However, even with it I'm only getting ~100 ms per operation versus ~10 ms on 3.5.2, which is still a very significant slowdown.

mpg commented 1 week ago

Thanks for the new data. This makes a lot of sense: we made our exp_mod function constant-time in 3.6, which is needed for the security of private operations: it should not leak any information about the private exponent through timing (except its number of limbs, which is public anyway). And indeed, in this message we can observe that in 3.6 we get exactly the same timing whether e is a 17-bit value or a full 1024-bit value, which is comforting from a security perspective: our function behaves as expected :)

But of course the public operations don't need to protect the exponent that way, as it's a public value. So, we should avoid wasting time protecting it. There's obviously a trade-off between performance of public operations and code size here: for code size we want the functions used for public and private operations to be as similar as possible, but for optimal performance we'd need them to be pretty different... We're exploring the solution space and will keep you posted.

mpg commented 1 week ago

I'm labelling this "bug" and adding it to the 3.6.1 EPIC because I don't think a 10x regression on RSA public-key operations is OK (and even more so if for some reason e has more limbs than necessary). RSA public operations are expected to be fast (that's one notable difference from ECDSA, for example, where signature generation can be faster than verification).

Quick results on my laptop (64-bit Intel):

3.5.1

1024/1024-bit exp_mod, cold r: 1.071714 ms
1024/1024-bit exp_mod,  hot r: 1.083624 ms
  17/1024-bit exp_mod, cold r: 0.022744 ms
  17/1024-bit exp_mod,  hot r: 0.016366 ms

3.6.0

1024/1024-bit exp_mod, cold r: 0.961081 ms
1024/1024-bit exp_mod,  hot r: 1.050500 ms
  17/1024-bit exp_mod, cold r: 0.100309 ms
  17/1024-bit exp_mod,  hot r: 0.177438 ms

Typical public RSA operations are the last line (17-bit exponent, hot r).

mpg commented 1 week ago

See first draft of an attempt to reduce the regression: https://github.com/Mbed-TLS/mbedtls/pull/9281