OP-TEE / optee_os

Trusted side of the TEE

ECDSA_P256 performance (very low?) #3579

Closed: Raincode closed this issue 4 years ago

Raincode commented 4 years ago

Hi,

I am running some performance measurements for ECDSA using the NIST P-256 curve. When using the TEE Internal Core API, I achieve only one to two operations per second (!), which is very slow.

I would like to know whether anyone else gets different results. The command used was

xtest --crypto-perf -alg ECDSA_P256 -k 256 -n 50

I used a custom test application and got the same results.

Thank you in advance!

P.S.: I used mbedTLS, too, and compared it to running in the GNU/Linux normal world:

mbedTLS achieves about 230/70 sign/verify operations per second. When running the same code within OP-TEE (in a TA), I achieve about 27/7 sign/verify per second, so roughly 10x slower. Any ideas what the issue could be?

I compile with platform-cflags-optimization = -O2

and I commented out the line thread_set_foreign_intr(true);

in core/arch/arm/tee/entry_std.c, so that the secure-side benchmark loop is not preempted by normal-world (foreign) interrupts.

jforissier commented 4 years ago

Where did you get xtest from? Mine certainly doesn't have the --crypto-perf option :open_mouth:

Raincode commented 4 years ago

Ooops... I wasn't aware of that. I have a port for my specific development board: https://source.codeaurora.org/external/imx/imx-optee-test

I guess that very quickly limits the possibilities of comparing results lol. In that case I will likely close the issue again... Unless anyone has ideas where specific performance bottlenecks could be happening or how to debug such bottlenecks (starting from the TEE_AsymmetricVerifyDigest call)

jforissier commented 4 years ago

Ooops... I wasn't aware of that. I have a port for my specific development board: https://source.codeaurora.org/external/imx/imx-optee-test

I can only see libteec and tee-supplicant source at this URL!?!

I guess that very quickly limits the possibilities of comparing results lol.

Obviously, yes :wink: But I wouldn't mind trying your test case on one of my boards (HiKey960 or HiKey 620), if only I could build it here.

In that case I will likely close the issue again... Unless anyone has ideas where specific performance bottlenecks could be happening or how to debug such bottlenecks (starting from the TEE_AsymmetricVerifyDigest call)

I have no idea off the top of my head... Perhaps the function tracer could come in handy here? https://optee.readthedocs.io/en/latest/debug/ftrace.html

Raincode commented 4 years ago

My bad, I have to select a different branch/tag: https://source.codeaurora.org/external/imx/imx-optee-test/tree/?h=rel_imx_4.14.98_2.0.0_ga

It appears ftrace even displays individual runtimes, like a profiler. If it's not too difficult to set up, this could be very useful!

It must be something that affects both mbedTLS and the Internal Core API's cryptographic provider implementation, i.e. libtomcrypt. I will update if I find any new results. I don't think it's the random number generation, because to my knowledge ECDSA verification does not require a random number. Additionally, I remember doing a SHA-256 benchmark with mbedTLS (in GNU/Linux and OP-TEE) and the performance was identical, so the bottleneck seems ECDSA-specific as well. Also, the mbedTLS source code is identical for the GNU/Linux build and the copy included in OP-TEE OS.

Edit: I looked into this, but it seems my release ships a relatively old OP-TEE version (dating back to late 2018); ftrace was probably not integrated yet at that time. As I understand it, OP-TEE depends on the Linux kernel version, so an upgrade would not be trivial. I will try using gprof instead.

I tried gprof now, but the time % column is always 0.00 for all functions. Could this be because I compiled OP-TEE OS with foreign interrupts disabled, or should it still work regardless?
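For reference, I enabled it roughly like this (a sketch; as far as I can tell, CFG_TA_GPROF_SUPPORT is the optee_os flag for gprof support, and the TA itself is instrumented with -pg):

make CFG_TA_GPROF_SUPPORT=y CFLAGS_ta_arm32="-pg"

A gmon-style output should then be written out via tee-supplicant when the TA exits (exact file name/location depends on the setup).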

@jforissier: Interestingly, I found your gist here: https://gist.github.com/jforissier/91da987925f2e9a7e86a0a84f1f25909

It describes a nearly identical problem, but I cannot find any reference to it.

Edit 2: Whether foreign interrupts were disabled or not made no difference to my gprof issues.

For the record, here is a fork of optee-examples with my test-code: https://github.com/Raincode/optee_examples/tree/master/ecdsa_perf
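In essence, the host side opens a session and issues a single TEEC_InvokeCommand that runs the entire measurement loop inside the TA. A trimmed-down sketch (the UUID and command ID below are placeholders; the real ones are in the TA's header):

#include <err.h>
#include <stdint.h>
#include <tee_client_api.h>

/* Placeholder IDs for illustration only */
#define TA_ECDSA_PERF_UUID { 0x00000000, 0x0000, 0x0000, \
        { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 } }
#define CMD_ECDSA_VERIFY_LOOP 0

int main(void)
{
        TEEC_Context ctx;
        TEEC_Session sess;
        TEEC_Operation op = { 0 };
        TEEC_UUID uuid = TA_ECDSA_PERF_UUID;
        uint32_t origin;

        if (TEEC_InitializeContext(NULL, &ctx) != TEEC_SUCCESS)
                errx(1, "TEEC_InitializeContext failed");
        if (TEEC_OpenSession(&ctx, &sess, &uuid, TEEC_LOGIN_PUBLIC,
                             NULL, NULL, &origin) != TEEC_SUCCESS)
                errx(1, "TEEC_OpenSession failed");

        /* One invoke runs the whole loop in the TA, so world-switch
         * overhead is excluded from the measured numbers */
        op.paramTypes = TEEC_PARAM_TYPES(TEEC_VALUE_INPUT, TEEC_NONE,
                                         TEEC_NONE, TEEC_NONE);
        op.params[0].value.a = 50; /* iterations */
        if (TEEC_InvokeCommand(&sess, CMD_ECDSA_VERIFY_LOOP, &op,
                               &origin) != TEEC_SUCCESS)
                errx(1, "TEEC_InvokeCommand failed");

        TEEC_CloseSession(&sess);
        TEEC_FinalizeContext(&ctx);
        return 0;
}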

jforissier commented 4 years ago

The test didn't work for me until I applied the following fix:

diff --git a/host/xtest/crypto_perf.c b/host/xtest/crypto_perf.c
index 8db8f593..ce50e154 100644
--- a/host/xtest/crypto_perf.c
+++ b/host/xtest/crypto_perf.c
@@ -684,7 +684,7 @@ static int asym_digest(int verbosity, uint32_t alg_id,
                TEEC_VALUE_OUTPUT,
                TEEC_NONE);
        op.params[0].tmpref.buffer = (char *)hash_name[hash_id];
-       op.params[0].tmpref.size   = strlen(hash_name[hash_id]);
+       op.params[0].tmpref.size   = strlen(hash_name[hash_id]) + 1;

        op.params[1].value.a = 0; /* Not used */

@@ -767,7 +767,7 @@ static int prepare_algo(struct test_param *test,
                TEEC_VALUE_OUTPUT,
                TEEC_NONE);
        op.params[0].tmpref.buffer = test->alg;
-       op.params[0].tmpref.size   = strlen(test->alg);
+       op.params[0].tmpref.size   = strlen(test->alg) + 1;

        op.params[1].value.a = test->keysize;
        op.params[1].value.b = test->bufsize;

(The fix makes the size include the string's NUL terminator, which the TA-side parsing apparently expects. I would happily submit it to the original repo, but I have no idea how to do it.)

With this, and on the master branch, my results are:

root@HiKey960:/ # HiKey960 A73 cores only:
root@HiKey960:/ # for i in 4 5 6 7; do echo 1 >/sys/devices/system/cpu/cpu$i/online; done
root@HiKey960:/ # for i in 0 1 2 3; do echo 0 >/sys/devices/system/cpu/cpu$i/online; done
...
root@HiKey960:/ # xtest --crypto-perf -alg ECDSA_P256 -k 256 -n 50

Test Result for ECDSA_P256
** Preparation (Key Generation) **
time=76571.4μs
** Encryption direction **
min=76269.8μs max=77299.5μs mean=76709.1μs stddev=241.963μs (0.000248647 MiB/s)
** Decryption direction **
min=50989.6μs max=51559.9μs mean=51193.5μs stddev=138.999μs (0.00156482 MiB/s)

root@HiKey960:/ # HiKey960 A53 cores only:
root@HiKey960:/ # for i in 0 1 2 3; do echo 1 >/sys/devices/system/cpu/cpu$i/online; done
root@HiKey960:/ # for i in 4 5 6 7; do echo 0 >/sys/devices/system/cpu/cpu$i/online; done
...
root@HiKey960:/ # xtest --crypto-perf -alg ECDSA_P256 -k 256 -n 50

Test Result for ECDSA_P256
** Preparation (Key Generation) **
time=159828μs
** Encryption direction **
min=159716μs max=160370μs mean=160007μs stddev=135.187μs (0.000119204 MiB/s)
** Decryption direction **
min=108724μs max=108767μs mean=108751μs stddev=8.84073μs (0.000736622 MiB/s)

That is about 20/13 decrypt/encrypt per second on the A73 cores and 9/6 on the A53s. Same order of magnitude as your results at least...

Raincode commented 4 years ago

@jforissier Wow, thank you for running those tests and sharing your results!

This is quite interesting, because running openssl speed ecdsap256 gives me roughly 300 verifications/s (the signing rate is even higher, because OpenSSL can do some pre-computation on the curve that accelerates signing by a factor of approx. 4x).

I'm not surprised that OpenSSL is highly optimized, but the difference is still quite large (two orders of magnitude). I'm especially struggling to understand why using the same library (mbedTLS) in GNU/Linux and in OP-TEE shows a 10x performance difference for ECDSA (it is important to note that I only perform a single TEEC_InvokeCommand to run the entire performance test, i.e., context-switch overhead is excluded).
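For the GNU/Linux side, my baseline measurement is essentially this (a minimal sketch against the mbedTLS 2.x API; error checking omitted, link with -lmbedcrypto):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <mbedtls/ctr_drbg.h>
#include <mbedtls/ecdsa.h>
#include <mbedtls/entropy.h>

int main(void)
{
        mbedtls_ecdsa_context ecdsa;
        mbedtls_entropy_context entropy;
        mbedtls_ctr_drbg_context drbg;
        unsigned char hash[32] = { 0x2a }; /* dummy digest */
        unsigned char sig[MBEDTLS_ECDSA_MAX_LEN];
        size_t sig_len = 0;
        const char *pers = "ecdsa_bench";
        struct timespec t0, t1;
        int i, n = 50;

        mbedtls_ecdsa_init(&ecdsa);
        mbedtls_entropy_init(&entropy);
        mbedtls_ctr_drbg_init(&drbg);
        mbedtls_ctr_drbg_seed(&drbg, mbedtls_entropy_func, &entropy,
                              (const unsigned char *)pers, strlen(pers));
        mbedtls_ecdsa_genkey(&ecdsa, MBEDTLS_ECP_DP_SECP256R1,
                             mbedtls_ctr_drbg_random, &drbg);
        mbedtls_ecdsa_write_signature(&ecdsa, MBEDTLS_MD_SHA256,
                                      hash, sizeof(hash), sig, &sig_len,
                                      mbedtls_ctr_drbg_random, &drbg);

        /* Time n verifications of the same signature */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++)
                mbedtls_ecdsa_read_signature(&ecdsa, hash, sizeof(hash),
                                             sig, sig_len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%d verifies in %.3f s\n", n,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}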

At least I know my results are more or less reproducible, and there is not some fundamental error in my setup. Thanks again!

jenswi-linaro commented 4 years ago

@Raincode one thing to note is that we're not linking OP-TEE with libgcc so that could have some impact.

Raincode commented 4 years ago

@Raincode one thing to note is that we're not linking OP-TEE with libgcc so that could have some impact.

I'm not sure what the implications of this would be. Do you mean that if calls to libgcc are made, they may be implemented differently than in a regular GNU/Linux application (and thus potentially perform differently)?

On a side note, I tested other things as well to form a baseline (such as AES and SHA). Interestingly, when I ran the identical AES test using mbedTLS in GNU/Linux and in OP-TEE, the throughput was practically identical. Furthermore, by selectively disabling certain parts of the mbedTLS and libtomcrypt source code compiled into optee_os, I came to the conclusion that indeed the actual ECC computations seem to be the bottleneck (as opposed to some other steps happening before/after).

I thought about cache misses somehow playing a role, but then again, why would there be such a significant difference running on the same CPU...

jforissier commented 4 years ago

I came to the conclusion that indeed the actual ECC computations seem to be the bottleneck (as opposed to some other steps happening before/after)

That would be my conclusion too. I have run the test case with function tracing enabled (make -j10 CFG_FTRACE_SUPPORT=y CFG_SYSCALL_FTRACE=y CFLAGS_ta_arm32="-O0 -pg" CFG_FTRACE_BUF_SIZE=10000000 CFG_FTRACE_BUF_WHEN_FULL=wrap) and nothing stands out... Just a lot of small operations performing multiplications, modular reductions, etc. FYI, here is what it looks like: ftrace-cb3e5ba0-adf1-11e0-998b-0002a5d5c51b.txt.

It is a partial trace though (the tracing buffer is too small to hold the full trace), so I might have missed something.

Perhaps some similar tracing of your normal world reference code run could help compare things?
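For the normal world run, plain Linux perf should give a comparable call profile, e.g. (benchmark binary name hypothetical):

perf record -g ./ecdsa_bench
perf report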

I thought about cache misses somehow playing a role, but then again, why would there be such a significant difference running on the same CPU...

Yeah, could NW and SW be configured differently wrt. cache perhaps? Or speculative execution, etc.?

jenswi-linaro commented 4 years ago

If it's caches, speculation and so on then it should be visible in normal code too. If this problem could be reproduced with some simpler and more isolated code it should be much easier to analyze.

Raincode commented 4 years ago

@jforissier thank you again for your support and for providing the ftrace output. What I additionally find interesting are the many memset/grow/copy/free/mempool etc. operations. I only have a very basic understanding of ECC, but it seems there shouldn't be so many memory operations required.

Now I just had an idea: maybe lower RAM availability inside the TA (stack/heap) could be the cause. Edit: I just tried doubling the heap size from 128 KiB to 256 KiB and couldn't observe any difference.
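(Concretely, I bumped the heap in the TA's user_ta_header_defines.h, following the optee_examples template; a sketch, values are mine:)

/* user_ta_header_defines.h */
#define TA_STACK_SIZE (2 * 1024)
#define TA_DATA_SIZE  (256 * 1024) /* TA heap, doubled from 128 KiB */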

Yeah, could NW and SW be configured differently wrt. cache perhaps? Or speculative execution, etc.?

This is certainly an interesting aspect. It appears Armv8-A has Performance Monitors (PMU) and Statistical Profiling Extension registers which I could investigate further.

I'm not sure whether there are separate/banked performance-monitor configurations for the non-secure and secure states, but I will look into this and report back if I have any new results.
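If nothing else, reading the cycle counter directly in both worlds would make the comparison more direct. A sketch for AArch64, assuming the counter is enabled (PMCR_EL0.E, PMCNTENSET_EL0.C set) and EL0 access is granted via PMUSERENR_EL0, which may need kernel or secure-side setup:

#include <stdint.h>

/* Read the AArch64 PMU cycle counter */
static inline uint64_t read_cycle_counter(void)
{
        uint64_t c;

        __asm__ volatile("mrs %0, pmccntr_el0" : "=r"(c));
        return c;
}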

If it's caches, speculation and so on then it should be visible in normal code too. If this problem could be reproduced with some simpler and more isolated code it should be much easier to analyze.

If I understood @jforissier correctly, the idea was that maybe these things could have separate configurations for the two TrustZone worlds. I could definitely write a TA with just a single command computing a single ECDSA verification.
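Something along these lines, using the Internal Core API directly (a sketch; key material and error handling are placeholders, and sizes assume P-256 with a raw r||s signature):

#include <tee_internal_api.h>

/* Verify one precomputed SHA-256 digest against a 64-byte (r||s) signature */
static TEE_Result verify_once(const uint8_t x[32], const uint8_t y[32],
                              const uint8_t digest[32], const uint8_t sig[64])
{
        TEE_Attribute attrs[3];
        TEE_ObjectHandle key = TEE_HANDLE_NULL;
        TEE_OperationHandle op = TEE_HANDLE_NULL;
        TEE_Result res;

        TEE_InitValueAttribute(&attrs[0], TEE_ATTR_ECC_CURVE,
                               TEE_ECC_CURVE_NIST_P256, 0);
        TEE_InitRefAttribute(&attrs[1], TEE_ATTR_ECC_PUBLIC_VALUE_X,
                             (void *)x, 32);
        TEE_InitRefAttribute(&attrs[2], TEE_ATTR_ECC_PUBLIC_VALUE_Y,
                             (void *)y, 32);

        res = TEE_AllocateTransientObject(TEE_TYPE_ECDSA_PUBLIC_KEY, 256, &key);
        if (res)
                return res;
        res = TEE_PopulateTransientObject(key, attrs, 3);
        if (res)
                goto out;
        res = TEE_AllocateOperation(&op, TEE_ALG_ECDSA_P256,
                                    TEE_MODE_VERIFY, 256);
        if (res)
                goto out;
        res = TEE_SetOperationKey(op, key);
        if (res)
                goto out;

        /* This is the call whose latency we are chasing */
        res = TEE_AsymmetricVerifyDigest(op, NULL, 0, digest, 32, sig, 64);
out:
        if (op != TEE_HANDLE_NULL)
                TEE_FreeOperation(op);
        TEE_FreeTransientObject(key);
        return res;
}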

jforissier commented 4 years ago

If it's caches, speculation and so on then it should be visible in normal code too. If this problem could be reproduced with some simpler and more isolated code it should be much easier to analyze.

If I understood @jforissier correctly, the idea was that maybe these things could have separate configurations for the two TrustZone worlds.

That's what I had in mind, yes, but I would trust Jens more on this kind of thing ;-)

Raincode commented 4 years ago

Unfortunately, I currently don't have the time/resources to investigate this topic in more depth. To summarize:

There seems to be a performance bottleneck for ECDSA_P256 operations on Cortex-A53 cores executing in S-EL1, independent of the software crypto library (mbedTLS / libtomcrypt) and of the SoC (i.MX 8M and HiKey).

Thank you for your advice and support on this matter.