briansmith / crypto-bench

Benchmarks for crypto libraries (in Rust, or with Rust bindings)

Add Ed25519 verification benchmark, test file #28

Closed: ranweiler closed this issue 4 years ago

ranweiler commented 7 years ago

See: #25.

ranweiler commented 7 years ago

Right now, this implementation has a few issues:

  1. It takes a long time to run (almost 7 minutes on my machine)
  2. Individual test cases do not constitute individual benchmarks
  3. We have pressure to keep crypto-bench's copy of the test file in sync with ring's.

I think (2) would be best addressed by generating the bench functions using a procedural macro. Another (complementary) option is to manually split ed25519_tests.txt into separate files in a subfolder, and use a (regular) macro (manually invoked) to create benchmarks from each file. We could address (1) by choosing a "representative" subset of the existing test vectors, making the regular macro/file split approach more tractable.
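Roughly, the file-split-plus-macro option could look like the sketch below. The `verify_file` helper and the file paths are placeholders, and this assumes the nightly `test::Bencher` harness the crate's other benchmarks use:

```rust
#![feature(test)]
extern crate test;

use test::Bencher;

// Placeholder: parse one vector file and verify every entry in it. The real
// parsing/verification would live in the crate's shared support code.
fn verify_file(_path: &str) {
    // ... read the file, then verify each (public key, message, signature) triple ...
}

// Expand to one `#[bench]` function per manually split vector file.
macro_rules! ed25519_verify_bench {
    ($name:ident, $file:expr) => {
        #[bench]
        fn $name(b: &mut Bencher) {
            b.iter(|| verify_file($file));
        }
    };
}

ed25519_verify_bench!(ed25519_verify_part_1, "ed25519/part_1.txt");
ed25519_verify_bench!(ed25519_verify_part_2, "ed25519/part_2.txt");
```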

Even if we could easily generate the 513 verification benchmarks that would correspond to each test vector in the ported file, it isn't clear that we'd want to. It would be an interesting experiment to use a script to generate a module with one bench function per test vector, run them, and see what the maximum performance discrepancy is, identifying some outlier inputs to use for future benchmarks. This would give us a principled notion of "representative", and ultimately address all of the enumerated issues.

However, I'm starting with this ~~"batch"~~ aggregate benchmark of all inputs, because it is at least comprehensive, and gives me data I can use in the work on briansmith/ring#479.
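Concretely, the aggregate benchmark amounts to something like this sketch. The `TestVector` struct and `load_vectors` loader are placeholders, and the verification call assumes a recent ring API:

```rust
#![feature(test)]
extern crate test;

use ring::signature::{self, UnparsedPublicKey};
use test::Bencher;

struct TestVector {
    public_key: Vec<u8>,
    msg: Vec<u8>,
    sig: Vec<u8>,
}

// Placeholder: parse the ported ed25519_tests.txt into verification triples.
fn load_vectors() -> Vec<TestVector> {
    unimplemented!("parse the test file")
}

// One benchmark iteration verifies every vector, so per-vector timing
// information is averaged away.
#[bench]
fn ed25519_verify_all_vectors(b: &mut Bencher) {
    let vectors = load_vectors();
    b.iter(|| {
        for v in &vectors {
            UnparsedPublicKey::new(&signature::ED25519, &v.public_key)
                .verify(&v.msg, &v.sig)
                .unwrap();
        }
    });
}
```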

isislovecruft commented 7 years ago

Hi @ranweiler!

The idea to generate all verification benchmarks as separate functions and then find the outlier inputs for future use is interesting, although I'm not entirely sure if the extra work would get us much in terms of comparing implementations. We would learn that the non-adjacent form of some scalar has more 1s than another, and thus that the variable-time scalar multiplication in verification takes longer for it. If they are all chosen randomly, this distribution of timings should be uniform. So on some machines, one input is going to be faster or slower than the same input on another machine, but the underlying distribution should still hold. If not, there is something weird happening. And, while computers can be weird machines, it seems unlikely we'll discover a weird machine within a cryptographic routine given only 513 static inputs. :)
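(For illustration only, here is roughly what "more 1s" means: a plain width-2 NAF recoding with its nonzero digits counted. Real verification code typically uses a wider window, but the cost model is the same, one extra point addition per nonzero digit.)

```rust
// Count the nonzero digits of the width-2 non-adjacent form of a small scalar.
// In a variable-time double-and-add, each nonzero digit costs an extra point
// addition, so denser NAFs verify slightly slower. (Illustrative only; assumes
// `k` has at least one spare high bit so the `+ 1` below cannot overflow.)
fn naf_nonzero_digits(mut k: u128) -> u32 {
    let mut count = 0;
    while k != 0 {
        if k & 1 == 1 {
            count += 1;
            // Digit is +1 if k ≡ 1 (mod 4), -1 if k ≡ 3 (mod 4); subtract it.
            if k % 4 == 1 { k -= 1 } else { k += 1 }
        }
        k >>= 1;
    }
    count
}

fn main() {
    // On average, roughly a third of a random scalar's NAF digits are nonzero.
    println!("{}", naf_nonzero_digits(0xDEAD_BEEF_CAFE_BABE));
}
```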

Is it necessary to use the test vectors file for benchmarks? If not, and with the assumptions above, it might be simpler (for batch verification) just to generate a keypair, randomly generate a batch of N short messages (where "short" means "smaller than SHA-512's block size of 128 bytes", so that the message hashing doesn't dominate the elapsed time), use the key to sign each message, and then verify the batch of messages. For single-signature verification benchmarks, one could instead generate N keypairs, sign N blank messages, and verify.
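(Something like the following sketch, say. It uses ring's current API for generation, signing, and verification, but any of the benchmarked implementations would do, and N and the message length are arbitrary.)

```rust
use ring::rand::{SecureRandom, SystemRandom};
use ring::signature::{self, Ed25519KeyPair, KeyPair, UnparsedPublicKey};

fn main() -> Result<(), ring::error::Unspecified> {
    let rng = SystemRandom::new();

    // One keypair, N short random messages (well under SHA-512's 128-byte
    // block size, so hashing doesn't dominate), and one signature per message.
    let pkcs8 = Ed25519KeyPair::generate_pkcs8(&rng)?;
    let key_pair =
        Ed25519KeyPair::from_pkcs8(pkcs8.as_ref()).map_err(|_| ring::error::Unspecified)?;

    const N: usize = 64;
    let mut signed = Vec::with_capacity(N);
    for _ in 0..N {
        let mut msg = [0u8; 32];
        rng.fill(&mut msg)?;
        let sig = key_pair.sign(&msg);
        signed.push((msg, sig));
    }

    // Only this verification loop would go inside the benchmark's iteration.
    let public_key =
        UnparsedPublicKey::new(&signature::ED25519, key_pair.public_key().as_ref());
    for (msg, sig) in &signed {
        public_key.verify(msg, sig.as_ref())?;
    }
    Ok(())
}
```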

(Or, alternatively, you could randomly select N cases from the test vectors, we'll all use those same N, and we'll all just assume that you selected randomly and that the original test inputs were chosen in an unbiased manner.)

briansmith commented 7 years ago

It takes a long time to run (almost 7 minutes on my machine)

It probably takes a long time because one cycle through the function takes a relatively long time, and Rust's benchmarking stuff has a minimum number of iterations. Also, we don't particularly care about the performance of the case where signature verification fails (right?); we can just assume it's not too bad. In practice this means we need to keep the number of test vectors small, performance-wise.

For this, I think we're not really interested in the best cases, but we are interested in the average cases and worst cases. I think most implementations are going to be using the same approach, but they might make different window size choices. So, I think we should calculate, a priori, the average and worst cases for a few likely window sizes, and then just choose test vectors that happen to have exactly those performance characteristics.
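(Back-of-the-envelope, using the textbook width-w NAF densities of roughly 1/(w+1) nonzero digits on average and roughly 1/w in the worst case, for a ~253-bit Ed25519 scalar. This is a rough cost model, not a measurement of any particular implementation.)

```rust
// Rough expected and worst-case counts of extra point additions in the main
// loop of a width-w NAF scalar multiplication with a ~253-bit scalar. These
// are standard density estimates, not measurements.
fn main() {
    let bits = 253.0_f64;
    for w in [4.0_f64, 5.0, 6.0] {
        println!(
            "w = {}: average ≈ {:.0} additions, worst case ≈ {:.0}",
            w,
            bits / (w + 1.0),
            bits / w
        );
    }
}
```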

briansmith commented 7 years ago

Also, ...

We have pressure to keep crypto-bench's copy of the test file in sync with ring's.

It's OK if they diverge. They probably should diverge because they have different use cases.

I think (2) would be best addressed by generating the bench functions using a procedural macro.

I don't want to use procedural macros or really anything fancy unless it is a big win.

IMO, these benchmarks shouldn't be too different from the other benchmarks, other than having multiple test vectors involved. I think we should find a way to keep the multiple-test-vectors aspect simple so that everything else stays simple.

briansmith commented 7 years ago

Finally, you use the word “batch” but I think you're not really trying to benchmark batch verification, right? So, we should just avoid the word "batch". I'd prefer to do single verification now and defer batch verification to later.

ranweiler commented 7 years ago

@isislovecruft:

[...] although I'm not entirely sure if the extra work would get us much in terms of comparing implementations.

Agreed, and the more implementations we add, the less it makes sense for this purpose.

Is it necessary to use the test vectors file for benchmarks?

Not in my mind. I wanted deterministic inputs, and it was convenient to just re-use the existing ring test infrastructure and vectors (since they were a source of "free" valid message/signature pairs). It was especially "obvious" in this context since they're derived from the Ed25519 regression test data.

randomly generate a batch of N short messages (where "short" means "smaller than SHA-512's block size of 128 bytes", so that the message hashing doesn't dominate the elapsed time), use the key to sign each message, and then verify the batch of messages.

I prefer this idea overall, especially since I don't know exactly how the test vectors were generated, and for our purposes, making sure we are benchmarking a random sample seems most important for measuring average behavior. I would want to generate a fixed set of messages in advance to minimize noise in comparisons (maybe this is what you meant). So, we could re-use the approach in this PR, but just generate and use a new file. What do you think?
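For example, a one-off generator along these lines could produce the file. The field names and output path are placeholders, not necessarily the format the existing parser expects; ring is used here only to produce valid signatures:

```rust
use std::fmt::Write as _;

use ring::rand::{SecureRandom, SystemRandom};
use ring::signature::{Ed25519KeyPair, KeyPair};

// Lowercase hex encoding, no separators.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().fold(String::new(), |mut s, b| {
        let _ = write!(s, "{:02x}", b);
        s
    })
}

fn main() -> std::io::Result<()> {
    let rng = SystemRandom::new();
    let mut out = String::new();

    for _ in 0..64 {
        let pkcs8 = Ed25519KeyPair::generate_pkcs8(&rng).unwrap();
        let key_pair = Ed25519KeyPair::from_pkcs8(pkcs8.as_ref()).unwrap();

        let mut msg = [0u8; 32];
        rng.fill(&mut msg).unwrap();
        let sig = key_pair.sign(&msg);

        // Placeholder field names; match whatever the benchmark's parser expects.
        let _ = writeln!(out, "PUB = {}", hex(key_pair.public_key().as_ref()));
        let _ = writeln!(out, "MESSAGE = {}", hex(&msg));
        let _ = writeln!(out, "SIG = {}\n", hex(sig.as_ref()));
    }

    std::fs::write("ed25519_verify_benches.txt", out)
}
```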

ranweiler commented 7 years ago

@briansmith:

I think you're not really trying to benchmark batch verification, right?

Oops, I definitely just meant "bunch of single verifications". I'll re-word the commit.

we don't particularly care about the performance of the case where signature verification fails (right?)

Agreed.

For this, I think we're not really interested in the best cases, but we are interested in the average cases and worst cases. I think most implementations are going to be using the same approach, but they might make different window size choices. So, I think we should calculate, a priori, the average and worst cases for a few likely window sizes, and then just choose test vectors that happen to have exactly those performance characteristics.

Agreed re emphasis on average and worst cases. In the interest of keeping things simple, and minding your earlier comment about adding more benchmarks as needed, what do you think about starting with a fixed set of randomly generated vectors, as discussed above?

briansmith commented 7 years ago

what do you think about starting with a fixed set of randomly generated vectors, as discussed above?

I think it is fine. If/when some implementers change their windowing schemes or whatnot, they can evaluate the fairness/sensibility of the test vectors and submit improvements if warranted.

briansmith commented 4 years ago

Thank you. Unfortunately, I just don't have time to work on this project so I'm going to stop accepting contributions for a while.