LoupVaillant / Monocypher

An easy to use, easy to deploy crypto library
https://monocypher.org
Other
594 stars 79 forks source link

Proper handling of non-portable optimisations #85

Closed LoupVaillant closed 6 years ago

LoupVaillant commented 6 years ago

Monocypher has two significant bottlenecks at the moment: Chacha20, and Argon2i. This makes libsodium 3.6 times as fast on authenticated encryption, and 57% faster at password hashing. The former is such a huge gap that many will take it as a deal breaker. The latter is not as brutal, but its incentive structure means performance directly impacts security.

So I've started an experimental speed branch to try and go beyond purely portable C code. So far, I've integrated the vectorised implementation of Chacha20 from Romain Dolebeau. The question is, how should we handle this code?

Right now, I (very crudely) rely on the __AVX2__ pre-processor definition, which on GCC is activated whenever the architecture supports AVX-2 instructions (modern Intel and AMD processors do). Obviously, this is not enough: part of the code is actually AVX compatible (only use 128-bit registers), and could be activated separately on AVX-2, AVX, MMX, and perhaps others I have yet to be aware of.

Still, those pre-processor options may not be exactly what we want. This is non-portable code we're talking about. Is it good practice to rely on those prep-processor definitions? If so, should we enable optimisations by default? If so, should we provide a way to disable it even for users who use -march=native?

My inclination would be to activate optimisations by default, so that users who compile non-portably (either enabling AVX or such, or by using -march=native) could benefit from the optimisations without thinking about it. Rationale being, if you compile non-portably, you're not portable anyway, so you might as well use those options to the fullest. I just fear unanticipated problems.

Oh, and I'd rather not have the user "initialise" the library like Libsodium does.

@CuleX , @mikejsavage, @secworks, @anywone, what do you think?

mikejsavage commented 6 years ago

from a games point of view:

people still run like, core 2, so we need to be able to fallback to sse2

i don't think you can tell gcc to only use avx2 instructions in certain parts of the code, and you don't want it introducing avx2 into the fallback code, so you have to compile your avx2 and fallback code as separate compilation units

so if you want to support games it needs to be in a separate file with a different name so we can pick the implementation at runtime

LoupVaillant commented 6 years ago

Can we leave that runtime selection to the game dev? Like, let them compile several versions of Monocypher with various -march=* options, then let them select the correct .so/.dll at runtime? Since the total binary size of Monocypher is only 65K at the most greedy optimisation options, this shouldn't be a problem, right?

(Also, consider that encryption for games is likely to be limited to network code. This should take very little computation even with the portable C implementation.)

By the way my port of the Dolbeau implementation is incomplete: I have omitted the SSE/AVX compatible quad rounds, which use 128-bit vectors, because I figured they would cost too much code (up to 160 lines) for too little performance (when AVX-2 is available). Still, we're looking at a 2X speedup on older processors. Now how many lines of code are we willing to pay for this? I'd rather not exceed 80.

mikejsavage commented 6 years ago

putting encryption code in a separate dll makes it too easy for people to reverse engineer the game. that said it's not very hard to pull the implementations out and rename them because monocypher is so small

(Also, consider that encryption for games is likely to be limited to network code. This should take very little computation even with the portable C implementation.)

true. clients send/receive trivial amounts of data and servers don't need fallback code

not sure SSE implementations are worth it then. at least I can't imagine people doing performance critical crypto work on 7 year old hardware

ghost commented 6 years ago

On Unportable Code in General

Maybe mistakenly, I was always under the impression that Monocypher is closer to TweetNaCl than libsodium in the sense that it's just drop in and go, no decisions necessary, no surprises on your average platform.

What you're doing now is giving not only choice about the functions to call, which admittedly the incremental interfaces stretched fairly much, but also a choice about how to deploy. You're adding actual ifdefs in the main code. Everything's going to spike in complexity. I'd like for you to keep in mind that more code is always a potential vector for failure, no matter how many tests you have. I'd like to remind you about the tragic death crypto_memcmp had to die.

I've never viewed Monocypher as an x86_64 library but rather as a C library, the tradeoff of being portable is not being the absoltue fastest.

Personally, I'm wholeheartedly opposed. However, if you believe this is the path forward, then it is.

On ifdefs

If you must, it would be more proper to avoid ifdefs and instead move out the unportable code into separate source code files. This goes completely against the single-file idea and the "easy to deploy" goal, however. You're now libsodium with less platforms and algorithms, just with more formal verification.

Optimizations by Default

If a user happens to compile code for a platform that supports the required inline assembly and you can detect it at compile-time, it should be the default. If you cannot detect it at compile-time, I would argue that ease of use on various platforms beats speed.

Disabling Optimizations

About whether there should be a way to disable the optimizations, I would generally say yes, there should be unless all assembly code is externalized to separate files. Inline assembly is incredibly dangerous and can lead to subtle bugs that rely on compiler-dependent behavior. I would be surprised if even half the users run the test suite, but if they do and detect an issue there, they should have a way to get back on the "known sane" path.

Games and DRM

Yes, Monocypher must not be built into a separate library. The symbol names would immediately give it away, too. Having function names and parameters (by virtue of just checking the code here) exposed makes things a lot easier. Encryption may be one of the cornerstones of DRM and communication with the game server. You don't want to give people easy outs, even if your crypto checks out. In particular, just hooking calls into a DLL and then reading plaintext communications would be a disaster. Ideally, these kinds of functions are obfuscated.

secworks commented 6 years ago

I agree 100% with @CuleX here.

Monocypher is still comparatively fast. And it is fast on different platforms (x86, ARM, probably RISC-V, AVR32). ChaCha20 encrypts/decrypts in hundreds of MB/s - GB/s speeds (depending on platform, clock speed). And passwpord hashing is done quite infrequently. So, yes there are quite a bit faster implementations in other libs. But would anybody really dismiss Monocypher because of this? Has that really happened? I'm doubtful a user would even note the processing time.

I would strongly advice not to start moving towards arch specific optimizations, ifdefs of the code etc. Esp not unless there are a lotof requests of doing so. No options, variants, simple, easy to use, portable C that protects against many typical mistakes is imho good goals for Monocypher.

konovod commented 6 years ago

Having a fast implementation is imho better than not having one (and necessity to switch to another library if more speed is required). So I don't really care if it will be enabled by default, or be in another branch, or even in a file "more_speed.c" like in older versions, but it would be nice to have it.

LoupVaillant commented 6 years ago

My, I didn't expect that push back. Looks like I've turned greedy. Okay, you win: Monocypher will stay portable. No architecture specific shenanigans, no #ifdef, no additional code path that would be hard to test… That experiment gave me valuable information (especially with respect to #87), but let's turn back before it is too late. Let Libsodium fill the high-performance niche it already does, and let Monocypher be the better TweetNaCl it was meant to be.

The problem with performance is that it is very easy to measure. This tends to create a streetlight effect, and tempt us to optimise for it even where it is not warranted. Daniel Bernstein said in the TweeNaCL paper that "Portability and small code size come at a loss in efficiency, but TweetNaCl is sufficiently fast for most applications." But Monocypher is 5 to 10 times faster than TweetNaCL across the board! Surely that should be fast enough?

@CuleX: Maybe mistakenly, I was always under the impression that Monocypher is closer to TweetNaCl than libsodium in the sense that it's just drop in and go, no decisions necessary, no surprises on your average platform.

That wasn't a mistake. We want to keep that. But working on Monocypher made me realise that TweetNaCl, in it's quest for conciseness, was sacrificing performance like nobody's business. Then I went too far, thinking I could achieve the best of both worlds. Still, I came close: Monocypher is just as easy and portable as TweetNaCl (minus C89), and its performance is actually closer to that of optimised Libsodium.

@secworks: But would anybody really dismiss Monocypher because of [speed]? Has that really happened? I'm doubtful a user would even note the processing time.

I've had people say that on Reddit and Hacker News a couple times. I recall at least 2 instances, maybe 3. Here's such a dismissal", and the voting pattern doesn't look good. (I understand performance is not the only reason, but still). That said, I suspect some of this comes from my unhelpful marketing: maybe I'm highlighting the performance difference too much. Maybe I'm ignoring the variety of platforms, and the inability of Libsodium to optimise for them. To be honest, a fair comparison of Monocypher and Libsodium would require a whole blog post. Maybe I should get those considerations out of the front page, and to a dedicated benchmark page. And maybe I should point out the massive performance gap between Monocypher and TweetNaCl, just so I can say "performance good hurr hurr!" (which it is, thank you for reminding me).

More important than current performance though, I don't want to hurt achievable performance (hence #87). I think it is important for users to know that using Monocypher won't block their path to faster solutions if they need them.

So. I'm closing this as "invalid". My mistake. Thank you all.

LoupVaillant commented 6 years ago

Oh, didn't see your comment, @konovod, welcome!

Indeed, despite having closed this bug, I am considering the possibility of providing separate, faster implementations of Monocypher. We could focus on x86_64 performance instead of portability and ease of deployment. And that would serve as a demonstration that performance can be ultra-competitive if it ever needs to be.

mikejsavage commented 6 years ago

despite what I said earlier I do agree that it doesn't belong in monocypher

tbh anyone running monocypher at the scale where 4x perf difference is a big deal (on x86 so it's not like it would be tiny embedded systems) is also going to have deep pockets

wait for someone to say it's seriously a problem (not just reddit/hn junk), and then ask them for money :)

LoupVaillant commented 6 years ago

@mikejsavage, you make sense. Reddit/HN commenters probably shouldn't have the weight I gave them (except Thomas Ptacek, but he's an exception—and never criticised the speed of Monocypher).