bitcoin-core / secp256k1

Optimized C library for EC operations on curve secp256k1
MIT License
2.09k stars 1.01k forks source link

secp256k1_scalar_check_overflow is not constant time on S390 #784

Open gmaxwell opened 4 years ago

gmaxwell commented 4 years ago

This is an odd architecture and is mostly interesting here just because its the only BE system available in the CI system.

It compiles the following simple code into variable time code:

    yes |= (a->d[0] >= SECP256K1_N_0) & ~no;

The same thing happens in the analogous part of the 32-bit version of the function.

I tried all manner of compiler switches and could only make the situation worse-- with other similar comparisons becoming variable time.

The issue is that the architecture has an instruction which works functionally like memcmp, and the compiler will sometimes emit it for seemingly arbitrary comparisons.

   0x00000000040483e8 <+296>:   b9 04 00 49 lgr %r4,%r9
   0x00000000040483ec <+300>:   b9 98 00 55 alcr    %r5,%r5
=> 0x00000000040483f0 <+304>:   d5 07 f0 a8 d0 18   clc 168(8,%r15),24(%r13)
   0x00000000040483f6 <+310>:   b9 94 00 22 llcr    %r2,%r2
   0x00000000040483fa <+314>:   e5 48 f0 c0 00 00   mvghi   192(%r15),0

I tried several different forms for the code but couldn't get it to stop. I don't expect anyone working on this repository to do anything about it now, but I'm planning on opening a GCC bug report and want something to point to.

real-or-random commented 4 years ago

My first thought was: Maybe first do the two ORs and only afterwards do the AND with the result and ~no?

ps: It even saves an op (when compiled naively)!

gmaxwell commented 4 years ago

Thanks for the suggestion. I already tried that. :( Didn't help. Though it didn't occur to me that it might be better generally. Sounds like a fine change. :P

real-or-random commented 4 years ago

Another volatile does the trick on s390x. The difference on other platforms seems minimal: https://godbolt.org/z/9rxTP7

With this fix applied, we have the same issue in secp256k1_scalar_is_high, which is very similar in structure, so the same fix should apply... I'm somewhat reluctant to make the code more complex for s390x but it would enable us to enable us the tests there, and this may help packaging, see the discussion in #723.

On the way, I wondered why some of the short functions are inline and some are not. For example secp256k1_scalar_check_overflow is and secp256k1_scalar_is_high and secp256k1_scalar_negate are not. This is now more relevant with GCC on -O2 whereas automatic inlining is disabled.

gmaxwell commented 4 years ago

Great find!

ugh. maybe it's just me but casting to volatile just seems even more awful than having a volatile temporary that copies an input as we've done in some other places. Also-- there are integer comparisons all over the code base and AFAIK the next version of the compiler will just create this problem for other ones.

I also feel like if I prompted GPT3 with a story about peppering our source code with volatile casts it would likely continue the story by saying and that's how we found 27 compiler bugs...

I'm going to see if I can solicit some help from a s390x expert or gcc developer.

Having a BE test in travis would be really good. But unlike some of the other constant time hacks in the codebase, I think this one is unlikely to help any other cpus/compilers or even silence the issue completely on s390x.

real-or-random commented 4 years ago

ugh. maybe it's just me but casting to volatile just seems even more awful than having a volatile temporary that copies an input as we've done in some other places.

Well yeah. The cast variant is in fact faster here (typically one instruction) because then only the read is volatile and not the store in the temporary.

But I was somewhat surprised that this works here. In the existing instances volatile hinders value analysis. It seems that GCC prefers clc when it can avoid load instructions. So if neither operand are in a register yet, then clc is used, and volatile just changes this.

Also-- there are integer comparisons all over the code base and AFAIK the next version of the compiler will just create this problem for other ones.

This may just be true, yeah.

I also feel like if I prompted GPT3 with a story about peppering our source code with volatile casts it would likely continue the story by saying and that's how we found 27 compiler bugs...

Hehe, maybe. Let's also add a few more restricts.

I'm going to see if I can solicit some help from a s390x expert or gcc developer.

This would be neat.

Having a BE test in travis would be really good.

Yes, and we don't need the ct test for this anyway.

But unlike some of the other constant time hacks in the codebase, I think this one is unlikely to help any other cpus/compilers

Agreed.

or even silence the issue completely on s390x.

Hm, I believe it would. At least on that GCC config, the one in this secp256k1_scalar_is_high was the only remaining branch...

edit: strike through totally wrong assumption, I misremembered this.

real-or-random commented 4 years ago

Funnily enough, it's also enough to make SECP256K1_N_0 a volatile const.

sipa commented 4 years ago

Funnily enough, it's also enough to make SECP256K1_N_0 a volatile const.

I have rarely felt the need for a "puke" reaction emoji on Github.

real-or-random commented 4 years ago

That's probably not helpful but I forgot to mention it in my comment yesterday: we don't even know if the clc instruction is variable-time or not in the CPU. (We may be able to benchmark it but I don't care in the end.)

Valgrind models clc as comparing byte-by-byte and conditionally branching to the next instruction if the two compared bytes are not equal. And this is the most precise way to model it. For example, the program could be comparing two strings of length 5 bytes where only the first 4 bytes are initialized and it's always guaranteed that the 2nd bytes differ. Then the program will be correct as it never branches on the 5th byte. But that does not mean that the CPU implements it that way same, in particular in that case where the length of the compared byte string is 8, which is just the native register size of the CPU...

gmaxwell commented 4 years ago

I found docs that said it was variable time at least on one older cpu in that line. As in it said that it was x NS + bytes * y NS. You're right thought that it could be different on more recent chips-- this is where talking to an expert would help, but even still, we'd be left with a valgrind FP so that wouldn't help us very much.

sipa commented 4 years ago

And it supports up to 255 byte long comparisons, which is unlikely to be running in constant time if it can bail out early.

Of course, it could work in a quantized way, comparing groups of say 8 bytes at a time, and comparisons of up to 8 would be ct then.