Open gmaxwell opened 4 years ago
My first thought was: Maybe first do the two ORs and only afterwards do the AND with the result and ~no
?
ps: It even saves an op (when compiled naively)!
Thanks for the suggestion. I already tried that. :( Didn't help. Though it didn't occur to me that it might be better generally. Sounds like a fine change. :P
Another volatile
does the trick on s390x. The difference on other platforms seems minimal:
https://godbolt.org/z/9rxTP7
With this fix applied, we have the same issue in secp256k1_scalar_is_high
, which is very similar in structure, so the same fix should apply...
I'm somewhat reluctant to make the code more complex for s390x but it would enable us to enable us the tests there, and this may help packaging, see the discussion in #723.
On the way, I wondered why some of the short functions are inline
and some are not. For example secp256k1_scalar_check_overflow
is and secp256k1_scalar_is_high
and secp256k1_scalar_negate
are not. This is now more relevant with GCC on -O2 whereas automatic inlining is disabled.
Great find!
ugh. maybe it's just me but casting to volatile just seems even more awful than having a volatile temporary that copies an input as we've done in some other places. Also-- there are integer comparisons all over the code base and AFAIK the next version of the compiler will just create this problem for other ones.
I also feel like if I prompted GPT3 with a story about peppering our source code with volatile casts it would likely continue the story by saying and that's how we found 27 compiler bugs...
I'm going to see if I can solicit some help from a s390x expert or gcc developer.
Having a BE test in travis would be really good. But unlike some of the other constant time hacks in the codebase, I think this one is unlikely to help any other cpus/compilers or even silence the issue completely on s390x.
ugh. maybe it's just me but casting to volatile just seems even more awful than having a volatile temporary that copies an input as we've done in some other places.
Well yeah. The cast variant is in fact faster here (typically one instruction) because then only the read is volatile and not the store in the temporary.
But I was somewhat surprised that this works here. In the existing instances volatile
hinders value analysis. It seems that GCC prefers clc
when it can avoid load instructions. So if neither operand are in a register yet, then clc
is used, and volatile
just changes this.
Also-- there are integer comparisons all over the code base and AFAIK the next version of the compiler will just create this problem for other ones.
This may just be true, yeah.
I also feel like if I prompted GPT3 with a story about peppering our source code with volatile casts it would likely continue the story by saying and that's how we found 27 compiler bugs...
Hehe, maybe. Let's also add a few more restrict
s.
I'm going to see if I can solicit some help from a s390x expert or gcc developer.
This would be neat.
Having a BE test in travis would be really good.
Yes, and we don't need the ct test for this anyway.
But unlike some of the other constant time hacks in the codebase, I think this one is unlikely to help any other cpus/compilers
Agreed.
or even silence the issue completely on s390x.
Hm, I believe it would. At least on that GCC config, the one in this secp256k1_scalar_is_high
was the only remaining branch...
edit: strike through totally wrong assumption, I misremembered this.
Funnily enough, it's also enough to make SECP256K1_N_0
a volatile const
.
Funnily enough, it's also enough to make
SECP256K1_N_0
avolatile const
.
I have rarely felt the need for a "puke" reaction emoji on Github.
That's probably not helpful but I forgot to mention it in my comment yesterday: we don't even know if the clc
instruction is variable-time or not in the CPU. (We may be able to benchmark it but I don't care in the end.)
Valgrind models clc
as comparing byte-by-byte and conditionally branching to the next instruction if the two compared bytes are not equal. And this is the most precise way to model it. For example, the program could be comparing two strings of length 5 bytes where only the first 4 bytes are initialized and it's always guaranteed that the 2nd bytes differ. Then the program will be correct as it never branches on the 5th byte. But that does not mean that the CPU implements it that way same, in particular in that case where the length of the compared byte string is 8, which is just the native register size of the CPU...
I found docs that said it was variable time at least on one older cpu in that line. As in it said that it was x NS + bytes * y NS. You're right thought that it could be different on more recent chips-- this is where talking to an expert would help, but even still, we'd be left with a valgrind FP so that wouldn't help us very much.
And it supports up to 255 byte long comparisons, which is unlikely to be running in constant time if it can bail out early.
Of course, it could work in a quantized way, comparing groups of say 8 bytes at a time, and comparisons of up to 8 would be ct then.
This is an odd architecture and is mostly interesting here just because its the only BE system available in the CI system.
It compiles the following simple code into variable time code:
The same thing happens in the analogous part of the 32-bit version of the function.
I tried all manner of compiler switches and could only make the situation worse-- with other similar comparisons becoming variable time.
The issue is that the architecture has an instruction which works functionally like memcmp, and the compiler will sometimes emit it for seemingly arbitrary comparisons.
I tried several different forms for the code but couldn't get it to stop. I don't expect anyone working on this repository to do anything about it now, but I'm planning on opening a GCC bug report and want something to point to.