LLNL / zfp

Compressed numerical arrays that support high-speed random access
http://zfp.llnl.gov
BSD 3-Clause "New" or "Revised" License
772 stars 158 forks source link

Crashes due to undefined behaviour from signed integer overflow in rev_fwd_lift_int32 #241

Closed mjwillson closed 2 weeks ago

mjwillson commented 2 months ago

Hello,

This bug seems to consistently cause SIGILL crashes for me when compressing float data using reversible encoding. When running under UBSAN (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html ) the issue is revealed to be a signed integer overflow:

third_party/zfp/src/template/revencode.c:24:5: runtime error: signed integer overflow: 1547771904 - -1409359872 cannot be represented in type 'int32' (aka 'int')                                                 
    #0 0x55d22fcec482 in rev_fwd_lift_int32 third_party/zfp/src/template/revencode.c:24:5                
    #2 0x55d22fceb26d in rev_encode_block_int32_2 third_party/zfp/src/template/revencode.c:60:3          
    #3 0x55d22fceb26d in rev_encode_block_float_2 third_party/zfp/src/template/revencodef.c:78:11        
    #4 0x55d22fceb26d in zfp_encode_block_float_2 third_party/zfp/src/template/encodef.c:98:28           
    #5 0x55d22fceb72a in zfp_encode_block_strided_float_2 third_party/zfp/src/template/encode2.c:50:10   
    #6 0x55d22fd0233f in compress_strided_float_2 third_party/zfp/src/template/compress.c:54:9           
    #7 0x55d22fd01bf2 in zfp_compress third_party/zfp/src/zfp.c:1116:3
...

Is this a bug, or were you relying on wraparound overflow happening silently here? Unfortunately it's undefined behaviour and at least some compilers are going to generate crashing code.

lindstro commented 2 months ago

Indeed, that looks like a faulty assumption that integer overflow wraps. One obvious fix would be to use unsigned arithmetic here, though that requires some changes that propagate across a few functions. I'll see if I can get a fix done in the next few days.

Out of curiosity, what compiler and hardware are you using?

mjwillson commented 2 months ago

Thanks! Hardware nothing unusual (amd64), the compiler is clang, but in google's build environment which is passing in a lot of flags by default and from my experience tends to be quite cautious around undefined behaviour (i.e. err on the side of crashing).

lindstro commented 2 months ago

It seems we're about to open up Pandora's box. These UB issues are not confined just to signed integer overflow in the reversible zfp algorithm, but the same can evidently happen in the far more important non-reversible algorithm. For example, a 1D float block { 7.f, 7.f, 7.f, 7.f } encoded with maxprec = 2 results in decoded input x = 230, y = z = w = 0 to inv_lift(): https://github.com/LLNL/zfp/blob/5c976d8da013988174f931845862b6f94119cade/src/template/decode.c#L8-L34

On line 26, x <<= 1 results in x = 231 (regardless of how we implement this doubling of x; see below), which overflows this 32-bit signed integer. Of course, this is temporary since x -= z would restore x to 230, but by then it's too late.

This is a larger challenge as the non-reversible algorithm relies on left and right shifts of signed quantities, with the assumption that right shifts are arithmetic (which is implementation-defined rather than undefined behavior). Worse yet, there are left shifts by one (to invert right shifts by one), and those are UB on signed integers (who knew?!). We can address those by replacing x <<= 1 with x *= 2 or x += x, but we still need to avoid integer overflow.

One potential solution might be to consolidate x += x; x -= z; as x += x - z, though additional work would be needed to determine whether that is safe. Clearly, if neither z += x nor x -= z overflows, then this solution ought to be valid since x - z is simply -z before z += x was executed. I believe we should never have z = -231 as input, which (using two's complement) would overflow when negated, but if that were a concern, then x -= z - x might do the trick. One drawback of this approach is that we have now left the realm of lifting, but so be it.

We could also implement this with additional variables:

  Int x2 = x1 - z1; Int z2 = z1 + x1;

Or using a temporary:

  Int t = x; x -= z; z += t;

If the above still does not prevent overflow, then the only solution I can think of is to perform all arithmetic using unsigned integers, but that introduces complications with implementing arithmetic right shifts. Instead of x >>= 1, we'd have to use some contrived code like x = x & 0x80000000u ? ~(~x >> 1) : x >> 1, or perhaps x = (x & 0x80000000u) | (x >> 1). The cleanest would be to use sal() and sar() macros or functions for arithmetic shifting, but I am concerned about how this would impact performance and code readability. Of course, that is to be preferred over UB, though in practice we've gone 10 years without anyone noticing this in C and C++ implementations (and more recently in CUDA, HIP, and SYCL) using many different compilers and platforms. I guess it's hard to find an architecture that does not use two's complement and arithmetic right shifts with signed integers these days.

Let me think about how to best address this. It unfortunately seems to be a larger issue than what can be addressed in a couple of days.

mjwillson commented 2 months ago

Thank you and sorry about the can of worms!

For now we are able to work around this by compiling with -fwrapv -fno-sanitize=signed-integer-overflow, I believe these are supported by clang and gcc but have only tried with clang.

-fwrapv is "Treat signed integer overflow as two’s complement": https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-fwrapv https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html#:~:text=fwrapv

Ideally I think it would be better to implement in a portable way, but if these flags are needed then good to document this and add to your build config here, ideally only for a minimal required set of files.

lindstro commented 2 months ago

I believe this should be fixed in zfp, as valid input data should never invoke UB. This is somewhat different from #242, where the user violates zfp's assumptions on all-numerical inputs. I'm, however, unsure about the best path forward. My sense is that we need to do some work to prove that any one of the suggested changes above is guaranteed to avoid UB, but it will take some time. Fortunately we have very smart people at LLNL working on exactly such correctness problems, and I'm hoping we can solicit their help.

lindstro commented 1 month ago

@mjwillson: #241 is being addressed on the bugfix/integer-overflow branch. Please check if the fix works for you.

We're currently struggling with tests passing in general because of outdated scripts and some LLNL internal issues with GitLab. We hope to resolve these issues in the coming weeks.

lindstro commented 1 month ago

@mjwillson I would like to merge this fix into develop, but I would prefer to hear from you first to make sure it addresses all of your concerns.

lindstro commented 2 weeks ago

I have merged this fix and am closing the issue. Please reopen if it does not address your needs.