dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0
631 stars 180 forks source link

Undefined behaviour found by clang sanitizer #236

Closed MartinKlang closed 8 months ago

MartinKlang commented 2 years ago

Running the test suite with the clang Undefined Behaviour sanitizer I'm getting a bunch of errors:

flatcc/src/compiler/parser.c:1276:19: runtime error: applying zero offset to null pointer
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior flatcc/src/compiler/parser.c:1276:19 in 
flatcc/external/hash/cmetrohash.h:62:22: runtime error: load of misaligned address 0x562a0e5a99fe for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x562a0e5a99fe: note: pointer points here
 6e 75 6d 20 43 6f  6c 6f 72 3a 62 79 74 65  20 7b 20 52 65 64 20 3d  20 30 2c 20 47 72 65 65  6e 2c
             ^ 

and many more.

I'm configuring and compiling like this:

$ mkdir build && cd build
$ LDFLAGS="-fsanitize=undefined" CFLAGS="-fsanitize=undefined" cmake -DFLATCC_TEST=ON ..
$ make

The reason this came up is that we use sanitizers to check our project code, one part of which uses flatcc. When flatccrt is compiled with the same flags, the following error pops up:

flatcc/src/runtime/builder.c:208:13: runtime error: applying zero offset to null pointer
    #0 0x5f5952 in refresh_ds flatcc/src/runtime/builder.c:208:13
    #1 0x60106c in flatcc_builder_start_table flatcc/src/runtime/builder.c:1099:5
    ...
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior flatcc/src/runtime/builder.c:208:13

I'm using flatcc master branch as of today, on Ubuntu Linux with clang-14.

The cmetrohash functions in particular appear to do some crazy bit trickery that is raising red flags in the sanitizers, also when using -fsanitize=address and -fsanitize=integer. I assume this is tested, safe code. Do you have a policy for supporting, or not, this kind of error detection?

mikkelfj commented 2 years ago

I haven't spent much time with sanitizers. I did look at leak detection, but there were too many false positives to be worthwhile.

cmetrohash is a widely used hash function, but some years old. I doubt it would cause any problems, but it can be trivially replaced by something else. I'm not sure it is worthwhile for flatcc, but if is very important in your project, you can just drop in another hash - I do have some you could test - possibly xxhash32 or xxhash64. Generally the issue is how words a loaded. They must be cast to char * and/or memcpy into a target word - I haven't looked at the metrohash code.

I did have a branch where I converted all runtime type casts to memcpy because this is formally the only portable way although flatcc uses union conversion which is also widely accepted and very unlikely to break. In the end it got too messy and there was a risk of poor performance on old compilers, so I dropped it.

I have not looked at all your locations, but the builder line 208 uses ds_ptr which is implemented as: https://github.com/dvidelabs/flatcc/blob/5ff2e604a3e5a44d3885ff6887890c689b0a6b59/src/runtime/builder.c#L180-L181 So I am guessing that the sanitizers does not like pointer math in cases where the base pointer is a null pointer, which could happen after an allocation failure. That should be a non-issue if allocation checks are in place, which they should be. I'm not sure I want to rearchitect all of the flatcc to work around such issues.

Unaligned reads and writes are generally accessed through https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/portable/punaligned.h And in runtime through https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_accessors.h

As to reported locations in the builder: I suspect you will also find something in the verifier and in generated code for reading buffers. These typically use the above accessor methods.

As to policy for flatcc, there are no firm rules, but guidelines are:

If you contribute fixes make sure to run scripts/test.sh.

mikkelfj commented 2 years ago

So my take on this, and probably any project using sanitizers: It may be good to run the checks to see if there are any problem areas, but to expect all code to pass sanitizer checks would be impractical - it takes us into C++ territory where most of the development time is spent on making the compiler happy - this is why git and not monotone became the dominating distributed source control system.

MartinKlang commented 2 years ago

Hi @mikkelfj, thank you for your reply, especially the clarification on your view on sanitisers. Different projects take different views on these things. I will do some more investigations on my end. If there's a simple fix to the runtime issue then I'll submit a PR.

mikkelfj commented 2 years ago

BTW: you could experiment with cast to size_t in the pointer math macro. I think the type is already size_t, but who knows.

r-barnes commented 2 years ago

I am also seeing undefined behaviour from the lines

 #define T_ptr(base, pos) ((void *)((uint8_t *)(base) + (uoffset_t)(pos))) 
 #define ds_ptr(pos) (T_ptr(B->buffers[flatcc_builder_alloc_ds].iov_base, (pos))) 

in builder.c. Often an early exit or a ternary operator to check for null can prevent this.

mikkelfj commented 2 years ago

Often an early exit or a ternary operator to check for null can prevent this.

But if the compiler cannot statically figure that there is no null pointer in the code, then a ternary will be a serious bottleneck, predictive branching or not.

There might of course be a bug where allocation is not checked properly, but otherwise I am not sure how to fix this. Basically the sanitizer is not smart enough, and the overhead of a runtime check at this location in the code is not acceptable. It is used everywhere, all the time.

r-barnes commented 2 years ago

@mikkelfj: What you're saying makes sense; however, this is a run-time issue, so a fix is only needed on the code path that gets us to the point I mentioned above. I should have provided this info from the get-go. Specifically, I the traceback I see is:

flatcc/src/runtime/builder.c:200:13: runtime error: applying zero offset to null pointer
    #0 0x7f8e226df652 in refresh_ds flatcc/src/runtime/builder.c:200
    #1 0x7f8e226e3fe5 in flatcc_builder_start_table flatcc/src/runtime/builder.c:1069

Printing the value of B->ds_first in refresh_ds shows that it sometimes has NULL as a value. So, I think:

    if(B->ds_first){
        B->ds = ds_ptr(B->ds_first);
        B->ds_limit = (uoffset_t)buf->iov_len - B->ds_first;
    }

is sufficient to ensure that this undefined behaviour doesn't occur. (Though I see other such UBN after looking into this one - I'm working through them now.)

r-barnes commented 2 years ago

I replaced a few

return B->ds + offset;

which were causing issues with

assert(B->ds != 0 || offset == 0); // If the data structure doesn't exist, offset should be zero
return B->ds ? B->ds + offset : NULL;

The assertion caused issues, so I commented it out.

This gets me to a new issue:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==1746790==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000004 (pc 0x7f0d033b088e bp 0x7ffd2bae1250 sp 0x7ffd2bae0ea0 T0)
==1746790==The signal is caused by a READ memory access.
==1746790==Hint: address points to the zero page.
SCARINESS: 10 (null-deref)
    #0 0x7f0d033b088e in flatcc_builder_create_table flatcc/src/runtime/builder.c:1244
    #1 0x7f0d033b3021 in flatcc_builder_end_table flatcc/src/runtime/builder.c:1325

I've adjusted the line numbers above to match the original, unmodified code, but it looks like this line is an issue:

        offset = *offset_field - base - offsets[i] - field_size;

At this point I agree with the sanitizer that things are getting a little scary since null dereferencing can be used in DoS and privilege escalation attacks (link).

mikkelfj commented 2 years ago

@r-barnes So you are saying the problem is (likely) here:

https://github.com/dvidelabs/flatcc/blob/5ff2e604a3e5a44d3885ff6887890c689b0a6b59/src/runtime/builder.c#L1274

when called from the function flatcc_builder_end_table?

I think I would need a minimum reproducible setup that works with clang 13.0.0 on MacOS, preferably with debug assertions added that highlight the problem before a sanitizer needs to raise an alert.

As to refresh_ds: Maybe it is worth figuring that out first, because the second problem (in create_table) might be a consequence of ignoring the first (in refesh_ds).

Ah wait: reading my own comment:

assert(B->ds != 0 || offset == 0); // If the data structure doesn't exist, offset should be zero
return B->ds + offset;

I think what happens here is that the code is optimized to return NULL if the ds stack is currently unallocated without having to do an extra runtime check. The offset should in this case be zero for this to work. This is what the assertion claims, and since the assertion presumably does not kick in if people are using this in debug builds, that is correct so far.

However It is not valid to add zero to a null pointer, and this is what the sanitizer complains about.

I am curious why the assertion causes problems? What is it doing wrong? Or is it in fact raising an assertion because the invariant has been broken?

If I remember correctly, refresh_ds is not called very often. I think it it reserves some space on a stack with enough space to handle a single table without additional checks, so likely called about once per table. In contrast ds_ptr is used all the time.

So I don't yet have an answer to how to fiks this without a performance penalty, but I am open to suggestions. One solution might be:

return (uint8_t *)((size_t)B->ds + (size_t)offset); /* Cast to avoid UB warning on zero added to NULL. */

This also makes it unlikely to cause the second problem in create_table.

mikkelfj commented 2 years ago

To the original issue:

#define T_ptr(base, pos) ((void *)((uint8_t *)(base) + (uoffset_t)(pos))) 

This can likely be fixed using

#define T_ptr(base, pos) ((void *)((size_t)(base) + (size_t)(uoffset_t)(pos))) 

or

#define T_ptr(base, pos) ((void *)((size_t)(uint8_t *)(base) + (size_t)(uoffset_t)(pos))) 
mikkelfj commented 2 years ago

So I don't recall all details, but I think it is worth noting for context, that the builder operates on a number of small stacks and other temporary. Pointers are avoided because these stacks may have to be reallocated. This is why offsets are being added in T_ptr and other places. Not all of these datastructures are necessarily required and may remain unallocated until required which happens via a "is there enough space on stack" check. By adding zero offsets to unused stacks some special cases can be avoided.

Historically: I think some stacks are now allocated with a minimum size even though not required because it became too tedious to deal with, but that was kind of a lazy hack to avoid making the code robust against special cases that were not properly handled.

This comment does not relate to the issue at hand directly, it is just to help provide some insights for anyone contributing.

r-barnes commented 2 years ago

I'm worried your proposed casting solution might sweep a real problem under the rug.

I actually added the assertion

assert(B->ds != 0 || offset == 0); // If the data structure doesn't exist, offset should be zero

to highlight the problem. If it were always the case that B->ds being null meant the offset being added was 0, then things would be fine. As it is, sometimes B->ds is null and the offset is not zero. This is problematic.

If I change CMakeLists.txt to include

if (CMAKE_C_COMPILER_ID MATCHES "Clang")
    set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=undefined")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=undefined")

then the run the tests, I do see undefined behaviour. I'll open a reproducing PR.

r-barnes commented 2 years ago

237 adds the undefined behaviour sanitizer to the Debug build mode and fixes the three instances of undefined nullptr math this reveals. This fixes the UBN issue in both flatcc's test suite as well as my internal one.

I do see instances of misaligned address loads that I'm not sure how to fix.

mikkelfj commented 2 years ago

As it is, sometimes B->ds is null and the offset is not zero. This is problematic.

Yes it is.

I'm worried your proposed casting solution might sweep a real problem under the rug.

Not if assertions are in place.

Thanks for adding PR, I'll have a look.

I do see instances of misaligned address loads that I'm not sure how to fix.

Can you detail more? Generally FlatBuffers should be aligned, but the issue could also originate in the compiler source. I already mentioned how unaligned are handled in the portable library in my first comment to this issue. Maybe you are hitting the Metrohash issue that OP mentioned? We can add xxhash if important and if it helps.

r-barnes commented 2 years ago

The misaligned loads I'm seeing look similar to OP's. There are many, but they are all similar to:

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/rick/projects/contributing/flatcc/external/hash/cmetrohash.h:67:22 in 
[50/122] Running utility command for gen_monster_test_json
/home/rick/projects/contributing/flatcc/external/hash/cmetrohash.h:57:22: runtime error: load of misaligned address 0x559822b944ee for type 'uint64_t' (aka 'unsigned long'), which requires 8 byte alignment
0x559822b944ee: note: pointer points here
 6e 75 6d 20 46 72  6f 6d 49 6e 63 6c 75 64  65 3a 6c 6f 6e 67 20 7b  20 49 6e 63 6c 75 64 65  56 61
             ^ 
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/rick/projects/contributing/flatcc/external/hash/cmetrohash.h:57:22 in 
/home/rick/projects/contributing/flatcc/external/hash/cmetrohash.h:67:22: runtime error: load of misaligned address 0x559822b944c1 for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment
0x559822b944c1: note: pointer points here
 72 69 62  75 74 65 22 3b 0a 0a 6e  61 6d 65 73 70 61 63 65  20 4d 79 47 61 6d 65 2e  4f 74 68 65 72
              ^ 

If you clone https://github.com/r-barnes/flatcc and run

mkdir build
cd build
CXX=clang++ CC=clang cmake -DCMAKE_BUILD_TYPE=Debug ..
make

That should reproduce them (at least with clang-14).

mikkelfj commented 1 year ago

@r-barnes Thanks, as you can see in above reference there is a new issue with link time optimizations (LTO) affecting either in cmetrohash or in interfacing code. I added a tmp. hash branch where xxhash can be used for experimentation, but it is not integrated. You are welcome to also give it a go if you like.

mikkelfj commented 8 months ago

I just made a commit to master branch that changes the line mentioned earlier: #define T_ptr(base, pos) ((void *)((uint8_t *)(base) + (uoffset_t)(pos))) such that it casts to size_t before summing, then to void *.

As to metrohash, I keep hearing about it but I don't get any warnings even though the sanitizer flag is now added to debug as per input from @r-barnes in separate PR.

I tried the branch https://github.com/r-barnes/flatcc and build it as described above. There are some new warnings as errors that needs to be fixed, and have been on the flatcc master branch (unused assignment, main() vs main(void)), but I get nothing on cmetrohash.

This is with clang 15 on MacOS arm64, but the change I made with the sanitize flag on the master branch has also been running many other clang versions.

metrohash can be replaced with xxhash if it makes a difference, but currently I cannot reproduce.

mikkelfj commented 8 months ago

A while back I made the branch hash that introduces xxhash but got no feedback in #238 . I have now updated that branch by merging master into it.

mikkelfj commented 8 months ago

I'm closing this as I cannot reproduce and got no feedback. It may be that the hash gets updated as some point.