dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0
631 stars 180 forks source link

Anti-Aliasing Bug (Initialize pre-generated message, then modify it) #274

Closed RafaelLaya closed 3 months ago

RafaelLaya commented 4 months ago

bug_for_flatcc.zip

Hi everyone! This is my first bug report contribution into the FlatCC project. Thanks for making such an awesome project! Here are the details:

Files attached

Explanation of the source code provided in tempfiles.c

The code in question is in the function interface_encoding_populate_general_rsp(). writes a message determined at compile-time into an array in the stack, and then finishes the message and copies into a buffer given by the caller.

If one steps through the code, essentially the following happens:

From the initialization of msg_array we can see everything should be initialized correctly, then Response_as_root_v2() and Reponse_data_v2() will work exactly as described above.

Actual Issue

The compiler smartly determines that it needs to look at msg_array[4] and check if it is less than 8. However, it reads msg_array[4] into R5 which has garbage in the stack. Then it initializes msg_array in the stack, and then compares R5 with 7 (branches to assert if less-or-equal). The problem is that by the time the comparison is done, the array is partly initialized and R5 has garbage from the stack before the initialization of the array

a: f8bd 5004 ldrh.w r5, [sp, #4] e: cc0f ldmia r4!, {r0, r1, r2, r3} 10: e8ac 000f stmia.w ip!, {r0, r1, r2, r3} 14: cc0f ldmia r4!, {r0, r1, r2, r3} 16: 2d07 cmp r5, #7

Schema

The relevant portion looks like this: A table that requires an union inside, and the union has one of many tables

table MyTableResponse{ // Some integers } union ResponseData { // A bunch of tables /// Generic response (success w/ no payload or error). MyTable: MyTableResponse, }

table Response { data: ResponseData (required); }

Background

I found this by Aardappel which seems to be an user on stackoverflow who is highly involved with the project: https://stackoverflow.com/questions/24330925/does-flatbuffers-avoid-strict-aliasing-somehow

Then in theory, if you first construct a FlatBuffer and then immediately read it, it could optimize across the writing and reading code, and do "evil" optimisations of the kind Linus was referring to in your link above (pretend the writing never happened).

Workarounds

RafaelLaya commented 4 months ago

Actually taking again a look at this, this line looks dangerous:

vt__tmp = (flatbuffers_voffset_v2_t*)((uint8_t*)(t__tmp)-__flatbuffers_soffset_read_from_pe_v2(
    t__tmp));

  The table is uint8_t* but we cast into a uint16_t* type, so perhaps this is where the compiler assumes it can't alias? 
mikkelfj commented 4 months ago

OK, I'd like to look into this issue if there is anything I can do. So I don't want to reject this up front. But I am not convinced that this is a FlatCC issue.

I cannot easily reproduce your project because I do not have the tool chain, nor necessarily the hardware, you are using. What I need is a schema and C source that can be compiled, and which can trigger warnings about aliasing, possibly by adding more flags to the clang compiler that I use on macOS. Alternatively that you do the same on your platform and list the warnings. If warnings cannot be generated, I'm inclined to assume it is a compiler optimization bug.

Now, given that I don't have enough information, I can speculate:

First, as Wouter AardAppel writes in the link you reference, reading is by definition read only, so aliasing should not result in different data unless you have UB (undefined behaviour) in the implementation.

I am not sure where the problem is exactly, other than it sounds like a vtable lookup issue, but if so, it likely happens many other places for similar reasons, so I'll take this is a symptom, and use the line you posted in your second comment.

In C it is (relatively recently) undefined behaviour to cast a pointer to a type it is not, except if the pointer is 'char *' which is explicitly valid, and necessary, to do pointer math.

Now, this problem is mostly with accessing data the compiler generates, but here we are reading static data from a buffer, so the compiler cannot shuffle things around. Also, we are casting to uint8_t which for all practical purposes is the same as 'char ' and any compiler that cannot handle that, should be fixed as there is too much such code around.

After doing pointer math, the code is cast to the underlying correct type, and if you do things right on your side, it is also correctly aligned in memory. However, if your buffer is not aligned properly, some architectures might run slower, issue access violations, or do other unexpected stuff.

For this reason, the code you reference is not dangerous, except if accessing memory that is not properly aligned.

And as to GCC, I have already removed support for GCC pedantic warnings because they break portable C code, and if this turns to not be easily managed in FlatCC, it may require further downgrading GCC support, e.g. removing optimizations in recommended builds.

mikkelfj commented 4 months ago

For reference below is the code that is being discussed.

I link to the "reflection" code which is generated code that is checked into the repo, and the file is same for all flatcc projects of the same version, so this is not related to the reflection schema in particular.

https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/reflection/flatbuffers_common_reader.h

Early in the file we have the macro

#define __flatbuffers_read_vt(ID, offset, t)\
flatbuffers_voffset_t offset = 0;\
{   flatbuffers_voffset_t id__tmp, *vt__tmp;\
    FLATCC_ASSERT(t != 0 && "null pointer table access");\
    id__tmp = ID;\
    vt__tmp = (flatbuffers_voffset_t *)((uint8_t *)(t) -\
        __flatbuffers_soffset_read_from_pe(t));\
    if (__flatbuffers_voffset_read_from_pe(vt__tmp) >= sizeof(vt__tmp[0]) * (id__tmp + 3u)) {\
        offset = __flatbuffers_voffset_read_from_pe(vt__tmp + id__tmp + 2);\
    }\
}

What cannot see from this is arguments being use to call this macro. If there are aliasing issues, this could be relevant. You could try to make the macro a static inline function which should optimize the same but perhaps better isolate contexts for the compiler.

However, the macro is already being used in such inline functions, it seems overkill to add yet another layer which is why it is not.

Here is an example from the same file:

#define __flatbuffers_define_scalar_field(ID, N, NK, TK, T, V)\
static inline T N ## _ ## NK ## _get(N ## _table_t t__tmp)\
{ __flatbuffers_read_vt(ID, offset__tmp, t__tmp)\
  return offset__tmp ? __flatbuffers_read_scalar_at_byteoffset(TK, t__tmp, offset__tmp) : V;\
}\
mikkelfj commented 4 months ago

Here is a weekly build that shows all the GCC versions being tested and pass. But that is for whatever architecture Github Actions use:

https://github.com/dvidelabs/flatcc/actions/runs/8047226724

mikkelfj commented 4 months ago

Further: the macro I mentioned calls a function __flatbuffers_soffset_read_from_pe(t) that you example line also calls as it is the same logic.

This function is defined in the flatcc_accessors file and use union cast which is generally believed to be a valid cast, at least respected by all living compilers. It could be argued that memcpy should be used, and I did work an that, but that got rather invasive and not worthwhile. Historically, memcpy would be slow, but lately it is understood to be a cast operator by most compilers.

https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_accessors.h

You could try to play around with different cast implementations, or replace the read function.

The __flatbuffers_soffset_read_from_pe(t) function takes a pointer to an aligned 32-bit memory address and returns a signed 32-bit integer (int32_t). If the platform is big-endian, it also swaps the bytes before returning the result.

pe means protocol endian, meaning it converts between host and the protocol endianness, which is little endian for flatbuffers, but other formats can also be supports.

RafaelLaya commented 4 months ago

Thanks! You are right, I could've done a better job producing a seamless repro. Here it is https://github.com/RafaelLaya/flatcc/pull/1 I forked FlatCC and modified the Monster example to show this. You should be able to cd mymonster && scripts/build.sh && ./build/mymonster and it should work. Just make sure to have arm-none-eabi-gcc available in your system. It will run on your system and work fine, but it will also build for ARMv7 and produce out.lst which is the dissasembly. From the dissasembly, we can clearly see the compiler produces code where we will be reading garbage from the stack and making decisions based on this garbage (which leads to an assertion)

Thanks! Please let me know if you come to any conclusions

mikkelfj commented 4 months ago

I had a look at your PR but don't currently have a tool chain to test with. I will see what I can do. Meanwhile, please consider experimenting with replacing functions in the accessors and common files discussed above. This is where a possible FlatCC fix would happen.

That said, it really looks like a compiler bug but I will not rule out anything since these things are notoriously tricky.

I cannot see any scenerio where the compiler should think it is a good idea to dereference memory it is responsible for initializing before it has initialized that memory area, which is what it does when dereferencing the stack pointer after reserving space. Of course someone could argue that the optimizer need not check for that if the code it compiles is correct, but I find that a hard sell, and the code works on numerous other platforms, also with GCC compiler.

In the PR you have a correctly aligned memory buffer that is compiler initialized. Aside from the unfortunately named builder_t type that should have been buffer_t or something it looks OK in source.

Just to explain what happens in the buffer content and what the C source is supposed to do: offset 0: 4, 0, 0, 0 : find the root table at offset 4 by adding to the buffer start location (which must be aligned). Now we have get_as_root. offset 4: negative 4 byte vtable offset. Subtract from current offset 4: 4 - (230 - 256) = 30. offset 30: 8, 0, 9, 0, 8, 0, 4, 0 : this is the vtable in 16-bit elements the first two bytes 8, 0 is the vtable size. The second two bytes 9, 0 is the table size of the referencing table Table field with ID 0 is then voffset 8, 0 relative to the table at offset 4, so field ID zero starts at buffer offset 4 + 8 = 12. Table field with ID 1 is then voffset 4, 0 at buffer offset 4 + 4 = 8

For reference: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example

The soffset read function discussed above fetches the offset from one field in the vtable, and the compiler is supposed to optimize that read to relative to the vtable start very efficiently. It seems too efficiently in this case.

#define TABLE_PREDETERMINED_MSG  {4, 0, 0, 0, 230, 255, 255, 255, 8, 0, 0, 0, 1, 0, 0, 0, 248, 255, 255, 255, 128, 255, 255, 255, 6, 0, 8, 0, 4, 0, 8, 0, 9, 0, 8, 0, 4, 0, 0, 0}

/* msg_buffer 
typedef struct {
    uint8_t buffer[40] __attribute__ ((aligned (8))); 
} builder_t;

int run_test(void) {
    builder_t msg = {TABLE_PREDETERMINED_MSG};
    ns(Message_table_t) msg_table_root = ns(Message_as_root(msg.buffer));
    ns(CodeTable_table_t) table_ptr = ns(Message_data(msg_table_root));
    (void)table_ptr;
    /* ... */
    return 0;
}
mikkelfj commented 4 months ago

OK, I downloaded the eabi-none toolchain and tested, not the PR but the original zip project you provided as I had that at hand already, and it turns out to be rather useful for modifying the flatcc library code for testing. However, I modified the buffer type to be aligned, as you also have done in the PR branch.

I get the assembly you documented, i.e. I can reproduce.

I then modified the function that reads the soffset to find the vtable location to make it use memcpy:

static inline flatbuffers_soffset_v2_t __flatbuffers_soffset_read_from_pe_v2(const void* p) {
#if 1 /* modified version */
  flatbuffers_soffset_v2_t word;
  memcpy(&word, p, sizeof(word));
  return __flatbuffers_soffset_cast_from_pe_v2(word);
#else /* original flatcc library code */
  return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);
#endif
}

And with the compile flags you provided, that is -O2 enabled, I get the following, which superficially seems to be better. But the memcpy is horribly ineffecient. The code does none of the memcpy is a cast optimizations, which is also one reason I did not make that change to accessors early - too many compilers that might do that slow.

00000000 <interface_encoding_populate_general_rsp>:
   0:   b570        push    {r4, r5, r6, lr}
   2:   4c15        ldr r4, [pc, #84]   @ (58 <interface_encoding_populate_general_rsp+0x58>)
   4:   b08c        sub sp, #48 @ 0x30
   6:   4606        mov r6, r0
   8:   f10d 0c08   add.w   ip, sp, #8
   c:   ad05        add r5, sp, #20
   e:   cc0f        ldmia   r4!, {r0, r1, r2, r3}
  10:   e8ac 000f   stmia.w ip!, {r0, r1, r2, r3}
  14:   cc0f        ldmia   r4!, {r0, r1, r2, r3}
  16:   e8ac 000f   stmia.w ip!, {r0, r1, r2, r3}
  1a:   e894 0003   ldmia.w r4, {r0, r1}
  1e:   2204        movs    r2, #4
  20:   e88c 0003   stmia.w ip, {r0, r1}
  24:   4629        mov r1, r5
  26:   eb0d 0002   add.w   r0, sp, r2
  2a:   f7ff fffe   bl  0 <memcpy>
  2e:   9b01        ldr r3, [sp, #4]
  30:   1aed        subs    r5, r5, r3
  32:   882b        ldrh    r3, [r5, #0]
  34:   2b07        cmp r3, #7

I would say that this isn't really an acceptable solution.

mikkelfj commented 4 months ago

If I do not modify the soffset reader, but change the build target:

# original flags
#FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m7 -mthumb -mabi=aapcs -std=c99"

FLAGS="-O2 -Wall -Wextra -ffreestanding -mthumb -mabi=aapcs -std=c99"

then I get the following which looks more sane superficially:

00000000 <interface_encoding_populate_general_rsp>:
   0:   b5f0        push    {r4, r5, r6, r7, lr}
   2:   b08b        sub sp, #44 @ 0x2c
   4:   4669        mov r1, sp
   6:   4c0f        ldr r4, [pc, #60]   @ (44 <interface_encoding_populate_general_rsp+0x44>)
   8:   000b        movs    r3, r1
   a:   0022        movs    r2, r4
   c:   cae0        ldmia   r2!, {r5, r6, r7}
   e:   c3e0        stmia   r3!, {r5, r6, r7}
  10:   cae0        ldmia   r2!, {r5, r6, r7}
  12:   c3e0        stmia   r3!, {r5, r6, r7}
  14:   cae0        ldmia   r2!, {r5, r6, r7}
  16:   c3e0        stmia   r3!, {r5, r6, r7}
  18:   6812        ldr r2, [r2, #0]
  1a:   601a        str r2, [r3, #0]
  1c:   888b        ldrh    r3, [r1, #4]
  1e:   2b07        cmp r3, #7
mikkelfj commented 4 months ago

I also don't get early stack reference with the original build flags, but cortex-m0 or cortext-m1 instead:

FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m0 -mthumb -mabi=aapcs -std=c99"
#FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m1 -mthumb -mabi=aapcs -std=c99"
${GCC_PATH}/arm-none-eabi-gcc $FLAGS -o out.obj -c tempfiles.c
mikkelfj commented 4 months ago

Regardless of who is to blame, I needed to warn about this issue in the cross compilation section. https://github.com/dvidelabs/flatcc?tab=readme-ov-file#cross-compilation

mikkelfj commented 4 months ago

I think technically this line violates strict aliasing, but adding attribute((__may_alias__)) does not seem to help regardless of where I put it.

return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);

As mentioned, memcpy seems to work but optimizes poorly. I also tried const char *q = p; return q[0] | (int32_t)(q[1]) << 8 | ...; and this seems to optimize better, but that is a real pain to deal with every time you want to read from a memory buffer. I also tried using a malloc buffer instead of a stack allocated array and this also seems to work better. All of this from looking at assembly as I cannot test the target.

If there is no better way to support reading two bytes from memory without conflicting with the C compiler, I'm not sure it is worth supporting.

mikkelfj commented 4 months ago

Using __builtin_memcpy

static inline flatbuffers_soffset_v2_t __flatbuffers_soffset_read_from_pe_v2(const void* p) {
#if 0
  flatbuffers_soffset_v2_t word;
  memcpy(&word, p, sizeof(word));
  return __flatbuffers_soffset_cast_from_pe_v2(word);
#elif 1
  flatbuffers_soffset_v2_t word;
  __builtin_memcpy(&word, p, sizeof(word));
  return __flatbuffers_soffset_cast_from_pe_v2(word);
#elif 0
  flatbuffers_soffset_v2_t word;
  word = ((int32_t)((const char *)p)[0]) | ((int32_t)((const char *)p)[1]) << 16 |
         ((int32_t)((const char *)p)[2]) << 16 | ((int32_t)((const char *)p)[3]) << 24;
  return word;
#else
  return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);
#endif
}
00000000 <interface_encoding_populate_general_rsp>:
   0:   b510        push    {r4, lr}
   2:   4c0a        ldr r4, [pc, #40]   @ (2c <interface_encoding_populate_general_rsp+0x2c>)
   4:   b08a        sub sp, #40 @ 0x28
   6:   4686        mov lr, r0
   8:   46ec        mov ip, sp
   a:   cc0f        ldmia   r4!, {r0, r1, r2, r3}
   c:   e8ac 000f   stmia.w ip!, {r0, r1, r2, r3}
  10:   cc0f        ldmia   r4!, {r0, r1, r2, r3}
  12:   e8ac 000f   stmia.w ip!, {r0, r1, r2, r3}
  16:   e894 0003   ldmia.w r4, {r0, r1}
  1a:   2228        movs    r2, #40 @ 0x28
  1c:   e88c 0003   stmia.w ip, {r0, r1}
  20:   4669        mov r1, sp
  22:   4670        mov r0, lr
  24:   f7ff fffe   bl  0 <memcpy>
  28:   b00a        add sp, #40 @ 0x28
  2a:   bd10        pop {r4, pc}
  2c:   00000000    .word   0x00000000
mikkelfj commented 4 months ago

I added a new branch fix-accessors with currently two commits https://github.com/dvidelabs/flatcc/tree/fix-accessors

It adds pmemaccess.h to provide __builtin_memcpy and other methods to read and write memory. It also updates flatcc_accessors.h to use these instead of (T )p casts for read and write.

Note that pmemaccess still uses pointer casts on x86/64 since it has worked so far, and memcpy might not work well on all platforms.

There is a related issue with punaligned.h that reads unaligned memory, mostly for JSON parsing. It already does something similar with pointer access on x86/64 and not on other platforms.

mikkelfj commented 4 months ago

Just for the record: @RafaelLaya commented directly on the fix-accessors commits.

Also, I'm going to do some further updates to call mem_copy_read_nn and mem_copy_write_nn directly from flatcc_accessors, instead of calling mem_copy_word directly. This will make it possible to use pointer casts like *(T *)p as before, which is probably better on older MSVC compilers and possible other targets. On ARM / gcc the effect will be the same, calling mem_read_16 -> mem_copy_word -> builtin_copy, but I need to work on it a bit more to pass tests.

The behaviour can also be controlled by -DPORTABLE_MEM_PTR_ACCESS=0 (or 1), and also by defining mem_copy_word before header inclusion. This might be relevant on some targets, since this area of operation is extremely performance sensitive, and also code size sensitive.

mikkelfj commented 4 months ago

I have now update the fix-accessors branch. I needed to add mem_read/write_float_32/64 as well. It passes std CI tests.

If you are happy with the design, can you please confirm that it solves the issue on actual -mcu=cortext-m7 -O2 compiled targets, if possible?

I still think this is a compiler issue, it is not practical to disallow C code with simple pointer memory access, but at least we appear to have a workable solution that is not too ugly.

RafaelLaya commented 4 months ago

Sorry I reviewed last night and forgot to respond here. I like that you added customizability of mem_copy_word and yes I'm happy with the design. I'm sure from the assembly it should be fixed now, but I'll try to check on a target just in case as you suggested

I still think this is a compiler issue, it is not practical to disallow C code with simple pointer memory access, but at least we appear to have a workable solution that is not too ugly.

I do agree the compiler is being extremely aggressive here, undesirably so

I didn't feel like I had a strong case to file a gcc bug report, since the first thing they have in bold letter is a warning that if fno-strict-aliasing changes the behavior, then it's most probably on the programmer

Something that I can't seem to ignore is that we are first initializing the buffer and then passing it into the as_Root() function and then as_Data(). So regardless of any aliasing issues inside as_Root() or as_Data(), it should at the very least finish initialization before doing the calls. Wouldn't you agree?

Meaning, I think the compiler has the right to provide UB inside these calls since strict-aliasing is not respected. However, I feel like outside the call, the initialization should've finished before doing the calls that require the buffer as a parameter (regardless of inlining)

RafaelLaya commented 4 months ago

Update: Tested on an M7 Target. The issue is gone now and didn't observe any regressions :) -- Thanks a lot!

mikkelfj commented 3 months ago

Thanks for testing. I will merge this.

I like that you added customizability of mem_copy_word

It is a core feature of flatcc to isolate platform dependencies in the portable library, and to run well on many targets, so it is a natural extension of other similar code, like endian conversion and aligned allocation.

Something that I can't seem to ignore is that we are first initializing the buffer and then passing it into the as_Root() function and then as_Data(). So regardless of any aliasing issues inside as_Root() or as_Data(), it should at the very least finish initialization before doing the calls. Wouldn't you agree?

I agree. If you read earlier comment of mine I write that I cannot see any scenario where the compiler should dereference uninitialized stack content it is responsible for.

Having written some compiler backend logic, I can speculate that the compiler injects its own initializaition into the execution graph, then does liveness checking (dependency tracking), then culls any dependency links that violates strict aliasing, then finds two parallel root executation paths that are equally good: read from stack, or initialize stack.

However, this has less to do with optimization of the final output and more to do with how the compiler goes around achieving this.

mikkelfj commented 3 months ago

Merged. I updated notes in CHANGELOG 0.6.2, README cross compilation section.

mikkelfj commented 3 months ago

For the record, the following two commits on master implemented the fix:

pmemaccess.h: c994c4974148cfa7e6b179216ad2da1b01004028

flatcc_accessors.h: cc6e6e5663b65d11aedd54180255df241ba8fd6c

The old (by Intel) deprecated Intel ICC compiler stopped working after these changes unless -O3 is dropped to -O2. We are not sure why, and might just deprecate it from FlatCC weekly builds, but for the record, the only other change between successful and broken weekly build is an update to pstdbool.h: 81d4ba67b66266dc6a4ce761a6091a613ed46a49

mikkelfj commented 3 months ago

ICC build fixed in commit: 7625a37eccc7640afdbde951d579ca5834eadbdb It just defaults to not using pointer casts for that compiler. It will use __builtin_memcpy. It was same issue as this issue, but since we by default chooses non-strict pointer casts on x86/64, and ICC is sensitive to that on -O3, it triggered a segfault. This could just as well have happened earlier on, but was triggered by the very changes that were supposed to fix it, randomly.

A related commit just cleans up bocus ICC warnings: 56d2559caee08c52a7c6e097ea38a7f4b7d40c4a

mikkelfj commented 3 months ago

Added documentation section: https://github.com/dvidelabs/flatcc?tab=readme-ov-file#strict-aliasing