Closed: RafaelLaya closed this issue 3 months ago.
Actually, taking another look at this, this line looks dangerous:
vt__tmp = (flatbuffers_voffset_v2_t*)((uint8_t*)(t__tmp)-__flatbuffers_soffset_read_from_pe_v2(
t__tmp));
The table pointer is uint8_t* but we cast it to a uint16_t* type, so perhaps this is where the compiler assumes the two can't alias?
OK, I'd like to look into this issue if there is anything I can do. So I don't want to reject this up front. But I am not convinced that this is a FlatCC issue.
I cannot easily reproduce your project because I do not have the tool chain, nor necessarily the hardware, you are using. What I need is a schema and C source that can be compiled and that triggers warnings about aliasing, possibly by adding more flags to the clang compiler I use on macOS. Alternatively, you could do the same on your platform and list the warnings. If no warnings can be generated, I'm inclined to assume it is a compiler optimization bug.
Now, given that I don't have enough information, I can speculate:
First, as Wouter (Aardappel) writes in the link you reference, reading is by definition read-only, so aliasing should not result in different data unless there is UB (undefined behaviour) in the implementation.
I am not sure where the problem is exactly, other than it sounds like a vtable lookup issue, but if so, it likely happens in many other places for similar reasons, so I'll take this as a symptom and use the line you posted in your second comment.
In C it is (made explicit relatively recently) undefined behaviour to access an object through a pointer of an incompatible type, except via 'char *', which is explicitly valid, and necessary, for pointer math.
Now, this problem is mostly about accessing data the compiler generates, but here we are reading static data from a buffer, so the compiler cannot shuffle things around. Also, we are casting to uint8_t *, which for all practical purposes is the same as 'char *', and any compiler that cannot handle that should be fixed, as there is too much such code around.
After doing pointer math, the code is cast to the underlying correct type, and if you do things right on your side, it is also correctly aligned in memory. However, if your buffer is not aligned properly, some architectures might run slower, issue access violations, or do other unexpected stuff.
For this reason, the code you reference is not dangerous, except if accessing memory that is not properly aligned.
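For illustration, a receive buffer can be given explicit alignment at the declaration site. This is a sketch, not FlatCC API; the type name is hypothetical, and the 8-byte figure is an assumption covering the largest FlatBuffers scalars (uint64/double) — the schema in use decides the real requirement.

```c
#include <stdalign.h> /* C11 alignas */
#include <stdint.h>

/* Hypothetical aligned message buffer. FlatBuffers scalars can be up
 * to 8 bytes wide, so 8-byte alignment covers the worst case. A plain
 * uint8_t array would only be guaranteed 1-byte alignment. */
typedef struct {
    alignas(8) uint8_t data[256];
} aligned_msg_buffer_t;
```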
And as to GCC, I have already removed support for GCC pedantic warnings because they break portable C code, and if this turns to not be easily managed in FlatCC, it may require further downgrading GCC support, e.g. removing optimizations in recommended builds.
For reference below is the code that is being discussed.
I link to the "reflection" code, which is generated code that is checked into the repo, and the file is the same for all flatcc projects of the same version, so this is not related to the reflection schema in particular.
Early in the file we have the macro
#define __flatbuffers_read_vt(ID, offset, t)\
flatbuffers_voffset_t offset = 0;\
{ flatbuffers_voffset_t id__tmp, *vt__tmp;\
FLATCC_ASSERT(t != 0 && "null pointer table access");\
id__tmp = ID;\
vt__tmp = (flatbuffers_voffset_t *)((uint8_t *)(t) -\
__flatbuffers_soffset_read_from_pe(t));\
if (__flatbuffers_voffset_read_from_pe(vt__tmp) >= sizeof(vt__tmp[0]) * (id__tmp + 3u)) {\
offset = __flatbuffers_voffset_read_from_pe(vt__tmp + id__tmp + 2);\
}\
}
What we cannot see from this is the arguments being used to call this macro. If there are aliasing issues, this could be relevant. You could try to make the macro a static inline function, which should optimize the same but perhaps better isolate contexts for the compiler.
However, the macro is already being used in such inline functions, so it seems overkill to add yet another layer, which is why it was not done.
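As a sketch of that experiment, the macro body can be expressed as a static inline function. The helper names and types here are illustrative stand-ins, not the library's actual read_from_pe implementations: memcpy-based reads on an assumed little-endian host.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t flatbuffers_voffset_t;
typedef int32_t  flatbuffers_soffset_t;

/* Illustrative reads; assumes a little-endian host so a raw copy
 * already matches protocol endianness. */
static inline flatbuffers_soffset_t read_soffset(const void *p)
{
    flatbuffers_soffset_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static inline flatbuffers_voffset_t read_voffset(const void *p)
{
    flatbuffers_voffset_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Function version of __flatbuffers_read_vt: returns the table-relative
 * field offset for field `id`, or 0 if the vtable is too small. */
static inline flatbuffers_voffset_t read_vt(flatbuffers_voffset_t id, const void *t)
{
    const uint8_t *vt = (const uint8_t *)t - read_soffset(t);

    if (read_voffset(vt) >= sizeof(flatbuffers_voffset_t) * (id + 3u)) {
        return read_voffset(vt + sizeof(flatbuffers_voffset_t) * (id + 2u));
    }
    return 0;
}
```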
Here is an example from the same file:
#define __flatbuffers_define_scalar_field(ID, N, NK, TK, T, V)\
static inline T N ## _ ## NK ## _get(N ## _table_t t__tmp)\
{ __flatbuffers_read_vt(ID, offset__tmp, t__tmp)\
return offset__tmp ? __flatbuffers_read_scalar_at_byteoffset(TK, t__tmp, offset__tmp) : V;\
}\
Here is a weekly build that shows all the GCC versions being tested and passing. But that is for whatever architecture GitHub Actions uses:
Further: the macro I mentioned calls a function __flatbuffers_soffset_read_from_pe(t) that your example line also calls, as it is the same logic.
This function is defined in the flatcc_accessors file and uses a union cast, which is generally believed to be a valid cast, at least respected by all living compilers. It could be argued that memcpy should be used, and I did work on that, but it got rather invasive and did not seem worthwhile. Historically, memcpy would be slow, but lately it is understood to be a cast operator by most compilers.
https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_accessors.h
You could try to play around with different cast implementations, or replace the read function.
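For concreteness, the cast styles under discussion can be sketched like this (simplified: endian conversion is omitted, and these are not the library's actual definitions):

```c
#include <stdint.h>
#include <string.h>

/* Plain pointer cast: compiles to a single load, but formally an
 * aliasing violation when the storage is not an int32_t object. */
static inline int32_t read32_cast(const void *p)
{
    return *(const int32_t *)p;
}

/* memcpy: always defined behaviour. Most modern compilers fold the
 * call into a single load, but as seen later in this thread, not all
 * target/flag combinations do. */
static inline int32_t read32_memcpy(const void *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```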
The __flatbuffers_soffset_read_from_pe(t) function takes a pointer to an aligned 32-bit memory address and returns a signed 32-bit integer (int32_t). If the platform is big-endian, it also swaps the bytes before returning the result. "pe" means protocol endian, meaning it converts between the host and the protocol endianness, which is little endian for flatbuffers, but other formats can also be supported.
Thanks! You are right, I could've done a better job producing a seamless repro. Here it is: https://github.com/RafaelLaya/flatcc/pull/1 I forked FlatCC and modified the Monster example to show this. You should be able to run cd mymonster && scripts/build.sh && ./build/mymonster and it should work. Just make sure to have arm-none-eabi-gcc available on your system. It will run on your system and work fine, but it will also build for ARMv7 and produce out.lst, which is the disassembly. From the disassembly, we can clearly see the compiler produces code where we read garbage from the stack and make decisions based on this garbage (which leads to an assertion).
Thanks! Please let me know if you come to any conclusions
I had a look at your PR but don't currently have a tool chain to test with. I will see what I can do. Meanwhile, please consider experimenting with replacing functions in the accessors and common files discussed above. This is where a possible FlatCC fix would happen.
That said, it really looks like a compiler bug but I will not rule out anything since these things are notoriously tricky.
I cannot see any scenario where the compiler should think it is a good idea to dereference memory it is responsible for initializing before it has initialized that memory area, which is what it does when dereferencing the stack pointer after reserving space. Of course someone could argue that the optimizer need not check for that if the code it compiles is correct, but I find that a hard sell, and the code works on numerous other platforms, also with the GCC compiler.
In the PR you have a correctly aligned memory buffer that is compiler initialized. Aside from the unfortunately named builder_t type, which should have been buffer_t or something, it looks OK in source.
Just to explain what happens in the buffer content and what the C source is supposed to do:
offset 0: 4, 0, 0, 0 : find the root table at offset 4 by adding to the buffer start location (which must be aligned). Now we have get_as_root.
offset 4: negative 4-byte vtable offset. Subtract from current offset 4: 4 - (230 - 256) = 30.
offset 30: 8, 0, 9, 0, 8, 0, 4, 0 : this is the vtable in 16-bit elements. The first two bytes 8, 0 are the vtable size. The second two bytes 9, 0 are the table size of the referencing table.
Table field with ID 0 is then voffset 8, 0 relative to the table at offset 4, so field ID zero starts at buffer offset 4 + 8 = 12.
Table field with ID 1 is then voffset 4, 0, at buffer offset 4 + 4 = 8.
For reference: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#example
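The arithmetic above can be checked mechanically. This is a verification sketch (assuming a little-endian host for the memcpy reads), using the byte values of the buffer discussed in this thread; the function names are mine, not FlatCC's:

```c
#include <stdint.h>
#include <string.h>

/* Locate the root table's vtable: read the root offset at the buffer
 * start, then subtract the signed vtable offset stored at the table
 * location (4 - (-26) = 30 for the buffer in this thread). */
static uint32_t root_vtable_offset(const uint8_t *buf)
{
    uint32_t root;
    int32_t soffset;

    memcpy(&root, buf, sizeof root);
    memcpy(&soffset, buf + root, sizeof soffset);
    return (uint32_t)((int32_t)root - soffset);
}

/* Read 16-bit vtable entry `slot` (0: vtable size, 1: table size,
 * 2 and up: per-field voffsets). */
static uint16_t vtable_entry(const uint8_t *buf, uint32_t vt, unsigned slot)
{
    uint16_t v;
    memcpy(&v, buf + vt + 2u * slot, sizeof v);
    return v;
}
```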
The soffset read function discussed above fetches the offset from one field in the vtable, and the compiler is supposed to optimize that read to relative to the vtable start very efficiently. It seems too efficiently in this case.
#define TABLE_PREDETERMINED_MSG {4, 0, 0, 0, 230, 255, 255, 255, 8, 0, 0, 0, 1, 0, 0, 0, 248, 255, 255, 255, 128, 255, 255, 255, 6, 0, 8, 0, 4, 0, 8, 0, 9, 0, 8, 0, 4, 0, 0, 0}
/* msg_buffer */
typedef struct {
uint8_t buffer[40] __attribute__ ((aligned (8)));
} builder_t;
int run_test(void) {
builder_t msg = {TABLE_PREDETERMINED_MSG};
ns(Message_table_t) msg_table_root = ns(Message_as_root(msg.buffer));
ns(CodeTable_table_t) table_ptr = ns(Message_data(msg_table_root));
(void)table_ptr;
/* ... */
return 0;
}
OK, I downloaded the arm-none-eabi toolchain and tested, not the PR but the original zip project you provided, as I had that at hand already, and it turns out to be rather useful for modifying the flatcc library code for testing. However, I modified the buffer type to be aligned, as you also have done in the PR branch.
I get the assembly you documented, i.e. I can reproduce.
I then modified the function that reads the soffset to find the vtable location to make it use memcpy:
static inline flatbuffers_soffset_v2_t __flatbuffers_soffset_read_from_pe_v2(const void* p) {
#if 1 /* modified version */
flatbuffers_soffset_v2_t word;
memcpy(&word, p, sizeof(word));
return __flatbuffers_soffset_cast_from_pe_v2(word);
#else /* original flatcc library code */
return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);
#endif
}
And with the compile flags you provided, that is with -O2 enabled, I get the following, which superficially seems to be better. But the memcpy is horribly inefficient. The code does none of the memcpy-is-a-cast optimizations, which is also one reason I did not make that change to the accessors early on - too many compilers might be slow doing that.
00000000 <interface_encoding_populate_general_rsp>:
0: b570 push {r4, r5, r6, lr}
2: 4c15 ldr r4, [pc, #84] @ (58 <interface_encoding_populate_general_rsp+0x58>)
4: b08c sub sp, #48 @ 0x30
6: 4606 mov r6, r0
8: f10d 0c08 add.w ip, sp, #8
c: ad05 add r5, sp, #20
e: cc0f ldmia r4!, {r0, r1, r2, r3}
10: e8ac 000f stmia.w ip!, {r0, r1, r2, r3}
14: cc0f ldmia r4!, {r0, r1, r2, r3}
16: e8ac 000f stmia.w ip!, {r0, r1, r2, r3}
1a: e894 0003 ldmia.w r4, {r0, r1}
1e: 2204 movs r2, #4
20: e88c 0003 stmia.w ip, {r0, r1}
24: 4629 mov r1, r5
26: eb0d 0002 add.w r0, sp, r2
2a: f7ff fffe bl 0 <memcpy>
2e: 9b01 ldr r3, [sp, #4]
30: 1aed subs r5, r5, r3
32: 882b ldrh r3, [r5, #0]
34: 2b07 cmp r3, #7
I would say that this isn't really an acceptable solution.
If I do not modify the soffset reader, but change the build target:
# original flags
#FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m7 -mthumb -mabi=aapcs -std=c99"
FLAGS="-O2 -Wall -Wextra -ffreestanding -mthumb -mabi=aapcs -std=c99"
then I get the following which looks more sane superficially:
00000000 <interface_encoding_populate_general_rsp>:
0: b5f0 push {r4, r5, r6, r7, lr}
2: b08b sub sp, #44 @ 0x2c
4: 4669 mov r1, sp
6: 4c0f ldr r4, [pc, #60] @ (44 <interface_encoding_populate_general_rsp+0x44>)
8: 000b movs r3, r1
a: 0022 movs r2, r4
c: cae0 ldmia r2!, {r5, r6, r7}
e: c3e0 stmia r3!, {r5, r6, r7}
10: cae0 ldmia r2!, {r5, r6, r7}
12: c3e0 stmia r3!, {r5, r6, r7}
14: cae0 ldmia r2!, {r5, r6, r7}
16: c3e0 stmia r3!, {r5, r6, r7}
18: 6812 ldr r2, [r2, #0]
1a: 601a str r2, [r3, #0]
1c: 888b ldrh r3, [r1, #4]
1e: 2b07 cmp r3, #7
I also don't get the early stack reference with the original build flags if I target cortex-m0 or cortex-m1 instead:
FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m0 -mthumb -mabi=aapcs -std=c99"
#FLAGS="-O2 -Wall -Wextra -ffreestanding -mcpu=cortex-m1 -mthumb -mabi=aapcs -std=c99"
${GCC_PATH}/arm-none-eabi-gcc $FLAGS -o out.obj -c tempfiles.c
Regardless of who is to blame, I needed to warn about this issue in the cross compilation section. https://github.com/dvidelabs/flatcc?tab=readme-ov-file#cross-compilation
I think technically this line violates strict aliasing, but adding __attribute__((__may_alias__)) does not seem to help regardless of where I put it.
return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);
As mentioned, memcpy seems to work but optimizes poorly.
I also tried const char *q = p; return q[0] | (int32_t)(q[1]) << 8 | ...;
and this seems to optimize better, but that is a real pain to deal with every time you want to read from a memory buffer.
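Written out fully, that byte-wise variant might look like the following sketch. Note the unsigned byte type: with plain (possibly signed) char, bytes above 127 would sign-extend before the shift and corrupt the high bits.

```c
#include <stdint.h>

/* Endian-independent read of a little-endian 32-bit signed offset,
 * one byte at a time; no aliasing concerns, since character-type
 * access to any object is always permitted. */
static inline int32_t read_soffset_bytewise(const void *p)
{
    const unsigned char *q = (const unsigned char *)p;

    return (int32_t)((uint32_t)q[0]
                   | (uint32_t)q[1] << 8
                   | (uint32_t)q[2] << 16
                   | (uint32_t)q[3] << 24);
}
```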
I also tried using a malloc buffer instead of a stack allocated array and this also seems to work better.
All of this from looking at assembly as I cannot test the target.
If there is no better way to support reading two bytes from memory without conflicting with the C compiler, I'm not sure it is worth supporting.
Using __builtin_memcpy:
static inline flatbuffers_soffset_v2_t __flatbuffers_soffset_read_from_pe_v2(const void* p) {
#if 0
flatbuffers_soffset_v2_t word;
memcpy(&word, p, sizeof(word));
return __flatbuffers_soffset_cast_from_pe_v2(word);
#elif 1
flatbuffers_soffset_v2_t word;
__builtin_memcpy(&word, p, sizeof(word));
return __flatbuffers_soffset_cast_from_pe_v2(word);
#elif 0
flatbuffers_soffset_v2_t word;
word = ((int32_t)((const unsigned char *)p)[0]) | ((int32_t)((const unsigned char *)p)[1]) << 8 |
((int32_t)((const unsigned char *)p)[2]) << 16 | ((int32_t)((const unsigned char *)p)[3]) << 24;
return word;
#else
return __flatbuffers_soffset_cast_from_pe_v2(*(flatbuffers_soffset_v2_t*)p);
#endif
}
00000000 <interface_encoding_populate_general_rsp>:
0: b510 push {r4, lr}
2: 4c0a ldr r4, [pc, #40] @ (2c <interface_encoding_populate_general_rsp+0x2c>)
4: b08a sub sp, #40 @ 0x28
6: 4686 mov lr, r0
8: 46ec mov ip, sp
a: cc0f ldmia r4!, {r0, r1, r2, r3}
c: e8ac 000f stmia.w ip!, {r0, r1, r2, r3}
10: cc0f ldmia r4!, {r0, r1, r2, r3}
12: e8ac 000f stmia.w ip!, {r0, r1, r2, r3}
16: e894 0003 ldmia.w r4, {r0, r1}
1a: 2228 movs r2, #40 @ 0x28
1c: e88c 0003 stmia.w ip, {r0, r1}
20: 4669 mov r1, sp
22: 4670 mov r0, lr
24: f7ff fffe bl 0 <memcpy>
28: b00a add sp, #40 @ 0x28
2a: bd10 pop {r4, pc}
2c: 00000000 .word 0x00000000
I added a new branch fix-accessors with currently two commits https://github.com/dvidelabs/flatcc/tree/fix-accessors
It adds pmemaccess.h to provide __builtin_memcpy and other methods to read and write memory. It also updates flatcc_accessors.h to use these instead of (T *)p casts for read and write.
Note that pmemaccess still uses pointer casts on x86/64 since it has worked so far, and memcpy might not work well on all platforms.
There is a related issue with punaligned.h that reads unaligned memory, mostly for JSON parsing. It already does something similar with pointer access on x86/64 and not on other platforms.
Just for the record: @RafaelLaya commented directly on the fix-accessors commits.
Also, I'm going to do some further updates to call mem_copy_read_nn and mem_copy_write_nn directly from flatcc_accessors, instead of calling mem_copy_word directly. This will make it possible to use pointer casts like *(T *)p as before, which is probably better on older MSVC compilers and possibly other targets. On ARM / gcc the effect will be the same, calling mem_read_16 -> mem_copy_word -> builtin_copy, but I need to work on it a bit more to pass tests.
The behaviour can also be controlled by -DPORTABLE_MEM_PTR_ACCESS=0 (or 1), and also by defining mem_copy_word before header inclusion. This might be relevant on some targets, since this area of operation is extremely performance sensitive, and also code size sensitive.
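As an illustration of that hook, something like the following could go before the flatcc includes. This is a hypothetical sketch only: the exact name and signature of mem_copy_word must be checked against pmemaccess.h, and the include path is an assumption.

```c
/* Hypothetical build-time override, defined before including flatcc
 * headers. Verify the real macro contract in pmemaccess.h first. */
#define PORTABLE_MEM_PTR_ACCESS 0   /* 0: prefer memcpy-style access */
#define mem_copy_word(dst, src, len) __builtin_memcpy((dst), (src), (len))

#include "flatcc/flatcc_accessors.h"
```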
I have now updated the fix-accessors branch. I needed to add mem_read/write_float_32/64 as well. It passes std CI tests.
If you are happy with the design, can you please confirm that it solves the issue on actual -mcpu=cortex-m7 -O2 compiled targets, if possible?
I still think this is a compiler issue, it is not practical to disallow C code with simple pointer memory access, but at least we appear to have a workable solution that is not too ugly.
Sorry I reviewed last night and forgot to respond here. I like that you added customizability of mem_copy_word and yes I'm happy with the design. I'm sure from the assembly it should be fixed now, but I'll try to check on a target just in case as you suggested
I still think this is a compiler issue, it is not practical to disallow C code with simple pointer memory access, but at least we appear to have a workable solution that is not too ugly.
I do agree the compiler is being extremely aggressive here, undesirably so
I didn't feel like I had a strong case to file a GCC bug report, since the first thing they have in bold letters is a warning that if -fno-strict-aliasing changes the behavior, then it's most probably on the programmer.
Something that I can't seem to ignore is that we are first initializing the buffer and then passing it into the as_Root() function and then as_Data(). So regardless of any aliasing issues inside as_Root() or as_Data(), it should at the very least finish initialization before doing the calls. Wouldn't you agree?
Meaning, I think the compiler has the right to provide UB inside these calls since strict-aliasing is not respected. However, I feel like outside the call, the initialization should've finished before doing the calls that require the buffer as a parameter (regardless of inlining)
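That ordering argument reduces to a minimal pattern like the sketch below. Here consume() is a stand-in for Message_as_root() and friends (not a real FlatCC function), and the GCC-style noinline attribute keeps the call opaque to the optimizer.

```c
#include <stdint.h>

/* Stand-in for an out-of-line accessor; noinline (GCC/clang) keeps
 * the optimizer from seeing through the call. */
__attribute__((noinline))
static uint8_t consume(const uint8_t *buf)
{
    return buf[0];
}

/* The initializer of msg is sequenced before the call, so consume()
 * must observe the value 4, never stale stack contents. */
static uint8_t run_pattern(void)
{
    uint8_t msg[8] = {4, 0, 0, 0, 230, 255, 255, 255};
    return consume(msg);
}
```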
Update: Tested on an M7 Target. The issue is gone now and didn't observe any regressions :) -- Thanks a lot!
Thanks for testing. I will merge this.
I like that you added customizability of mem_copy_word
It is a core feature of flatcc to isolate platform dependencies in the portable library, and to run well on many targets, so it is a natural extension of other similar code, like endian conversion and aligned allocation.
Something that I can't seem to ignore is that we are first initializing the buffer and then passing it into the as_Root() function and then as_Data(). So regardless of any aliasing issues inside as_Root() or as_Data(), it should at the very least finish initialization before doing the calls. Wouldn't you agree?
I agree. If you read an earlier comment of mine, I write that I cannot see any scenario where the compiler should dereference uninitialized stack content it is responsible for.
Having written some compiler backend logic, I can speculate that the compiler injects its own initialization into the execution graph, then does liveness checking (dependency tracking), then culls any dependency links that violate strict aliasing, then finds two parallel root execution paths that are equally good: read from stack, or initialize stack.
However, this has less to do with optimization of the final output and more to do with how the compiler goes around achieving this.
Merged. I updated notes in CHANGELOG 0.6.2, README cross compilation section.
For the record, the following two commits on master implemented the fix:
pmemaccess.h: c994c4974148cfa7e6b179216ad2da1b01004028
flatcc_accessors.h: cc6e6e5663b65d11aedd54180255df241ba8fd6c
The old Intel ICC compiler, deprecated by Intel, stopped working after these changes unless -O3 is dropped to -O2. We are not sure why, and might just remove it from FlatCC weekly builds, but for the record, the only other change between the successful and broken weekly builds is an update to pstdbool.h: 81d4ba67b66266dc6a4ce761a6091a613ed46a49
ICC build fixed in commit: 7625a37eccc7640afdbde951d579ca5834eadbdb It just defaults to not using pointer casts for that compiler; it will use __builtin_memcpy. It was the same issue as this issue, but since we by default choose non-strict pointer casts on x86/64, and ICC is sensitive to that on -O3, it triggered a segfault. This could just as well have happened earlier on, but was triggered, randomly, by the very changes that were supposed to fix it.
A related commit just cleans up bogus ICC warnings: 56d2559caee08c52a7c6e097ea38a7f4b7d40c4a
Added documentation section: https://github.com/dvidelabs/flatcc?tab=readme-ov-file#strict-aliasing
bug_for_flatcc.zip
Hi everyone! This is my first bug report contribution into the FlatCC project. Thanks for making such an awesome project! Here are the details:
Files attached
Explanation of the source code provided in tempfiles.c
The code in question is in the function interface_encoding_populate_general_rsp(). It writes a message determined at compile time into an array on the stack, and then finishes the message and copies it into a buffer given by the caller.
If one steps through the code, essentially the following happens:
From the initialization of msg_array we can see everything should be initialized correctly, then Response_as_root_v2() and Response_data_v2() will work exactly as described above.
Actual Issue
The compiler smartly determines that it needs to look at msg_array[4] and check whether it is less than 8. However, it reads msg_array[4] into R5 while the stack still holds garbage. It then initializes msg_array on the stack and compares R5 with 7 (branching to the assert if less-or-equal). The problem is that by the time the comparison is done, the array is only partly initialized and R5 holds garbage read from the stack before the array was initialized:
a: f8bd 5004 ldrh.w r5, [sp, #4]
e: cc0f ldmia r4!, {r0, r1, r2, r3}
10: e8ac 000f stmia.w ip!, {r0, r1, r2, r3}
14: cc0f ldmia r4!, {r0, r1, r2, r3}
16: 2d07 cmp r5, #7
Schema
The relevant portion looks like this: a table that requires a union inside, and the union holds one of many tables.

table MyTableResponse {
  // Some integers
}

union ResponseData {
  // A bunch of tables
  /// Generic response (success w/ no payload or error).
  MyTable: MyTableResponse,
}

table Response {
  data: ResponseData (required);
}
Background
I found this answer by Aardappel, who seems to be a user on Stack Overflow who is highly involved with the project: https://stackoverflow.com/questions/24330925/does-flatbuffers-avoid-strict-aliasing-somehow
Workarounds