dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0
631 stars 180 forks source link

Table above 64 KB fails, was Mismatched in create_cached_vtable() #235

Closed michbens closed 8 months ago

michbens commented 2 years ago

Hello

I have a quit complex shema like this:

namespace StCfg;

struct FilterConfig
{
    field1:uint32;
    field2:uint32;
    ...........
    field54:uint32;
    ...........
    field75:uint32;
}

struct FilterTable
{
    filterConfig:[FilterConfig:10];
}

struct StaticCfg
{
    filterTable:FilterTable;
    staticCfgField2:StaticCfgField2_Struct;
    staticCfgField3:uint32;
}

-----------------------

namespace Core;

table CoreCfg
{
    coreCfgField1:CoreCfgField1_Sruct(id:0);
    coreCfgfield2:CoreCfgField2_Sruct(id:1);
    coreCfgfield3:CoreCfgField3_Sruct(id:2);
    staticCfg:StCfg.StaticCfg(id:3);
    coreCfgfield5:CoreCfgField5_Sruct(id:4);
    coreCfgfield6:CoreCfgField6_Table(id:5);
}

root_type CoreCfg;

-----------------------

namespace Cfg;

union CfgPayload {Core.CoreCfg}

table Configuration
{
    payload:CfgPayload;
}

root_type Configuration;

----------------------

namespace Msg;

union HriPayload {Cfg.Configuration, Ctrl.Control, Dbg.Debug, Response}

table Message
{
    payload:HriPayload;
}

root_type Message;

To build Message as root I create structs and tables, including CoreCfg table like this:

Core_CoreCfg_ref_t core_cfg = Core_CoreCfg_create(B, &coreCfgField1, // struct
                                                     &coreCfgField2, // struct
                                                     &coreCfgField3, // struct
                                                     &staticCfg,
                                                     &coreCfgField5, // struct
                                                     coreCfgField5 ); // table

Cfg_CfgPayload_union_ref_t cmd = Cfg_CfgPayload_as_CoreCfg(core_cfg);

Cfg_Configuration_ref_t cfg_msg = Cfg_Configuration_create(B, cmd);

Msg_HriPayload_union_ref_t msg_payload = Msg_HriPayload_as_Configuration(cfg_msg);

Msg_Message_create_as_root(B, msg_payload);

When calling Cfg_Configuration_create() it appears that field54 in struct FilterConfig is corrupt.

Looking inside I found that it occurs in flatcc_builder_create_cached_vtable() Calling stack: Cfg_Configuration_create() -> Cfg_Configuration_end() -> flatcc_builder_end_table() -> flatcc_builder_create_cached_vtable()

In flatcc_builder_create_cachedvtable() there is a memcpy(vt, vt, vtsize); and vt points in the middle of struct FilterConfig (field54)

It looks like a bug but I may do something wrong.

Thanks

mikkelfj commented 2 years ago

Well, that doesn't sound quite right, but for a start, why do you have id:0 on multiple table table fields? And why do you use id's at all? And why does flatcc not raise an error when id is reused?

As to pointing into the middle of something: that may be a problem but it does not have to be: flatcc uses a temporary stack (several actually), and when data have been completed they are written out to emitter object that builds up the linear buffer data. Therefore a temporary pointer to some object might go out of scope and some other data may be placed on the same address on the stack.

However, I'm not sure that vtables are placed on the same stack as tempory table fields so it could be a problem. Also, maybe you mean that vt_ is pointing into user memory to the source of the original data?

mikkelfj commented 2 years ago

If there is a bug, ignoring the id:0 issue, it is possibly triggered by unusually large amounts of struct data that could trigger a stack reallocation that isn't handled correctly in that situation. In general reallocation should not be an issue, but there have been bugs there before.

mikkelfj commented 2 years ago

Table data is limited to a little under 64 KB because vtable offsets are 16 bits unsigned. I don't think you are hitting that limit, but as far as I can tell, you haven't provided all details of the schema, so you might be hitting that issue. You should get an error then, but who knows? Not likely the issue here, but worth mentioning.

michbens commented 2 years ago

There is no id:0 issue. Sorry it's my mistake (copy paste). I corrected it. Anyway, removing the ids doesn't help.

After calling Msg_Message_create_as_root(B, msg_payload), I call flatcc_builder_finalize_buffer(B, sizeptr) and the return buffer is my flatbuffer binary. When I say that vt points in the middle of struct FilterConfig, I mean in this binary buffer (sorry if I wasn't clear). The data after field54 (field55, field56, ...) is not corrupted (vt_size = 8).

The final binary buffer has a size of 96,278 Bytes, a little bit more than 94KB You are right, I haven't provided all the schema details because it is too big. I have several struct of structs and vectors (including vectors of structs)

mikkelfj commented 2 years ago

OK, I suspected id:0 wasn't the issue. It appears that you only really store any significant amount of data in the single Configuration table. If that is the case, you are above 64K for a single table since the rest of the overhead would be much smaller than 30K up to the 94 KB you mention. However, as you say, you did not provide all information, so maybe data is hiding elsewhere.

Either way, I suggest you check the return values of all operations to see if any of them fail. There are also options to compile the builder with some debug assertions, look at the source.

Once return errors have been ruled out, it may be necessary to debug. For that I would need a fully reproducible project. It can happen over a private channel if you prefer.

mikkelfj commented 2 years ago

One potential cause of the error is that large structs (or really any large table field) might not be written correctly to the ring buffer in the emitter. vtables are placed at the end of flatcc binary buffer, but during construction that is in the middle of the first page of buffer. vtables are appended, and other data is prepended, and the ring buffer can grow in both directions. If the struct data is somehow allowed to expand into the vtable area of the first page, that could explain what you are seeing. The ring buffer is a doubly linked list with about 2 KB pages by default, AFAIR. You could debug the emit calls to see if that happens.

mikkelfj commented 2 years ago

https://github.com/dvidelabs/flatcc/blob/master/src/runtime/emitter.c

This is the default implementation, you can also add your own, here one that only prints information and stores nothing: https://github.com/dvidelabs/flatcc/blob/master/test/emit_test/emit_test.c

mikkelfj commented 2 years ago

To further understand this, the emitter uses address 0 to refer to the middle of the first page. Ordinary ref_t values are negative values relative to this location. If the offset is wrong, data might land in the wrong place. vtables are written with positive offsets. However, a vtable with address 1 really means address 0, but is off by 1 because 0 indicates error in the API.

michbens commented 2 years ago

"It appears that you only really store any significant amount of data in the single Configuration table." Yes that is the case. The return values are OK.

" It can happen over a private channel if you prefer" Maybe it's better

mikkelfj commented 2 years ago

You have my email in my profile.

mikkelfj commented 2 years ago

Anyway, if you are writing that much data, that would be your problem. So yes, maybe there should have been an error check for that, but it would not make your problem go away.

Also, it may be better to use table fields for configuration data, especially if you generally only use a subset of those fields (I do that in some project). The vtables will be larger, but you will store much less data because unused fields do not take up space, unlike structs. Your are ultimately still limited by the table size if you use all data, so you may have to split up the data regardless of what you do.

mikkelfj commented 2 years ago

Alternatively, you can store data vectors of structs. These are limited to 2 GB. This is different from fixed length arrays used inside structs.

michbens commented 2 years ago

"so you may have to split up the data regardless of what you do" You mean by sending several packets containing a part of the configuration instead of sending a big one including all the configuration ?

mikkelfj commented 2 years ago

No, I mean using multiple tables in the same buffer.

The table_add method calls push_ds_field which computes the size of the table:

https://github.com/dvidelabs/flatcc/blob/5ff2e604a3e5a44d3885ff6887890c689b0a6b59/src/runtime/builder.c#L1811

https://github.com/dvidelabs/flatcc/blob/5ff2e604a3e5a44d3885ff6887890c689b0a6b59/src/runtime/builder.c#L267

You might be able to instrument this to see if you hit the ds_limit. It should return a null ref if you are over the limit:

if ((B->ds_offset = offset + size) >= B->ds_limit) {
        if (reserve_ds(B, B->ds_offset + 1, table_limit)) {
            return 0;
        }
    }

That null is forwarded by table_add, and hopefully all the way out to end user calls?

mikkelfj commented 2 years ago

To be clear on splitting: A table stores a struct in its entirety within its own memory block (limited to 64 KB). A table stores child tables, vectors, strings, and unions as offsets that consum only 4 bytes within its own block. A vector can be as large as the whole buffer minus overhead and it can hold structs inline in its own memory block, or it can store 4 byte offsets to tables, strings, and unions. So there is plenty of ways to reduce memory usage of a single table.

michbens commented 2 years ago

Actually I need to send the all configuration (the all 96K) so it won't help to split right ?

mikkelfj commented 2 years ago

It will, but it may not help to use table fields over struct fields. If you store the struct in a vector, you force it to be allocated outside of the table. You can also place the struct in a union, it will also force allocation outside of the table.

mikkelfj commented 2 years ago

However, I don't think it is a good idea to have such large structs: they must also be valid in all C compilers, and they could easily have a limit of, say 128 KB that you might hit sooner that you would like.

https://stackoverflow.com/questions/7724790/are-there-any-size-limitations-for-c-structures#7724909

michbens commented 2 years ago

Do we have the same problem with flatc ?

mikkelfj commented 2 years ago

flatc also represents structs as native C++ structs as far as I recall.

But it seems to me that a vector of smaller structs would be much better way to deal with your problem. The memory layout can even be the same since vectors of scalars and structs are largely C arrays. The only exception is that you need to use the generated API accessors if you are on big endian systems, but that also applies in your current setup.

Regardless, I suggest

1) confirm that this is actually the cause of your problem, e.g. by setting a breakpoint in push_ds_field. 2) even if not, the current design is probably not a good idea, as discussed.

If you cannot confirm the problem, we may have to consider debugging flatcc. If you can confirm the problem in push_ds_field, we still need to see if the error is actually returned to the user api. Please remember that ref_t return codes indicate error with 0 instead of a valid reference, but integer return codes use 0 for success (fragile, but as per Unix convention).

michbens commented 2 years ago

This is what I get by adding prints in push_ds_field:

push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 544, uint16_t align => 4, voffset_t id => 0)
        offset = 0
        return 00007FF6A2399620
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 24, uint16_t align => 4, voffset_t id => 1)
        offset = 544
        return 00007FF6A2399840
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 2072, uint16_t align => 4, voffset_t id => 2)
        offset = 568
        return 00007FF6A2399C68
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 2072, uint16_t align => 4, voffset_t id => 3)
        offset = 2640
        return 00007FF6A239A480
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 4, uint16_t align => 4, voffset_t id => 4)
        offset = 4712
        return 00007FF6A239AC98
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 7860, uint16_t align => 4, voffset_t id => 1)
        offset = 4
        return 00007FF6A2399A34
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 7860, uint16_t align => 4, voffset_t id => 2)
        offset = 7864
        return 00007FF6A239FD60
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 31760, uint16_t align => 4, voffset_t id => 3)
        offset = 15724
        return 00007FF6A23A1C14
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 18528, uint16_t align => 4, voffset_t id => 4)
        offset = 47484
        return 00007FF6A23B9834
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 18528, uint16_t align => 4, voffset_t id => 5)
        offset = 66012
        return 00007FF6A23BE094
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 456, uint16_t align => 8, voffset_t id => 0)
        offset = 0
        return 00007FF6A23ADEB8
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 752, uint16_t align => 8, voffset_t id => 1)
        offset = 456
        return 00007FF6A23AE080
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 1776, uint16_t align => 8, voffset_t id => 4)
        offset = 1208
        return 00007FF6A23AE370
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 304, uint16_t align => 4, voffset_t id => 2)
        offset = 2984
        return 00007FF6A23AEA60
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 3632, uint16_t align => 4, voffset_t id => 3)
        offset = 3288
        return 00007FF6A23AEB90
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 1, uint16_t align => 1, voffset_t id => 0)
        offset = 4
        return 00007FF6A23ADEBC
push_ds_field(flatcc_builder_t *B => 000000D5F66FF340, uoffset_t size => 1, uint16_t align => 1, voffset_t id => 0)
        offset = 4
        return 00007FF6A23ADEBC

It seems there is no error

michbens commented 2 years ago

I'm not sure this is a size issue. I checked the sizes and I get the following:

table CoreCfg
{
    coreCfgField1:CoreCfgField1_Sruct(id:0);       // ~ 1K
    coreCfgfield2:CoreCfgField2_Sruct(id:1);       // ~ 1K
    coreCfgfield3:CoreCfgField3_Sruct(id:2);       // ~ 1K
    staticCfg:StCfg.StaticCfg(id:3);               // ~ 4K
    coreCfgfield5:CoreCfgField5_Sruct(id:4);       // ~ 2K
    coreCfgfield6:CoreCfgField6_Table(id:5);       // ~ 92K
}

When CoreCfgField6_Table is a table containing 6 structs of the following sizes:

table CoreCfgField6_Table
{
    f1:S1;  // ~ 9K
    f2:S2;  // ~ 8K
    f3:S3;  // ~ 8K
    f4:S4;  // ~ 31K
    f5:S5;  // ~ 19K
    f6:S6;  // ~ 19K
}

So there is no structs more than 64K

mikkelfj commented 2 years ago

Thanks for data.

The problem is that the sum of the individual fields reach more than 64 KB. In your example f6 cannot be reached if it is stored last in the table.

The main problem is that vtable offsets are 2 bytes so they cannot point beyond that. But in principle the end of the last field could go beyond that. However, the table size is also stored in 2 bytes so at the very least the buffer should fail verification. Therefore the builder should also reject such large tables.

I am not sure why you do not get an error, but this area probably hasn't be used or tested much since you are not allowed to do it anyway.

My next suggestion:

Reduce the size of the table gradually until you get below 64 KB just for testing. Run the verifier on buffer each time you reduce, and see if you can confirm success once you below the limit.

If you can show the buffer is valid below 64 KB and if you also confirm that none of the data is corrupted (the verifier does not check everything), then:

1) you know where the problem with corruption is. 2) we need to figure out why flatcc does not produce an error.

We already know you are doing something wrong right now, but that does not rule out a bug in flatcc also.

I think the data you posted in the previous mail is interesting. Please keep the code for that project as is, even if you make changes. We might not to add more debugging to see if and where error reporting fails.

mikkelfj commented 2 years ago

To be clear:

CoreCfg has 1 field of size 92 KB, but that is a child table so it does not count. The problem is CoreCfgField6_Table because all the fields are structs, and they are all stored in the same place that must be less than 64 KB in total.

Also, the problem with maximum struct size is more C compiler issue than a FlatBuffer issue. In principle very large structs could be stored in vectors even if they do not fit in tables.

mikkelfj commented 2 years ago

I think I know why the error reporting fails:

The check in push_ds_table is only to see if the internal builder stack has enough space. If not, it allocates more memory. If that fails, it returns 0.

This is fine, but it doesn't check the table size, that is not the purpose of that check. Now, flatcc builder could add check for that, but I think it would just make the builder slower without any real value. Even if you violate the limit, you rarely check return codes because the builder only fails if it runs out of memory, or if it is used incorrectly.

So I think an error would be appropriate, but not worth runtime nor development effort.

This leaves us with you confirming that there is no problem when you go below 64 KB in a table.

michbens commented 2 years ago

If I change all CoreCfgField6_Table fields to tables it will be OK ?

mikkelfj commented 2 years ago

If I change all CoreCfgField6_Table fields to tables it will be OK ?

Yes, that should be fine, at least with the sizes you currently have.

mikkelfj commented 8 months ago

This is a bit too involved for me to test, but I added a runtime check to fail and optionally assert if the table is too large for the FlatBuffers voffset_t data type. I think this was the assumed root cause of this issue. Fixed in f8c346f .