dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0
632 stars 180 forks source link

Actually support non-standard offset sizes (64-bit, 16-bit) #206

Open TYoungSL opened 2 years ago

TYoungSL commented 2 years ago

Makes FLATCC_OFFSET_SIZE configurable and makes generated code export the compiler's configuration.

It also puts forward some work to make FLATCC_VOFFSET_SIZE configurable, but does not provide a facility to configure it directly through cmake.

mikkelfj commented 2 years ago

I have to look more into this, but please consider that the original design intended for some of these files to be entirely replaced for different size configurations.

mikkelfj commented 2 years ago

Just a heads up: This change, if approved, is too big to merged onto master before a new release. A release is already over due.

TYoungSL commented 2 years ago

It got bigger with the requirement of not using strncpy. :)

Probably for the best, this eliminates any chance of buffer overflows and improves portability.

Minor possible errata in the original parser code; When strictly parsing the file identifier, the interpreted length may not be correct when escape sequences are used (e.g. to add nulls).

Added an option FLATBUFFERS_STRICT_IDENTIFIER_SIZE to use the original code (without fixing the errata) and FLATBUFFERS_RELAXED_IDENTIFIER_SIZE which implies truncation and padding to fit.

TYoungSL commented 2 years ago

FYI nsc is intended to be the namespace that changes when using different derived flatbuffer formats such that original formats can be still be accessed in the same user code.

Yeah, I think I'm using it correctly - you may have to replace the comment after this push (it was a force push, sorry).

Wait for the tests to pass though.

Are you saying I should append the offset size (_16/_64) or something to the namespace?

mikkelfj commented 2 years ago

Are you saying I should append the offset size (_16/_64) or something to the namespace? something like that, or small or wide or large, haven't given this a lot of though.

Note that this is probably a much larger PR than you anticipate. There may be hardcoded uses of "flatbuffer_" prefixes that may need to change. It has to be analyzed how compilation libraries can or cannot coexist with multiple formats. Changing identifier width also spills over to the concept of type hashes - should they be 64-bit in some formats? Should the identifier width change just because the uoffset type does? And if not, where are the hardcoded assumptions to the contrary. etc. etc.

I think the best option is to merge this to a separate branch and mature up from there over time, if you are up for it.

TYoungSL commented 2 years ago

We're currently using it to compare against BigBuffers (a fork of FlatBuffers that is specifically 64-bit) and have had good success with it. Much cleaner than flatc. :)

Separate branch is fine, StirlingLabs will maintain a separate fork and hope to contribute it upstream to master.

The intention is that offset sizes are de facto influences of other sizes save for voffsets, but end users should be allowed to specify any size.

Binary compatibility is dependent on the configuration. Non-standard values do not have any expectation of compatibility outside of that configuration - but it's generally justified by some requirement. (e.g. larger offset address space)

something like that, or small or wide or large, haven't given this a lot of though.

Would probably need to append it to the filename too. Maybe allow the user to specify a suffix (and/or prefix) to the namespace or require one when it's non-default?

mikkelfj commented 2 years ago

On a related note: eventually I'd like to see support for StreamBuffers that build buffers front to back so data can be flushed to disk or set over network without reassembly. There are the same namespace concerns here, although the size will not necessarily change.

mikkelfj commented 2 years ago

Would probably need to append it to the filename too. Maybe allow the user to specify a suffix (and/or prefix) to the namespace or require one when it's non-default?

I'm not really sure - it could also be handled using separate output directories - but since include guards would have to be unique, it kind of suggests that the filenames should be too.

mikkelfj commented 2 years ago

Added an option FLATBUFFERS_STRICT_IDENTIFIER_SIZE to use the original code (without fixing the errata) and FLATBUFFERS_RELAXED_IDENTIFIER_SIZE which implies truncation and padding to fit.

It seems like an easy way to get conflicts - couldn't this be handled with one setting?

Are you sure this is needed. Flatcc already has several ways to specify file identifiers, the typehash version in effect deals with any kind of binary identifier even if not actually a hash.

Minor possible errata in the original parser code; When strictly parsing the file identifier, the interpreted length may not be correct when escape sequences are used (e.g. to add nulls).

As to errata: the file identifier code is surprisingly difficult to get right. But, I think the intention was for consuming human strings up to a maximum length, not to handle binary identifiers which AFAIR can be specified differently.

TYoungSL commented 2 years ago

It seems like an easy way to get conflicts - couldn't this be handled with one setting?

Are you sure this is needed. Flatcc already has several ways to specify file identifiers, the typehash version in effect deals with any kind of binary identifier even if not actually a hash.

The settings would be using cmake_dependent_option (but I suppose the project uses an older cmake?) to provide a mutually exclusive pair, it's a pretty common convention to have an explicit restricting option when one has an implied scenario. Do you just mean it's confusing?

As to errata: the file identifier code is surprisingly difficult to get right. But, I think the intention was for consuming human strings up to a maximum length, not to handle binary identifiers which AFAIR can be specified differently.

That was the original idea behind using strncpy, but it's still possible to specify e.g. "\0\0" or "\r\n" and get a 2 byte string out of 4 characters, or "\x00" to get a 1 byte string. Anyone who does this is met with further problems so it's not plausible that someone would rely on this behavior, so it is errata.

mikkelfj commented 2 years ago

I'm not a CMake expert, it was just my intuitive impression. The CMake version needs to be old to be portable, but there is seperate work on upgrade CMake build and I think they bumped the required version. See unmerged PR pending next release.

As to file identifiers - I don't really recall the details, so I can only provide a general overview and I might not be right in all I say.

TYoungSL commented 2 years ago

By convention C++ comments are never used. Maybe some compilation flag?

Yep, I'll just remove those lines. (edit: Ended up keeping them, modified slightly.)

There's also line 66 in grisu3_parse.h, looks like cdump.h, elapsed.h and cmetrohash.h also have some...

I think all the compilers allow it.

mikkelfj commented 2 years ago

BTW: way back in version 0.4.0 a branch was made with big endian support. Here the 4-byte file identifier was reversed, so "MONS" became "SNOM". It lead to all sort of problems, as you can imagine.

Rjvs commented 2 years ago

Supporting multiple offsets within the same codebase is a feature that this PR opens the gateway toward... but the ability to specify 64-bit offsets (which is mentioned in the docs as possible) seems useful in it's own right. For anyone that just wants to be able to specify 16 or 64bit offsets, I think this PR does the job, without changing anything for normal users.

There are a number of reasons why a user might want to modify namespaces and this PR does add another one but to me it is a completely separate issue.

mikkelfj commented 2 years ago

@Rjvs

Supporting multiple offsets within the same codebase is a feature that this PR opens the gateway toward... but the ability to specify 64-bit offsets (which is mentioned in the docs as possible) seems useful in it's own right

Good point. It should be possible to just redefine the offset size and have it work. Different namespaces is the advanced version and perhaps it would be good to focus on the basics first. In either case, it won't go in before the next release.