Open michaelkirk opened 9 months ago
85%! Ouch... :S I can definitely admit that the properties encoding deserved a bit more thought. I made it quickly after discovering that trying to encode it into a generic FlatBuffers schema was very wasteful of space.
But yeah, a breaking change isn't likely to happen anytime soon, if ever. Might as well make a new format entirely, perhaps with a custom binary encoding. I've been thinking lately, partly from the discussion at https://github.com/flatgeobuf/flatgeobuf/discussions/291, that the primary function of FlatBuffers (and protobuf) is to allow for evolving schemas, but as I see it now that's not an important feature: once a format becomes stable and more or less widespread there is no room for evolution, even backwards-compatible evolution.
That said, a lot of short string columns perhaps isn't the most clever data representation.
Currently column idx are u16 and field lengths (for Strings, Binary, etc.) are u32. I expect in practice that column indexes would almost always fit in a 1 byte varint and field lengths typically in 3 bytes (if not 2).
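To make the byte counts above concrete, here is a minimal sketch of an unsigned varint, assuming the common LEB128-style scheme (7 payload bits per byte, high bit set when more bytes follow), as used by protobuf. The function names are hypothetical and the linked prototype may encode things differently.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an LEB128-style varint (assumed scheme)."""
    if value < 0:
        raise ValueError("this sketch only handles unsigned values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: high bit clear
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Column indexes below 128 take 1 byte instead of a fixed 2 (u16).
print(len(encode_varint(127)))        # 1
# Lengths up to 16383 take 2 bytes, and up to 2**21 - 1 take 3, instead of a fixed 4 (u32).
print(len(encode_varint(16383)))      # 2
print(len(encode_varint(2**21 - 1)))  # 3
```

Under this scheme a file would need 128 or more columns before a column index costs a second byte, and a single string or binary value would need to reach 2 MiB before its length costs a fourth.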
The properties data is already not random access; it must be processed serially. So there's no loss of functionality there.
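As a rough illustration of why serial processing is unaffected, the sketch below writes a string-only properties buffer two ways and reads the varint version back in one forward pass. It assumes little-endian fixed-width fields (u16 column index, u32 length) for the current layout and an LEB128-style varint for the proposed one; the helper names and sample values are made up for illustration, and only string columns are handled.

```python
import struct


def write_varint(out: bytearray, value: int) -> None:
    # LEB128-style: 7 bits per byte, high bit set while more bytes follow.
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_fixed(props: list[tuple[int, str]]) -> bytes:
    # Assumed current layout: u16 column index, u32 byte length, then UTF-8 data.
    out = bytearray()
    for col, text in props:
        data = text.encode("utf-8")
        out += struct.pack("<HI", col, len(data))
        out += data
    return bytes(out)


def encode_varints(props: list[tuple[int, str]]) -> bytes:
    # Proposed layout: varint column index, varint byte length, then UTF-8 data.
    out = bytearray()
    for col, text in props:
        data = text.encode("utf-8")
        write_varint(out, col)
        write_varint(out, len(data))
        out += data
    return bytes(out)


def decode_varints(buf: bytes) -> list[tuple[int, str]]:
    # One forward pass, exactly like the fixed-width reader would do.
    pos, props = 0, []
    while pos < len(buf):
        col, pos = read_varint(buf, pos)
        length, pos = read_varint(buf, pos)
        props.append((col, buf[pos:pos + length].decode("utf-8")))
        pos += length
    return props


# Made-up address-like record with four short string columns.
sample = [(0, "123"), (1, "Main St"), (2, "Springfield"), (3, "97477")]
print(len(encode_fixed(sample)))    # 50 bytes: 4 * 6 header bytes + 26 data bytes
print(len(encode_varints(sample)))  # 34 bytes: 4 * 2 header bytes + 26 data bytes
```

The per-property header drops from 6 bytes to 2 here, which matters most when the values themselves are only a few bytes long. How much a whole file shrinks also depends on geometry and index sizes, which this sketch ignores.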
This would be a major breaking change, so I don't expect it to be adopted anytime soon, but if you end up making a breaking format release in #81, you should consider piling this on.
I made a prototype here: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/varint
I was working with OpenAddresses data, which is a lot of point geometries with short string columns. Using varints for column indexes and field lengths produces a file 85% the size of the original.