flatgeobuf / flatgeobuf

A performant binary encoding for geographic data based on flatbuffers
https://flatgeobuf.org
BSD 2-Clause "Simplified" License

breaking: varint encoding for column index and field lengths #314

Open michaelkirk opened 9 months ago

michaelkirk commented 9 months ago

Currently column indexes are u16 and field lengths (for Strings, Binary, etc.) are u32. In practice, I expect column indexes would almost always fit in a 1-byte varint, and field lengths would typically fit in 3 bytes, if not 2.
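To make the byte counts concrete, here is a minimal sketch assuming an unsigned LEB128-style varint (7 payload bits per byte, with the high bit as a continuation flag), the scheme protobuf uses. The prototype may use a different varint flavor, and `encode_varint` is a hypothetical helper name, not part of any FlatGeobuf API.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint.

    Assumption: LEB128 layout (low 7 bits first, high bit = "more bytes follow").
    """
    if value < 0:
        raise ValueError("this sketch only handles unsigned values")
    out = bytearray()
    while True:
        byte = value & 0x7F           # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow: set continuation bit
        else:
            out.append(byte)          # last byte: continuation bit clear
            return bytes(out)


# Encoded sizes for a few representative values.
for n in (0, 127, 128, 16_383, 16_384, 2_097_151):
    print(n, len(encode_varint(n)))
```

Under this assumption, values below 128 take 1 byte, values below 16,384 take 2 bytes, and values below 2,097,152 take 3 bytes, compared with a fixed 2 bytes for a u16 and 4 bytes for a u32.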

The properties data is already not random access; it must be processed serially. So switching to varints loses no functionality.
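A rough sketch of what serial processing looks like in both layouts, under simplifying assumptions: every column is a string, fixed-width integers are little-endian, and varints are unsigned LEB128. All function names and the sample values are hypothetical; the real properties buffer also carries typed values whose types come from the header columns.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Unsigned LEB128 varint (assumed scheme, see lead-in)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_fixed(props: list[tuple[int, str]]) -> bytes:
    """Current-style layout: u16 column index, u32 length, then UTF-8 bytes."""
    out = bytearray()
    for col, text in props:
        data = text.encode("utf-8")
        out += struct.pack("<HI", col, len(data))
        out += data
    return bytes(out)


def write_varint(props: list[tuple[int, str]]) -> bytes:
    """Proposed-style layout: varint column index, varint length, UTF-8 bytes."""
    out = bytearray()
    for col, text in props:
        data = text.encode("utf-8")
        out += encode_varint(col) + encode_varint(len(data)) + data
    return bytes(out)


def read_varint(buf: bytes) -> list[tuple[int, str]]:
    """Walk the buffer front to back; each field's size is known only after
    its prefix is decoded, which is also true of the fixed-width layout."""
    pos = 0
    props = []
    while pos < len(buf):
        col, pos = decode_varint(buf, pos)
        length, pos = decode_varint(buf, pos)
        props.append((col, buf[pos:pos + length].decode("utf-8")))
        pos += length
    return props


# Made-up address-like record, for illustration only.
props = [(0, "123"), (1, "Main St"), (2, "Springfield")]
fixed = write_fixed(props)
varint = write_varint(props)
print(len(fixed), len(varint))
```

In both layouts a reader has to decode each prefix before it knows where the next field starts, so the varint version walks the buffer the same way and only shrinks the per-field overhead (6 bytes down to 2 for short strings in low-numbered columns).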

This would be a major breaking change, so I don't expect it to be adopted anytime soon, but if you end up making a breaking format release in #81, you should consider piling this on.

I made a prototype here: https://github.com/michaelkirk/flatgeobuf/tree/mkirk/varint

I was working with OpenAddresses data, which is a lot of point geometries with short string columns. Using varints for column indexes and field lengths produces a file 85% the size of the original.

bjornharrtell commented 9 months ago

85%! Ouch... :S I'll readily admit that the properties encoding deserved a bit more thought. I made it quickly after discovering that trying to encode it into a generic flatbuffers schema was very wasteful of space.

But yeah, a breaking change isn't likely to happen anytime soon, if ever. Might as well make a new format entirely, perhaps with a custom binary encoding. I've been thinking lately, partly from the discussion at https://github.com/flatgeobuf/flatgeobuf/discussions/291, that the primary function of Flatbuffers (and protobuf) is to allow for evolving schemas, but as I see it now that's not an important feature: once a format becomes stable and more or less widespread, there is no room for evolution, even the backwards-compatible kind.

bjornharrtell commented 9 months ago

That said, a lot of short string columns is perhaps not the most clever data representation.