TWKB / Specification

An effort to come up with a compressed binary vector geometry specification

Grand Unification #12

Closed pramsey closed 9 years ago

pramsey commented 9 years ago

I have to admit, I find the whole "basic types" and "group types" thing galling and unnecessary. I see the proximate desire for them, to squeeze out unneeded metadata, but I'm sure it can be done without all these extra types. For example, imagine this MultiPoint:

metadata_header   byte
[size]            varint
type_and_dims     byte
[bounds]          bbox
[ids]             varint[]
coordinates       varint[]

See what I did there? By taking the per-point id and pulling it up front into an array, I get to have my cake and eat it too. Similar things work for multilinestrings...
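A minimal sketch of the "ids up front" idea, in Python. It covers only the `[ids]` and `coordinates` fields of the layout above. The function names are hypothetical, and it assumes ids and coordinate deltas are both written as zigzag-encoded base-128 varints and that 2D integer coordinates are stored as deltas from the previous point. How a reader learns the length of the id array is left aside here.

```python
def zigzag(n: int) -> int:
    """Map a signed integer to an unsigned one so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def varint(n: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_ids_and_points(points, ids=None):
    """Write the optional id array first, then one uninterrupted coordinate list."""
    out = bytearray()
    if ids is not None:
        if len(ids) != len(points):
            raise ValueError("need exactly one id per point")
        for i in ids:
            out += varint(zigzag(i))
    prev_x, prev_y = 0, 0
    for x, y in points:
        # Each point is stored as a delta from the previous one.
        out += varint(zigzag(x - prev_x))
        out += varint(zigzag(y - prev_y))
        prev_x, prev_y = x, y
    return bytes(out)
```

With ids omitted, the output is just the coordinate bytes, so the same code path serves both the id and no-id cases.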

Since you're already committing to deserializing every coordinate in the multilinestring, by virtue of having only one absolute coordinate in the whole object, you can pack all the structure information up front, and leave the coordinate list "pure", like this:

metadata_header   byte
[size]            varint
type_and_dims     byte
[bounds]          bbox
[ids]             varint[]
ngeoms            varint
npoints           varint[]
coordinates       varint[]
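As a rough illustration of why the coordinate list can stay "pure", here is a Python sketch that rebuilds the linestrings from the `npoints` array and one flat list of deltas. It assumes the varints have already been decoded to signed integers and that the deltas run continuously across linestring boundaries; `split_lines` is a hypothetical name.

```python
from itertools import accumulate


def split_lines(npoints, deltas, ndims=2):
    """Rebuild linestrings from per-line point counts and one flat delta list."""
    if sum(npoints) * ndims != len(deltas):
        raise ValueError("npoints does not match the coordinate list length")
    # A running sum per dimension turns deltas into absolute coordinates.
    columns = [list(accumulate(deltas[d::ndims])) for d in range(ndims)]
    points = list(zip(*columns))
    lines, start = [], 0
    for n in npoints:
        # The structure comes entirely from npoints; the coordinates carry none.
        lines.append(points[start:start + n])
        start += n
    return lines
```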

Going even further in this vein, you can get to

metadata_header   byte
[size]            varint
type_and_dims     byte
[bounds]          bbox
[ids]             varint[]
ngeoms            varint
npoints           varint[]
x                 varint[]
y                 varint[]
[z]               varint[]

Which, if nothing else, would have lovely compression characteristics.

But I'm getting off-topic. The point is if the ids get pulled up into an optional idlist, you can have multiple id objects that coexist with single id objects.

nicklasaven commented 9 years ago

Interesting

I see the point about "pure" coordinate list and an n-points array.

But I am not convinced that it works. When reading those types, the client always needs to know how long a list is before it starts reading it, and that includes the id list. But maybe that is just a matter of putting ngeoms in front of the id list.

But if single-id objects coexist with multiple-id objects, how will the client know how to combine geometries and ids? If there are 3 ids and 5 linestrings, something has to define which id belongs to which linestring(s).

Am I missing something? Could you exemplify?

The last example, separating the dimensions, is also interesting. But why would it give better compression? I can see that the code would be cleaner in both back end and client, because it only has to keep track of one absolute value at a time. Now it keeps an array with one value per dimension.

One thing that could count against this is the handling of the coordinates at both the back end and the client. For huge geometries (such as a bigger data set aggregated to avoid repeating absolute coordinates between geometries), it will take some effort to rearrange the coordinates from x,y,z,x,y,z

to

x,x,y,y,z,z
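The rearrangement in question is a transpose between the interleaved and the per-dimension layouts. A small Python sketch, with hypothetical function names, operating on already-decoded values:

```python
def to_columns(interleaved, ndims):
    """Turn x,y,z,x,y,z,... into one list per dimension."""
    if len(interleaved) % ndims:
        raise ValueError("coordinate list length is not a multiple of ndims")
    return [interleaved[d::ndims] for d in range(ndims)]


def to_interleaved(columns):
    """Turn per-dimension lists back into x,y,z,x,y,z,..."""
    return [value for point in zip(*columns) for value in point]
```

Each direction is a single pass over the data, but it does need a second buffer the size of the coordinate list, which is the cost being weighed here for very large geometries.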

pramsey commented 9 years ago

Yes, you're right, ngeoms has to come before the id list:

metadata_header   byte
[size]            varint
type_and_dims     byte
[bounds]          bbox
ngeoms            varint
[ids]             varint[]
npoints           varint[]
x                 varint[]
y                 varint[]
[z]               varint[]
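With ngeoms ahead of the id list, a reader knows every array length before it reads the array. A Python sketch of that read order follows. The function names are hypothetical, the `has_ids` flag is assumed to come from metadata_header (its bit layout is not given here), and ids are assumed to be zigzag-encoded varints.

```python
def read_varint(buf, pos):
    """Read one unsigned base-128 varint; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def unzigzag(n):
    """Undo zigzag encoding to recover a signed integer."""
    return (n >> 1) ^ -(n & 1)


def read_counts_and_ids(buf, pos, has_ids):
    """Read ngeoms, then the optional ids, then npoints, in that order."""
    ngeoms, pos = read_varint(buf, pos)
    ids = None
    if has_ids:
        ids = []
        for _ in range(ngeoms):  # length is known because ngeoms came first
            raw, pos = read_varint(buf, pos)
            ids.append(unzigzag(raw))
    npoints = []
    for _ in range(ngeoms):
        n, pos = read_varint(buf, pos)
        npoints.append(n)
    return ngeoms, ids, npoints, pos
```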

As I see it, you have two use cases: the "copy WKB" case and the "grouping" case.

Let me try to write out a generic geometry collection that handles both cases again

metadata_header   byte
[size]            varint
type_and_dims     byte
[bounds]          bbox
ngeoms            varint
[ids]             varint[]
geoms             geom[]

For the "copy WKB" case, the ids are optional, so they are left out. For the "grouping" case, the ids come along for the ride. I'm not sure I see a use case where you can get a group that itself has an id, and all the components also have ids. What function would generate that kind of thing? An aggregation would bundle up a bunch of rows (that have ids) into a group that itself would not have an id (because it's brand new, it's synthetic).
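One way to picture the two cases is an in-memory collection whose id list is optional. This is only a sketch: the `Collection` class is hypothetical, and the "exactly one id per geometry" rule is a reading of the layout above (ngeoms followed by ids), not something the layout states outright.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Collection:
    geoms: Sequence[object]
    ids: Optional[Sequence[int]] = None  # None models the "copy WKB" case

    def __post_init__(self):
        # Assumed rule: in the "grouping" case there is one id per geometry.
        if self.ids is not None and len(self.ids) != len(self.geoms):
            raise ValueError("need exactly one id per geometry")

    def items(self):
        """Pair each geometry with its id, or with None when ids are absent."""
        ids = self.ids if self.ids is not None else [None] * len(self.geoms)
        return list(zip(ids, self.geoms))
```

Under that rule the "3 ids and 5 linestrings" situation cannot arise: the pairing is positional, and a mismatch is rejected.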

nicklasaven commented 9 years ago

Now I think I follow you. So what has changed is that the ids are in front of the geometries. Yes, why not? If nothing else, it would make it much faster to scan for whether an id is present.

About a group of geometries with individual ids and a top-level id: I have thought about it. The use case would be to aggregate to get some sort of tiles, each with a tile ID. The function would then be designed to pick the top-level id from one of the "group by" fields. I don't know how to control that, but that would be the logic, I guess.

I have no problem dropping that idea.

pramsey commented 9 years ago

Once you accept the ideas of [ids] being before geometries, we move on to the next level which is that it's possible to do away with the special "group" type altogether. Since now a multi-point-with-ids looks just like type 20, the "homogeneous group", right?

pramsey commented 9 years ago

And we end up with two aggregation signatures: collect_twkb(geom, id) and collect_twkb(geom) (ignore for a moment that I'm making up the SQL function names and not looking at what you already named them)

nicklasaven commented 9 years ago

Yep, I follow

On Tue, 2015-04-28 at 13:10 -0700, Paul Ramsey wrote:

Once you accept the ideas of [ids] being before geometries, we move on to the next level which is that it's possible to do away with the special "group" type altogether. Since now a multi-point-with-ids looks just like type 20, the "homogeneous group", right?


pramsey commented 9 years ago

We did this