SymbolixAU / flatgeobufr

6 stars 0 forks source link

What's the point? #1

Open SymbolixAU opened 4 years ago

SymbolixAU commented 4 years ago

Why does this library exist, what should it do?

tim-salabim commented 4 years ago

As I see it, a performant stand-alone (i.e. GDAL independent) driver for reading and especially writing flatgeobuf files from R makes sense because:

The V8 based approach gets real slow real quick when writing large data. So my hope is that via C++ we can speed things up drastically.

dcooley commented 4 years ago

You make a convincing argument.

In terms of the design, these are what I consider the core functionality

  1. export sf object to flatgeobuf for use in javascript libraries
  2. read flatgeobuf and covert to sf object
  3. read flatgeobuf and pipe directly to javascript libraries
  4. read geojson and export to flatgeobuf

I've ordered these by what I think would be the priority to implement, with 3 & 4 possibly redundant, but may be nice.

tim-salabim commented 4 years ago

That sounds like a good plan. What about geometry only data, i.e. sfc?

SymbolixAU commented 4 years ago

yes I image we'd support sfg, sfc and sf.

I think the first step is to understand how serialize() works, and see if we can adapt it to work on an sf object (at the moment I think it's set up for a mapbox::geometry::feature object).

Then we need to pass the result geobuffer from R to javascript - have you done this in your leaflet demo, or did you read the buffer directly into the javascript library?

tim-salabim commented 4 years ago

I read it directly. The way this is currently implemented is:

  1. copy fgb file to a tempfile (https://github.com/r-spatial/leafem/blob/master/R/file.R#L310-L314) - this is where the writing would happen if it was an sf object
  2. attach tempfile to htmlwidget (https://github.com/r-spatial/leafem/blob/master/R/file.R#L446-L455)
  3. on the JS side find the attachment (https://github.com/r-spatial/leafem/blob/master/inst/htmlwidgets/lib/FlatGeoBuf/fgb.js#L11)
  4. fetch and stream the result onto the map (https://github.com/r-spatial/leafem/blob/master/inst/htmlwidgets/lib/FlatGeoBuf/fgb.js#L77-L78)

So as it is currently implemented. I would write the whole file once and then stream it in chunks onto the map. I am not sure if it is possible to stream directly from R to JS...

tim-salabim commented 4 years ago

Maybe @bjornharrtell can provide some input as to how serialize works and what would be the best path?

bjornharrtell commented 4 years ago

Indeed serialize is implemented as a reference/example and using mapbox types for feature, geometry and geojson representations because that's what I found most available/accessible as generic implementations of such things in the C++ world. It is also not a streaming API like I have in the reference JavaScript implementation. I'm not good enough with C++ to determine what a good streaming API and implementation should look like (see https://github.com/bjornharrtell/flatgeobuf/issues/24).

Serialize simply takes a mapbox feature collection (a "table" of geometry and attribute data) and serializes into a flatgeobuf binary. This process involves writing a header, index and the features. The index should be optional but is not implemented yet (but is in the GDAL driver). The index makes it possible to do fast filtering on individual feature or spatial bounding box without reading the whole dataset but requires more time to write and cannot be serialized as a stream because it's a two pass process. The use case for having no index is when the the whole dataset is to be consumed at the client side without any filtering and can then be serialized as a stream.

It sounds like you really want to serialize from something else (I don't know what these sfg, sfc and sf are) and considering why you are doing this in the first place I would agree it's the way to go and might not even be more difficult to adapt the code to your types than trying to map your stuff to the mapbox types. If your types are general purpose in the R community and easily imported into my repository I could definitely consider adopting support for them in the reference C++ impl.

SymbolixAU commented 4 years ago

It sounds like you really want to serialize from something else (I don't know what these sfg, sfc and sf are)

Yes. And our sf structures are very closely aligned with mapbox::geometry objects, they too are C++ structures. So I think serialising these is entirely possible and the way to go.

than trying to map your stuff to the mapbox types.

Given everything in the R world is also in the C++ world, the mapping is straight forward - done here (but I have a suspicion it can be improved by re-assigning the container type, rather than re-populating a new container)


From my first reading of serialize, my understanding is the bulk of it is creating vectors of each 'column' of the tabular structure, both the geometry coordinates and associated meta (which is what I do here in sfheaders).

From then on comes the magic, which I need to read through and understand what it's doing to construct the buf object.


And for some context


And another aside, I'll ask the guys at deck.gl to consider supporting flatgeobuf directly too.

bjornharrtell commented 4 years ago

Reading geojson.h again makes me less than proud, I see it has several issues and shortcomings which has went unfixed because my attention went into the GDAL driver implementation. Some things to note:

  1. fbb.FinishSizePrefixed(feature) is part of flatbuffers API, see http://google.github.io/flatbuffers/classflatbuffers_1_1_flat_buffer_builder.html#a7ba8462e408431054c99d25120326220. Note that in FlatGeobuf, the header is an individual flatbuffer and each feature is an individual flatbuffer.
  2. I see that to get the final binary I concatenate vectors and copy into new buffer, so both inefficient and RAM hungry. (final copy done with memcpy at https://github.com/SymbolixAU/flatgeobufr/blob/df9a97a09288e1094199898ceea661f0f49c0b06/src/geojson.h#L201)
  3. The code creates a spatial tree index but it seems unused in the tests (and indeed I have no filtering support for deserialize) and it will be corrupt unless the input features are correctly sorted.

So, unfortunately I would say the C++ ref impl. is in pretty bad shape right now. I hope to be able to improve it in a not too far future.

bjornharrtell commented 4 years ago

With https://github.com/bjornharrtell/flatgeobuf/pull/27 merged the flatgeobuf C++ ref impl. is now hopefully in a better shape. Naive non-streaming serialize and deserialize are still available and with improved efficiency and correctness. The streaming variants works by the caller defining read/write lambda functions as arguments. Let me know what you think and if you have any questions.

SymbolixAU commented 4 years ago

Thanks @bjornharrtell . I'll have a look over the next few days to see how we can make use of it.

SymbolixAU commented 4 years ago

@tim-salabim (@robertleeplummerjr) I saw you ping me (as my alter-ego) on this thread, and the conversation is directed towards typed javascript arrays for Deck.gl.

This is why I haven't progressed this (flatgeobufr) idea much, because I'm leaning towards the view that we can avoid the sf -> geojson conversion, and go direct to a columnar (array) format.

Hence my development of sfheaders::sf_to_df(). This gives us all the data in columns, and every coordinate has an attribute associated with it (if you use fill = TRUE).

Then these columns can be packaged up into a buffer / protobuf and sent over as arrays, and unpacked as arrays directly, without any geojson conversion.

We can certainly use the ideas behind flatgeobuf to do the packing into a buffer.

bjornharrtell commented 4 years ago

Note that flatbuffers (on which flatgeobuf is based on) doesn't need encoding / decoding, it's a aligned memory structure that can be memory mapped and accessed directly. Less compact than protobuf but much faster.

tim-salabim commented 4 years ago

@SymbolixAU @bjornharrtell @robertleeplummerjr this is all beyond my skills, hence the ping to more skilled folks like you.

FWIW, I love the idea of packing things up into columnar data (R leaflet does this via d3.data.frames I think) and sending it to the JS methods. GeoJSON is a very restricted format in my opinion and does have much redundant information. I mean, we can make sure that all features are of the same type before sending data, so there's no need to store the type for each feature...

@robertleeplummerjr is accepting (what I think are) flat arrays for point features in Leaflet.Glify already, so maybe we could leverage a similar approach for polygons and lines too?

tim-salabim commented 4 years ago

To be clear @SymbolixAU , you first convert sf -> df with sfheaders and then parse to JSON to send to JS?

bjornharrtell commented 4 years ago

Not sure if it's already a given, but you can fetch FlatGeobuf in the (modern) browser as binary and use as is without any conversion whatsoever if you can use the memory model directly (fx. access the coordinates directly as typed array). The only reason I convert to GeoJSON in the reference implementation is because it's easy to consume in various clients. I also have a conversion directly to OpenLayers which is more efficient but still a conversion to adapt to the expected memory model of an OpenLayers Feature.

One way to do it fully async and streaming is using the Fetch API and read the body as [ReadableStream]() (will get the data as ArrayBuffer chunks).

SymbolixAU commented 4 years ago

you can fetch FlatGeobuf in the (modern) browser as binary and use as is without any conversion whatsoever if you can use the memory model directly (fx. access the coordinates directly as typed array)

yeah this is my plan.

The only reason I convert to GeoJSON in the reference implementation is because it's easy to consume in various clients

I suspected this is the case. Since Tim and I will have control of the R and JS code to both send and receive the data, we can be stricter about the format.


you first convert sf -> df with sfheaders and then parse to JSON to send to JS?

This is where I am going, yes. I've done this for POINT geometries, but the others still rely on sf -> geojson. And the "parse to JSON" step is only converting each column of a data.frame into an array.

kylebarron commented 4 years ago

And another aside, I'll ask the guys at deck.gl to consider supporting flatgeobuf directly too.

I created an issue in loaders.gl which is a project to load serialization formats into memory to be used with Deck.gl: https://github.com/uber-web/loaders.gl/issues/716.

As an aside; @SymbolixAU, there's discussion at https://github.com/geopandas/geo-arrow-spec on using Arrow as a cross-language geospatial in-memory data format. Since you're knowledgeable about the R geospatial ecosystem, your comments would be appreciated. Deck.gl also plans to support using Arrow tables internally: https://github.com/uber/deck.gl/issues/3758. Since there's a lot of cross-language community interest in Arrow, I'm looking to see if the community can standardize on a single specification.

SymbolixAU commented 4 years ago

Thanks @kylebarron - I've pinged a couple of others on twitter who will have an interest too (in the arrow spec).

Since there's a lot of cross-language community interest in Arrow, I'm looking to see if the community can standardize on a single specification.

a noble goal for sure.

SymbolixAU commented 4 years ago

Keeping this up to date, I'm adding an interleave capability to sfheaders which outputs the format

sfheaders:::rcpp_interleave_sf( mapdeck::roads[1:2, ] )
$coordinates
 [1] 145.01429 -37.83046 145.01434 -37.83057 145.01449 -37.83070 145.01599 -37.83148 145.01648 -37.83170 145.01681 -37.83175
[13] 145.01712 -37.83174 145.01750 -37.83167 145.01784 -37.83156 145.01835 -37.83138 145.01860 -37.83133 145.01890 -37.83130
[25] 145.01914 -37.83130 145.01943 -37.83133 145.01973 -37.83138 145.02020 -37.83146 145.02055 -37.83154 145.02064 -37.83159
[37] 145.02075 -37.83159 145.02099 -37.83166 145.01502 -37.83083 145.01556 -37.83113 145.01629 -37.83146 145.01637 -37.83150
[49] 145.01650 -37.83155 145.01659 -37.83157 145.01668 -37.83159 145.01675 -37.83160 145.01689 -37.83162 145.01696 -37.83162
[61] 145.01706 -37.83162 145.01715 -37.83162 145.01729 -37.83160 145.01739 -37.83158 145.01752 -37.83154 145.01817 -37.83132
[73] 145.01834 -37.83127 145.01848 -37.83125 145.01863 -37.83122 145.01881 -37.83121 145.01896 -37.83120 145.01914 -37.83121
[85] 145.01933 -37.83123 145.01951 -37.83126 145.02090 -37.83155 145.02096 -37.83157

$start_indices
[1]  0 20

$n_coordinates
[1] 20 26

$total_coordinates
[1] 46

$stride
[1] 2

$data
... omitted ...

I'm currently keeping this an internal function (so not exposing it for use by other packages yet) because I'm not sure it will stay in this package (as it's not strictly anything to do with sf), nor on the actual structure of the data yet.

This format also aligns in with the "nested list" suggestion on the geo-arrow thread.

I see this a more appropriate format for sending into a buffer, rather than going sf -> df -> buffer, or sf -> geojson -> buffer

tim-salabim commented 4 years ago

Makes sense to me. Seems more memory efficient than a df.

This format also aligns in with the "nested list" suggestion on the geo-arrow thread.

If I read your layout correctly, it's more like the 'flat list' layout the talk about in the linked issue, isn't it?

Are you still thinking about encoding these lists into a flatgeobuf or do you have a different buffer format in mind?

SymbolixAU commented 4 years ago

sorry yes, "flat list" is what I meant.

Are you still thinking about encoding these lists into a flatgeobuf or do you have a different buffer format in mind?

Yes this is still my intention. But, I'm following the geo-arrow, geoPandas and visgl loaders and deckgl arrow support conversations to see where they go too.

kylebarron commented 4 years ago

Deck.gl works with 3D coordinates often, and we need those coordinates to be in one flat array, so I doubt we'll adopt FlatGeobuf internally, but deserialization support is possible/likely sometime in the future.

SymbolixAU commented 4 years ago

oh I didn't realise flatgeobuf was constrained to XY only.

kylebarron commented 4 years ago

It's not, but it stores Z coordinates as a separate array https://github.com/bjornharrtell/flatgeobuf/blob/85ef2682f635bafa8533ecc12afa63200135e7ea/src/fbs/feature.fbs#L7-L10

SymbolixAU commented 4 years ago

Right. I now understand. Thanks for bring me up to speed :)

tim-salabim commented 4 years ago

Yes this is still my intention. But, I'm following the geo-arrow, geoPandas and visgl loaders and deckgl arrow support conversations to see where they go too.

Wow, thanks for the link to the visgl/loaders repo. This looks very interesting and useful!

mdsumner commented 4 years ago

fwiw @SymbolixAU that interleave_sf above could act as a drop-in-replace for silicate::sc_path and silicate::sc_coord, and in another direction to mesh3d (with a stride of 4 and h = 1). I'm sure you're aware ... but it's awesome to me that the level of generality you are targetting will have so many uses. You've seen better than anyone the required set of low level tools.

dcooley commented 4 years ago

yeah I was thinking about silicate too and saw there might be some overlap, but was going to flesh out the design of {interleave} (which I will now make as its own package) first.

(edit: currently signed-in as me, dcooley, not other me symbolixau)