a-b-street / 15m

New version of the 15-minute neighborhood tool
https://a-b-street.github.io/15m/
Apache License 2.0

Switch from flatgeobuffer to geomedea for GTFS #8

Closed dabreegster closed 2 months ago

dabreegster commented 2 months ago

CC @michaelkirk, I'm trying out geomedea for the use case I described in Discord!

| Metric | flatgeobuf | geomedea |
|---|---|---|
| File size | 99MB | 53MB |
| Bristol | 3.6MB in 23 requests | 5MB in 20 requests |
| Elephant & Castle | 6.4MB in 935 requests, 1.76 minutes | 9.4MB in 24 requests, 8.3s |

Bristol doesn't have many GTFS trip shapes intersecting the area, while E&C in London has loads.

Unless I'm measuring something wrong, the current approach with geomedea incurs more bandwidth, but through far fewer requests and much lower latency.

michaelkirk commented 2 months ago

> Unless I'm measuring something wrong, the current approach with geomedea incurs more bandwidth, but through far fewer requests and much lower latency.

It's not entirely surprising that geomedea might request more data.

In FGB, features are stored in a single uncompressed buffer, so the index tells us exactly where each feature lives in the file. Using this, I implemented smart feature batching: feature requests are only merged into a single request when adjacent features are "close enough".
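The "close enough" merging above can be sketched as byte-range coalescing. This is an illustrative sketch, not the actual flatgeobuf client code; the names, `ByteRange` struct, and `max_gap` threshold are all hypothetical:

```rust
// Hypothetical sketch of feature batching: merge byte ranges into one
// HTTP range request when the gap between them is small enough that
// downloading the gap is cheaper than issuing another request.

#[derive(Debug, PartialEq, Clone, Copy)]
struct ByteRange {
    start: u64,
    end: u64, // half-open [start, end)
}

fn coalesce(mut ranges: Vec<ByteRange>, max_gap: u64) -> Vec<ByteRange> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<ByteRange> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Extend the previous request if the wasted gap is small enough.
            Some(last) if r.start <= last.end + max_gap => {
                last.end = last.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    let features = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 120, end: 200 },       // 20-byte gap: merged
        ByteRange { start: 10_000, end: 10_050 }, // far away: separate request
    ];
    let reqs = coalesce(features, 64);
    assert_eq!(reqs.len(), 2);
    assert_eq!(reqs[0], ByteRange { start: 0, end: 200 });
    println!("{:?}", reqs);
}
```

The trade-off is the same one measured in the table: a larger `max_gap` downloads more unneeded bytes but issues fewer requests.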

To take advantage of compression, geomedea groups features into pages, so you have to download an entire page even if you only need one feature from it. Because the features live in compressed pages, request batching works a little differently: it can still be done, but it'd be "page batches" rather than "feature batches". I haven't implemented this yet, but it should be doable in a non-breaking way.
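The page-batching idea can be sketched as: map each needed feature to its page, dedupe, then merge runs of consecutive pages into one range request per run. This is only an illustration; geomedea's real page layout, page size, and API differ, and `page_requests` is a made-up name:

```rust
// Illustrative page batching, not geomedea's actual implementation.
// Each run of consecutive page indices becomes one HTTP range request.

fn page_requests(feature_offsets: &[u64], page_size: u64) -> Vec<(u64, u64)> {
    // Which pages contain the features we need?
    let mut pages: Vec<u64> = feature_offsets.iter().map(|o| o / page_size).collect();
    pages.sort_unstable();
    pages.dedup();

    // Merge consecutive page indices into (first_page, last_page) runs.
    let mut runs: Vec<(u64, u64)> = Vec::new();
    for p in pages {
        match runs.last_mut() {
            Some(run) if p == run.1 + 1 => run.1 = p,
            _ => runs.push((p, p)),
        }
    }
    runs
}

fn main() {
    // Features at these byte offsets, with hypothetical 4 KiB pages:
    let offsets = [100, 5_000, 9_000, 100_000];
    let runs = page_requests(&offsets, 4096);
    // Pages 0, 1, 2 form one run; page 24 is its own request.
    assert_eq!(runs, vec![(0, 2), (24, 24)]);
    println!("{:?}", runs);
}
```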

michaelkirk commented 2 months ago

Could you do me a favor? Run with:

```
RUST_LOG=debug
```

And give me the lines matching `Finished using an HTTP client. used_bytes`, e.g.:

```
Finished using an HTTP client. used_bytes=839712, wasted_bytes=293690, req_count=4
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1
```

`wasted_bytes` should correspond to the bytes that could be saved by having more clever page-batching.
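A hypothetical sketch of how counters like these could be tracked (this mirrors the fields in the log line above but is not the actual geomedea implementation): each range request downloads some bytes, of which only part belongs to features the query actually needed.

```rust
// Hypothetical accounting in the spirit of the logged counters.

#[derive(Default, Debug)]
struct HttpStats {
    used_bytes: u64,
    wasted_bytes: u64,
    req_count: u64,
}

impl HttpStats {
    fn record(&mut self, downloaded: u64, used: u64) {
        self.req_count += 1;
        self.used_bytes += used;
        // Bytes fetched only because they shared a page/batch with needed data.
        self.wasted_bytes += downloaded.saturating_sub(used);
    }
}

fn main() {
    let mut stats = HttpStats::default();
    stats.record(1_048_576, 754_886); // a 1 MiB page batch, partially used
    stats.record(17, 17);             // a tiny, fully-used request
    assert_eq!(stats.wasted_bytes, 293_690);
    assert_eq!(stats.req_count, 2);
    println!(
        "Finished using an HTTP client. used_bytes={}, wasted_bytes={}, req_count={}",
        stats.used_bytes, stats.wasted_bytes, stats.req_count
    );
}
```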

michaelkirk commented 2 months ago

> `wasted_bytes` should correspond to the bytes that could be saved by having more clever page-batching.

I had a go at "more clever page-batching" here: https://github.com/michaelkirk/geomedea/pull/12

michaelkirk commented 2 months ago

I was looking at the network traffic for your existing FGB integration, and I feel like there must be a bug in the FGB client. It makes no sense to issue all those tiny nearby requests (4 bytes?!).

[screenshot: network traffic, 2024-07-18 17:01]

I'm looking into that now.

dabreegster commented 2 months ago

After updating to the latest commit (417d4f43cd35aa98aea19a0b17632c8309b50466):

These two cases are now competitive with fgb, so I'm almost definitely going to switch to this. :)

dabreegster commented 2 months ago

With the new property encoding...

Elephant reads 6.3MB over 23 requests:

```
Finished using an HTTP client. used_bytes=156716, wasted_bytes=0, req_count=5
Finished using an HTTP client. used_bytes=3776001, wasted_bytes=1490472, req_count=17
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1
```

Bristol reads 2.5MB over 25 requests:

```
Finished using an HTTP client. used_bytes=144956, wasted_bytes=1344, req_count=9
Finished using an HTTP client. used_bytes=1078601, wasted_bytes=473493, req_count=15
Finished using an HTTP client. used_bytes=17, wasted_bytes=0, req_count=1
```

So the new encoding doesn't give a huge advantage yet, but it still opens the way to doing something nicer later with delta encoding.
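As a sketch of why delta encoding could help here: consecutive coordinates along a GTFS trip shape are close together, so storing differences instead of absolute values yields many small numbers, which varint encoding or a general-purpose compressor shrinks much better. This is purely illustrative and not geomedea's actual wire format:

```rust
// Illustrative delta encoding, not geomedea's actual encoding.

fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0;
    values
        .iter()
        .map(|&v| {
            let d = v - prev;
            prev = v;
            d
        })
        .collect()
}

fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

fn main() {
    // Fixed-point longitudes (scaled by 1e7) along a shape:
    // large absolute values, but neighbors are close.
    let lons = vec![-11_000_000, -11_000_123, -11_000_201, -11_000_377];
    let deltas = delta_encode(&lons);
    // After the first value, every delta is tiny.
    assert_eq!(deltas, vec![-11_000_000, -123, -78, -176]);
    // Round-trips losslessly.
    assert_eq!(delta_decode(&deltas), lons);
    println!("{:?}", deltas);
}
```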

I'm going to merge this in now and continue to play with encoding / perf later on. It's a huge improvement for very little work, so thanks so much for the new format, adding WASM support, and these page-batching fixes!

michaelkirk commented 1 month ago

Here's Elephant & Castle with https://github.com/flatgeobuf/flatgeobuf/pull/376

[screenshot: network traffic with the fix, 2024-07-23 15:07]

tl;dr: there was a bad bug in the HTTP fetch implementation, triggered by those 1.05MB requests. It hadn't come up with the shape of my own data and requests, so thanks for helping to uncover it.

With the bug fix, the two formats seem to be in the same ballpark of network transfer for your queries.

Edit: for completeness, here's the same with geomedea (one more request, 15% fewer bytes transferred):

[screenshot: network traffic with geomedea, 2024-07-23 15:17]