OvertureMaps / io-site

[Draft] Manifest driven downloads #117

Open Bonkles opened 1 month ago

Bonkles commented 1 month ago

This PR attempts to improve our download times by reducing the number of files we fetch, in two ways: 1) using pre-calculated manifests that record a bbox per parquet file, and 2) omitting any non-visible themes from the download data.

The first item relies on the manifests generated by @Bonkles/manifest-generator.

Together, these two bits of logic let us narrow the total catalog of files we might need to consider, greatly decreasing the number of files to consult.

So, when the user clicks the download button, we can massively pare down the number of HTTP requests and the amount of data loading required to assemble a valid catalog. Previously, if the user wanted buildings, we'd have to send 4 HTTP requests per file, more than 800 requests in total, before we could start downloading data.
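
A minimal sketch of the two filters described above, assuming a hypothetical manifest shape. The `ManifestFile` fields and the `selectFiles` helper are illustrative names, not the actual manifest-generator schema or the code in this PR:

```typescript
// Hypothetical manifest shape: field names are assumptions for illustration.
type BBox = { xmin: number; ymin: number; xmax: number; ymax: number };

interface ManifestFile {
  theme: string; // e.g. "buildings"
  path: string;  // location of the parquet file
  bbox: BBox;    // pre-calculated extent of the file's contents
}

// Two boxes intersect unless one lies entirely to one side of the other.
function intersects(a: BBox, b: BBox): boolean {
  return (
    a.xmin <= b.xmax && a.xmax >= b.xmin &&
    a.ymin <= b.ymax && a.ymax >= b.ymin
  );
}

// Keep only files that belong to a visible theme AND overlap the viewport.
function selectFiles(
  files: ManifestFile[],
  viewport: BBox,
  visibleThemes: string[]
): ManifestFile[] {
  return files.filter(
    (f) => visibleThemes.indexOf(f.theme) !== -1 && intersects(f.bbox, viewport)
  );
}
```

Only the files returned by `selectFiles` would then need their parquet metadata fetched. The sketch ignores viewports that cross the antimeridian.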

Still to-do:

This PR is in draft form because I'm waiting for the new geoparquet reader code revamp to land; once it does, I can test this properly.

H-Plus-Time commented 1 month ago

Two pieces of information would be super useful in the manifests:

  1. The serialized_size value of each file's FileMetaData: geoarrow-wasm can (with a minor tweak) forward it to the with_footer_size_hint method of each file's reader instance. That effectively cuts 1 of the 4 requests (instant disk cache hit).
  2. The file size: strictly speaking, object-store wants last_modified as well, but that's straightforward to fudge (the implementation in geoarrow-wasm doesn't pay attention to it). This should cut out the HEAD request entirely.

It likely won't make a huge difference to this repo given the bounding box optimizations (and the CF distro of course), but a 50% cut in metadata requests is still nice (that and I reckon a bit of offline behaviour/speculative read-ahead is possible with that). Also assuming these manifests make their way out to general use, broad bounding boxes + non-spatial filters will benefit greatly.
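
As a rough sketch of that arithmetic, assuming the manifest gained two hypothetical fields (`fileSize` and `footerSize` are made-up names, not part of any existing manifest schema) and that each present field removes exactly one of the 4 per-file requests:

```typescript
// Hypothetical manifest entry: both optional fields are assumed names.
interface ManifestEntry {
  path: string;
  fileSize?: number;   // total object size in bytes (would avoid the HEAD request)
  footerSize?: number; // serialized_size of the file's FileMetaData (footer size hint)
}

// Per-file metadata request count cited earlier in this thread.
const BASELINE_REQUESTS = 4;

// Estimate how many metadata requests a single file still needs.
function metadataRequests(entry: ManifestEntry): number {
  let n = BASELINE_REQUESTS;
  if (entry.footerSize !== undefined) n -= 1; // footer size known up front
  if (entry.fileSize !== undefined) n -= 1;   // no HEAD request needed
  return n;
}

// Sum the estimate over every file selected for download.
function totalMetadataRequests(entries: ManifestEntry[]): number {
  return entries.reduce((sum, e) => sum + metadataRequests(e), 0);
}
```

With both fields present a file drops from 4 requests to 2, which is where the 50% figure comes from. For an illustrative 200 files that is 400 requests instead of 800.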

Bonkles commented 1 month ago

Agreed @H-Plus-Time - the manifests are supposed to be general purpose info that helps everybody, so this is great info for me to consider. While I have this site in mind as a use case, I want to make sure useful info goes in for others.