Closed: kylebarron closed this issue 4 years ago
I've started to do some profiling using AWS X-Ray. I'm detailing my efforts here for future reference. Some mosaicJSON comments might be well suited for discussion in https://github.com/developmentseed/mosaicjson-spec.
From your blog posts, it seems that you're mostly working with mid-resolution imagery, so this might be the first attempt at using cogeo-mosaic-tiler with high-resolution (<=1 meter) imagery.
Here's a basic profile of a 14-second tile load at zoom 13 from an NAIP mosaicJSON.
There are essentially two parts: loading the mosaicJSON and tiling the images.

MosaicJSON loading

Just downloading the gzipped MosaicJSON, decompressing, and parsing the JSON string takes 2.5 seconds.
My NAIP mosaicJSON, with quadkeys at zoom 12 and spanning the entire lower 48 U.S. states, is 2.7MB gzipped and 64MB uncompressed. There are 143,000 individual quadkey keys inside `tiles[]`.
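For what it's worth, the decompress and parse steps can be timed separately with a small stdlib sketch like this (my own illustration, not the tiler's actual code path; the S3 download step is omitted):

```python
import gzip
import json
import time

def load_mosaic(gz_bytes: bytes):
    """Decompress and parse a gzipped MosaicJSON blob, timing each step."""
    t0 = time.perf_counter()
    raw = gzip.decompress(gz_bytes)
    t1 = time.perf_counter()
    mosaic = json.loads(raw)
    t2 = time.perf_counter()
    # (parsed dict, gunzip seconds, JSON-parse seconds)
    return mosaic, t1 - t0, t2 - t1
```

That separation tells you whether the 2.5 seconds is dominated by gunzip or by parsing the 64MB JSON string.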
Possibilities for performance improvement:

- Remove redundant information from each quadkey value. Currently each string value is
  `s3://naip-visualization/mn/2013/100cm/rgb/49095/m_4909539_se_15_1_20130814.tif`
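One illustrative way to strip that redundancy: factor out the shared bucket prefix when writing the mosaic and re-add it when reading. This is just a sketch (not part of cogeo-mosaic), and the prefix constant is an assumption based on the NAIP paths above:

```python
# Assumed shared prefix across all asset paths in this mosaic.
PREFIX = "s3://naip-visualization/"

def compress_assets(assets: list) -> list:
    """Drop the bucket prefix that every NAIP key repeats."""
    return [a[len(PREFIX):] if a.startswith(PREFIX) else a for a in assets]

def expand_assets(short: list) -> list:
    """Restore full s3:// URLs when reading the mosaic back."""
    return [s if s.startswith("s3://") else PREFIX + s for s in short]
```

With 143,000 keys, even saving ~24 bytes per asset entry adds up to a few MB uncompressed (gzip already removes much of this, so the win is mostly in parsed size).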
mosaic_tiler
I haven't yet profiled individual tile loading, but that's the bulk of the time (10s). I'll probably switch to single-threaded reading to be able to profile in more depth.
For this tile in particular, there are 4 individual assets, each of which is a COG TIFF of about 14MB.
NAIP imagery is in a regular lat/lon grid, so I might try to explore if performance improvements are possible when you know overlap is minimal, and there are no weird angles.
Possibilities for performance improvement:

- Use higher quadkey zoom. From the spec:

```
// The zoom value for the quadkey index. MUST be =< maxzoom.
// If quadkey_zoom is > minzoom, then on each tile request from zoom between
// minzoom and quadkey_zoom, the tiler will merge each quadkey asset lists.
// The use of quadkey_zoom can be beneficial when dealing with a high number
// of files and a large area.
```
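To make that merge behavior concrete, here's a rough sketch (my own illustration, not cogeo-mosaic's implementation) of resolving a request tile whose zoom is below `quadkey_zoom` by merging the asset lists of all indexed child quadkeys:

```python
from itertools import product

def expand_quadkey(qk: str, quadkey_zoom: int) -> list:
    """All quadkeys at quadkey_zoom covered by a shorter request-tile quadkey."""
    depth = quadkey_zoom - len(qk)
    if depth <= 0:
        return [qk[:quadkey_zoom]]
    # Each extra zoom level appends one base-4 digit.
    return [qk + "".join(s) for s in product("0123", repeat=depth)]

def merged_assets(qk: str, quadkey_zoom: int, tiles: dict) -> list:
    """Merge asset lists of every indexed quadkey under the request tile,
    de-duplicating while preserving order."""
    seen, out = set(), []
    for child in expand_quadkey(qk, quadkey_zoom):
        for asset in tiles.get(child, []):
            if asset not in seen:
                seen.add(asset)
                out.append(asset)
    return out
```

The cost is 4^(quadkey_zoom - z) lookups per low-zoom request, which is why a higher index zoom trades smaller per-key lists for more merging work.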
AWS X-Ray is a bit tedious, because to get good data I have to add

```python
xray_recorder.begin_subsegment('name')
...
xray_recorder.end_subsegment()
```

for each segment of interest, and I have to copy chunks of code from dependencies to profile inner functions.
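A small context manager can cut that begin/end boilerplate. This is a generic sketch against the `begin_subsegment`/`end_subsegment` calls shown above; the `_TraceLog` class is only an offline stand-in recorder for illustration:

```python
from contextlib import contextmanager

@contextmanager
def subsegment(recorder, name: str):
    """Ensure every begin_subsegment gets a matching end_subsegment."""
    recorder.begin_subsegment(name)
    try:
        yield
    finally:
        recorder.end_subsegment()

class _TraceLog:
    """Stand-in recorder, used only to demonstrate the pattern offline."""
    def __init__(self):
        self.events = []
    def begin_subsegment(self, name):
        self.events.append(("begin", name))
    def end_subsegment(self):
        self.events.append(("end", None))

log = _TraceLog()
with subsegment(log, "fetch_mosaic"):
    pass  # code under measurement goes here
```

If I remember right, aws-xray-sdk also ships its own `xray_recorder.in_subsegment(...)` context manager and an `@xray_recorder.capture(...)` decorator that do essentially this.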
Closing in favor of #4.
I'm really sorry @kylebarron, I didn't get the notification for those issues 🤦♂
> MosaicJSON loading
>
> Just downloading the gzipped MosaicJSON, decompressing, and parsing the JSON string takes 2.5 seconds. My NAIP mosaicJSON, with quadkeys at zoom 12 and spanning the entire lower 48 U.S. states is 2.7MB gzipped and 64MB uncompressed. There are 143,000 individual quadkey keys inside `tiles[]`.
Yeah, that's a really big mosaicJSON. Maybe you could split it into multiple mosaicJSONs, with a master mosaicJSON at zoom 11 (quadkey_zoom=11) referencing 4 zoom-12 mosaicJSON files.
This is mostly supported here: https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/utils.py#L478-L488. That said, there is not yet a tool to create this master mosaic, and we should maybe also make https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/utils.py#L482 multithreaded.
Note: mosaicJSON has been designed with rapidly evolving datasets in mind (like having new data coming in every week), or for relatively small areas (not high-resolution, country-wide data). With NAIP, yes, the dataset evolves, but it changes slowly, so having static mosaicJSONs stored on AWS S3 following a quadkey pattern could also be a solution (I've created worldwide mosaics this way).
Basically, instead of having a single mosaicJSON with quadkey_zoom=12, you have an AWS S3 directory where you store one mosaicJSON per zoom-12 quadkey:

```
s3://my-bucket/naip-mosaic/
-- 000000000011.json.gz
-- 000000221210.json.gz
-- 021321321013.json.gz
....
```
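Under that layout, resolving a tile is just key construction plus one GET. A sketch of the non-network parts (the `naip-mosaic` prefix and `.json.gz` suffix follow the example listing above; the actual S3 fetch via boto3 is omitted):

```python
import gzip
import json

def quadkey_mosaic_key(prefix: str, quadkey: str) -> str:
    """S3 key for a per-quadkey mosaic file, following the layout above."""
    return f"{prefix}/{quadkey}.json.gz"

def parse_mosaic_blob(blob: bytes) -> dict:
    """Decode one gzipped per-quadkey mosaic file."""
    return json.loads(gzip.decompress(blob))
```

Each per-quadkey file is tiny, so the 2.5-second load of the monolithic 64MB mosaic is replaced by one small object fetch per request.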
This is a fully custom solution that makes AWS S3 behave like a database (it could maybe work the same with DynamoDB).
> Possibilities for performance improvement:
>
> - Use non-JSON file format. E.g. protobuf, which should have a smaller file size and be faster to parse than JSON.

👀

> Splitting the mosaicJSON into smaller geographic areas, maybe one for each state. Would make each mosaic smaller, but I'd have to figure out how to combine them on the fly

> Remove redundant information from each quadkey value
https://github.com/developmentseed/mosaicjson-spec/issues/1 Yeah, I thought about that but didn't have time to deep-dive, and I was also afraid to add a customized solution.
Thanks for taking the time to respond!
I'm essentially trying to use cogeo-tiling both for static and dynamic mosaics. For example, a seamless, cloudless Landsat mosaic that's a default basemap, with the option for a user to choose a specific imagery date range, which then creates a smaller-area mosaic on the fly. And at high zooms, NAIP would be entirely static.
I'd never looked into DynamoDB, but I think that's exactly what I'm looking for. A fast, serverless key-value store. The other options seem more "hacky".
I assume you're not interested in formalizing or adding code to support any solution other than the JSON file?
The problem with DynamoDB is you are limited in value size, but this could be a great addition for sure. I think the value limit is 400KB, which should be enough to store a list of files.
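As a sanity check against that limit, one could estimate the serialized size of a quadkey's asset list before writing it. A sketch (400KB is DynamoDB's documented per-item cap; the JSON serialization here is just one plausible storage encoding):

```python
import json

DYNAMODB_ITEM_LIMIT = 400_000  # bytes; DynamoDB caps a single item at 400 KB

def fits_in_item(assets: list) -> bool:
    """Rough check that one quadkey's asset list fits in a single item."""
    return len(json.dumps(assets).encode("utf-8")) < DYNAMODB_ITEM_LIMIT
```

With ~80-byte NAIP paths and only a handful of assets per zoom-12 quadkey, a typical entry is a few hundred bytes, far below the cap.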
I'm really a noob with databases, so your help will be more than welcome. This would require a lot of refactoring, or maybe just adding a db: case in https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L397-L488 (maybe a base class mosaicjsonStorage, with s3, local, url, and dynamodb subclasses, would also be good, each child class having https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L430-L443).
Also just keep in mind that mosaicJSON doesn't just hold the `tiles` asset lists, but also other info like zoom, bounds...
Right, I was thinking of something more along the lines of a two-part system. DynamoDB is fast as a key-value store when you know the key you're looking for. You still need the MosaicJSON to tell you the `quadkey_zoom` level, so that you can query for the tile of interest in the DB.
So 1) grab a tiny MosaicJSON from S3. This should be <20 ms to fetch and parse. (Edit: I'm realizing that S3 latency when a file is cold can be in the 100+ ms range, but this file would probably get grabbed and cached a lot.)
```jsonc
{
  "mosaicjson": "0.0.2",
  "name": "compositing",
  "description": "A simple, light grey world.",
  "version": "1.0.0",
  "attribution": "<a href='http://openstreetmap.org'>OSM contributors</a>",
  "minzoom": 0,
  "maxzoom": 11,
  "quadkey_zoom": 0,
  "bounds": [ -180, -85.05112877980659, 180, 85.0511287798066 ],
  "center": [ -76.275329586789, 39.153492567373, 8 ],
  // OPTIONAL. A URL used to fetch values of quadkey
  "tile_quadkey_url": "dynamodb://path/to/db/{quadkey}"
}
```
Now 2) you have the zoom level and the tile x/y/z, so generate the quadkey and fetch the value for that key within DynamoDB. The only data returned will be the 1-10 ish URL strings for that tile.
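The quadkey derivation in step 2 is mechanical: interleave the bits of x and y from the most significant bit down, one base-4 digit per zoom level. A self-contained sketch (mercantile's `quadkey` helper does the same thing):

```python
def tile_to_quadkey(x: int, y: int, z: int) -> str:
    """Bing-style quadkey for tile (x, y) at zoom z."""
    qk = []
    for i in range(z, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        qk.append(str(digit))
    return "".join(qk)
```

The resulting string is exactly the DynamoDB partition key, so the whole lookup is one computed key plus one `GetItem`.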
With this two-step process, it wouldn't be a lot of refactoring. I.e. just change https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L397-L421 to also take the value of `quadkey`, and if there's a `tile_quadkey_url` value in the MosaicJSON, fetch that URL and return.
That's really interesting, let me sleep on this. We should document this before writing any code anyway.
I did a little more digging, and apparently using HTTP requests with DynamoDB is complicated. It's much more common to use an AWS SDK, which would mean that this solution would be much more custom, and less appealing to put into a spec.
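For reference, the SDK call is roughly this shape. This sketch uses a fake in-memory client so it runs offline; the table name and the `assets` list attribute are assumptions, but `get_item` with a typed `Key` is the real low-level boto3 DynamoDB client API:

```python
def fetch_assets(client, table: str, quadkey: str) -> list:
    """Return the asset list stored for one quadkey via a DynamoDB client."""
    resp = client.get_item(TableName=table, Key={"quadkey": {"S": quadkey}})
    item = resp.get("Item")
    if item is None:
        return []
    # Low-level API returns typed attribute values: {"L": [{"S": "..."}]}.
    return [v["S"] for v in item["assets"]["L"]]

class _FakeDynamo:
    """In-memory stand-in for a boto3 DynamoDB client, illustration only."""
    def __init__(self, items):
        self._items = items
    def get_item(self, TableName, Key):
        item = self._items.get(Key["quadkey"]["S"])
        return {"Item": item} if item else {}

db = _FakeDynamo({"0231": {"assets": {"L": [{"S": "s3://bucket/a.tif"}]}}})
```

That SDK dependency is what makes the approach hard to put in a format-neutral spec: a plain HTTP client would need SigV4 request signing to talk to DynamoDB directly.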
Hey again,
I'm working on figuring out how to profile the lambda function, but I wanted to also ask if you had suggestions for improving performance, e.g. fastest image file format, mosaicJSON setup, post-processing options? I'm getting averages of 12-15 seconds for requests to NAIP imagery (using the `first` pixel selection method), and I'd love to see if I can bring that down a bit.

A big performance boost seems to come from removing `@2x` (unsurprisingly). Removing `@2x` and setting the output format to `jpg` gives me ~2-3 second response times for the Landsat endpoint (with a pregenerated mosaicJSON), which I'm happy with, though NAIP times are still slower.
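For context, those two knobs only change the request URL; the `@2x` suffix typically asks the tiler to render a 512px tile (4x the pixels to read and encode). A tiny illustration (the exact route pattern of the tiler endpoint is an assumption here):

```python
def tile_url(base: str, z: int, x: int, y: int,
             scale: int = 1, ext: str = "png") -> str:
    """Build a tile request URL; scale=2 adds the @2x retina suffix."""
    suffix = f"@{scale}x" if scale > 1 else ""
    return f"{base}/{z}/{x}/{y}{suffix}.{ext}"
```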