Closed: kylebarron closed this issue 4 years ago
I've started to do some profiling using AWS X-Ray. I'm detailing my efforts here for future reference. Some mosaicJSON comments might be well suited for discussion in https://github.com/developmentseed/mosaicjson-spec.
From your blog posts, it seems that you're mostly working with mid-resolution imagery, so this might be the first attempt at using cogeo-mosaic-tiler with high-resolution (<=1 meter) imagery.
Here's a basic profile of a 14-second tile load at zoom 13 from an NAIP mosaicJSON.
There are essentially two parts: loading the mosaicJSON and tiling the images.

MosaicJSON loading

Just downloading the gzipped MosaicJSON, decompressing, and parsing the JSON string takes 2.5 seconds.
My NAIP mosaicJSON, with quadkeys at zoom 12 and spanning the entire lower 48 U.S. states, is 2.7MB gzipped and 64MB uncompressed. There are 143,000 individual quadkey keys inside `tiles[]`.
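For what it's worth, the decompress and parse steps can be timed separately with a small stdlib sketch like this (my own illustration, not the tiler's actual code path; the S3 download step is omitted):

```python
import gzip
import json
import time

def load_mosaic(gz_bytes: bytes):
    """Decompress and parse a gzipped MosaicJSON blob, timing each step."""
    t0 = time.perf_counter()
    raw = gzip.decompress(gz_bytes)
    t1 = time.perf_counter()
    mosaic = json.loads(raw)
    t2 = time.perf_counter()
    # (parsed dict, gunzip seconds, JSON-parse seconds)
    return mosaic, t1 - t0, t2 - t1
```

That separation tells you whether the 2.5 seconds is dominated by gunzip or by parsing the 64MB JSON string.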
Possibilities for performance improvement:

- Remove redundant information from each quadkey value. Currently each string value is
  `s3://naip-visualization/mn/2013/100cm/rgb/49095/m_4909539_se_15_1_20130814.tif`
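One illustrative way to strip that redundancy: factor out the shared bucket prefix when writing the mosaic and re-add it when reading. This is just a sketch (not part of cogeo-mosaic), and the prefix constant is an assumption based on the NAIP paths above:

```python
# Assumed shared prefix across all asset paths in this mosaic.
PREFIX = "s3://naip-visualization/"

def compress_assets(assets: list) -> list:
    """Drop the bucket prefix that every NAIP key repeats."""
    return [a[len(PREFIX):] if a.startswith(PREFIX) else a for a in assets]

def expand_assets(short: list) -> list:
    """Restore full s3:// URLs when reading the mosaic back."""
    return [s if s.startswith("s3://") else PREFIX + s for s in short]
```

With 143,000 keys, even saving ~24 bytes per asset entry adds up to a few MB uncompressed (gzip already removes much of this, so the win is mostly in parsed size).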
mosaic_tiler
I haven't yet profiled individual tile loading, but that's the bulk of the time (10s). I'll probably switch to single-threaded reading to be able to profile in more depth.
For this tile in particular, there are 4 individual assets, each of which is a COG TIFF of about 14MB.
NAIP imagery is in a regular lat/lon grid, so I might try to explore if performance improvements are possible when you know overlap is minimal, and there are no weird angles.
Possibilities for performance improvement:

- Use higher quadkey zoom. From the spec:

```
// The zoom value for the quadkey index. MUST be =< maxzoom.
// If quadkey_zoom is > minzoom, then on each tile request from zoom between
// minzoom and quadkey_zoom, the tiler will merge each quadkey asset lists.
// The use of quadkey_zoom can be beneficial when dealing with a high number
// of files and a large area.
```
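To make that merge behavior concrete, here's a rough sketch (my own illustration, not cogeo-mosaic's implementation) of resolving a request tile whose zoom is below `quadkey_zoom` by merging the asset lists of all indexed child quadkeys:

```python
from itertools import product

def expand_quadkey(qk: str, quadkey_zoom: int) -> list:
    """All quadkeys at quadkey_zoom covered by a shorter request-tile quadkey."""
    depth = quadkey_zoom - len(qk)
    if depth <= 0:
        return [qk[:quadkey_zoom]]
    # Each extra zoom level appends one base-4 digit.
    return [qk + "".join(s) for s in product("0123", repeat=depth)]

def merged_assets(qk: str, quadkey_zoom: int, tiles: dict) -> list:
    """Merge asset lists of every indexed quadkey under the request tile,
    de-duplicating while preserving order."""
    seen, out = set(), []
    for child in expand_quadkey(qk, quadkey_zoom):
        for asset in tiles.get(child, []):
            if asset not in seen:
                seen.add(asset)
                out.append(asset)
    return out
```

The cost is 4^(quadkey_zoom - z) lookups per low-zoom request, which is why a higher index zoom trades smaller per-key lists for more merging work.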
AWS X-Ray is a bit tedious, because to get good data I have to add

```python
xray_recorder.begin_subsegment('name')
...
xray_recorder.end_subsegment()
```

for each segment of interest, and I have to copy chunks of code from dependencies to profile inner functions.
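A small context manager can cut that begin/end boilerplate. This is a generic sketch against the `begin_subsegment`/`end_subsegment` calls shown above; the `_TraceLog` class is only an offline stand-in recorder for illustration:

```python
from contextlib import contextmanager

@contextmanager
def subsegment(recorder, name: str):
    """Ensure every begin_subsegment gets a matching end_subsegment."""
    recorder.begin_subsegment(name)
    try:
        yield
    finally:
        recorder.end_subsegment()

class _TraceLog:
    """Stand-in recorder, used only to demonstrate the pattern offline."""
    def __init__(self):
        self.events = []
    def begin_subsegment(self, name):
        self.events.append(("begin", name))
    def end_subsegment(self):
        self.events.append(("end", None))

log = _TraceLog()
with subsegment(log, "fetch_mosaic"):
    pass  # code under measurement goes here
```

If I remember right, aws-xray-sdk also ships its own `xray_recorder.in_subsegment(...)` context manager and an `@xray_recorder.capture(...)` decorator that do essentially this.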
Closing in favor of #4.
I'm really sorry @kylebarron, I didn't get the notification for those issues 🤦♂
> MosaicJSON loading
>
> Just downloading the gzipped MosaicJSON, decompressing, and parsing the JSON string takes 2.5 seconds. My NAIP mosaicJSON, with quadkeys at zoom 12 and spanning the entire lower 48 U.S. states is 2.7MB gzipped and 64MB uncompressed. There are 143,000 individual quadkey keys inside `tiles[]`.
Yeah, that's a really big mosaicJSON. Maybe you could split it into multiple mosaicJSONs, with a master mosaicJSON at zoom 11 (quadkey_zoom=11) referencing 4 zoom-12 mosaicJSON files.
This is mostly supported here: https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/utils.py#L478-L488. That said, there is not yet a tool to create this master mosaic, and we should maybe also make https://github.com/developmentseed/cogeo-mosaic/blob/master/cogeo_mosaic/utils.py#L482 multithreaded.
Note: mosaicJSON has been designed with rapidly evolving datasets in mind (like having new data coming in every week), or for relatively small areas (not high-resolution, country-wide data). With NAIP, yes, the dataset evolves, but it changes slowly, so having static mosaicJSONs stored on AWS S3 following a quadkey pattern could also be a solution (I've created worldwide mosaics this way).
Basically, instead of having a single mosaicJSON with quadkey_zoom=12, you have an AWS S3 directory where you store one mosaicJSON per zoom-12 quadkey:

```
s3://my-bucket/naip-mosaic/
-- 000000000011.json.gz
-- 000000221210.json.gz
-- 021321321013.json.gz
....
```
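Under that layout, resolving a tile is just key construction plus one GET. A sketch of the non-network parts (the `naip-mosaic` prefix and `.json.gz` suffix follow the example listing above; the actual S3 fetch via boto3 is omitted):

```python
import gzip
import json

def quadkey_mosaic_key(prefix: str, quadkey: str) -> str:
    """S3 key for a per-quadkey mosaic file, following the layout above."""
    return f"{prefix}/{quadkey}.json.gz"

def parse_mosaic_blob(blob: bytes) -> dict:
    """Decode one gzipped per-quadkey mosaic file."""
    return json.loads(gzip.decompress(blob))
```

Each per-quadkey file is tiny, so the 2.5-second load of the monolithic 64MB mosaic is replaced by one small object fetch per request.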
This is a fully custom solution that makes AWS S3 behave like a database (it could maybe work the same with DynamoDB).
> Possibilities for performance improvement:
>
> - Use non-JSON file format. E.g. protobuf, which should have a smaller file size and be faster to parse than JSON.

👀

> Splitting the mosaicJSON into smaller geographic areas, maybe one for each state. Would make each mosaic smaller, but I'd have to figure out how to combine them on the fly

> Remove redundant information from each quadkey value
https://github.com/developmentseed/mosaicjson-spec/issues/1 Yeah, I thought about that but didn't have time to deep-dive, and I was also afraid to add a customized solution.
Thanks for taking the time to respond!
I'm essentially trying to use cogeo-tiling both for static and dynamic mosaics. For example, a seamless, cloudless Landsat mosaic that's a default basemap, with the option for a user to choose a specific imagery date range, which then creates a smaller-area mosaic on the fly. And at high zooms, NAIP would be entirely static.
I'd never looked into DynamoDB, but I think that's exactly what I'm looking for. A fast, serverless key-value store. The other options seem more "hacky".
I assume you're not interested in formalizing or adding code to support any solution other than the JSON file?
The problem with DynamoDB is you are limited in value size, but this could be a great addition for sure. I think the value limit is 400KB, which should be enough to store a list of files.
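As a sanity check against that limit, one could estimate the serialized size of a quadkey's asset list before writing it. A sketch (400KB is DynamoDB's documented per-item cap; the JSON serialization here is just one plausible storage encoding):

```python
import json

DYNAMODB_ITEM_LIMIT = 400_000  # bytes; DynamoDB caps a single item at 400 KB

def fits_in_item(assets: list) -> bool:
    """Rough check that one quadkey's asset list fits in a single item."""
    return len(json.dumps(assets).encode("utf-8")) < DYNAMODB_ITEM_LIMIT
```

With ~80-byte NAIP paths and only a handful of assets per zoom-12 quadkey, a typical entry is a few hundred bytes, far below the cap.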
I'm really a noob with databases, so your help will be more than welcome. This would require a lot of refactoring, or maybe just adding a db: case in https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L397-L488 (maybe a base class mosaicjsonStorage, with s3, local, url, and dynamodb subclasses, would also be good, each child class having https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L430-L443).
Also just keep in mind that mosaicJSON doesn't just hold the `tiles` asset lists, but also other info like zoom, bounds...
Right, I was thinking of something more along the lines of a two-part system. DynamoDB is fast as a key-value store when you know the key you're looking for. You still need the MosaicJSON to tell you the `quadkey_zoom` level, so that you can query for the tile of interest in the DB.
So 1) grab a tiny MosaicJSON from S3. This should be <20 ms to fetch and parse. (Edit: I'm realizing that S3 latency when a file is cold can be in the 100+ ms range, but this file would probably get grabbed and cached a lot.)
```jsonc
{
  "mosaicjson": "0.0.2",
  "name": "compositing",
  "description": "A simple, light grey world.",
  "version": "1.0.0",
  "attribution": "<a href='http://openstreetmap.org'>OSM contributors</a>",
  "minzoom": 0,
  "maxzoom": 11,
  "quadkey_zoom": 0,
  "bounds": [ -180, -85.05112877980659, 180, 85.0511287798066 ],
  "center": [ -76.275329586789, 39.153492567373, 8 ],
  // OPTIONAL. A URL used to fetch values of quadkey
  "tile_quadkey_url": "dynamodb://path/to/db/{quadkey}"
}
```
Now 2) you have the zoom level and the tile x/y/z, so generate the quadkey and fetch the value for that key within DynamoDB. The only data returned will be the 1-10 ish URL strings for that tile.
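The quadkey derivation in step 2 is mechanical: interleave the bits of x and y from the most significant bit down, one base-4 digit per zoom level. A self-contained sketch (mercantile's `quadkey` helper does the same thing):

```python
def tile_to_quadkey(x: int, y: int, z: int) -> str:
    """Bing-style quadkey for tile (x, y) at zoom z."""
    qk = []
    for i in range(z, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        qk.append(str(digit))
    return "".join(qk)
```

The resulting string is exactly the DynamoDB partition key, so the whole lookup is one computed key plus one `GetItem`.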
With this two-step process, it wouldn't be a lot of refactoring. I.e. just change https://github.com/developmentseed/cogeo-mosaic/blob/f8bc8e69e2d57cda6e138a60d86efb8677cf3f38/cogeo_mosaic/utils.py#L397-L421 to also take the value of `quadkey`, and if there's a `tile_quadkey_url` value in the MosaicJSON, fetch that URL and return.
That's really interesting, let me sleep on this. We should document this before writing any code anyway.
I did a little more digging, and apparently using HTTP requests with DynamoDB is complicated. It's much more common to use an AWS SDK, which would mean that this solution would be much more custom, and less appealing to put into a spec.
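For reference, the SDK call is roughly this shape. This sketch uses a fake in-memory client so it runs offline; the table name and the `assets` list attribute are assumptions, but `get_item` with a typed `Key` is the real low-level boto3 DynamoDB client API:

```python
def fetch_assets(client, table: str, quadkey: str) -> list:
    """Return the asset list stored for one quadkey via a DynamoDB client."""
    resp = client.get_item(TableName=table, Key={"quadkey": {"S": quadkey}})
    item = resp.get("Item")
    if item is None:
        return []
    # Low-level API returns typed attribute values: {"L": [{"S": "..."}]}.
    return [v["S"] for v in item["assets"]["L"]]

class _FakeDynamo:
    """In-memory stand-in for a boto3 DynamoDB client, illustration only."""
    def __init__(self, items):
        self._items = items
    def get_item(self, TableName, Key):
        item = self._items.get(Key["quadkey"]["S"])
        return {"Item": item} if item else {}

db = _FakeDynamo({"0231": {"assets": {"L": [{"S": "s3://bucket/a.tif"}]}}})
```

That SDK dependency is what makes the approach hard to put in a format-neutral spec: a plain HTTP client would need SigV4 request signing to talk to DynamoDB directly.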
Hey again,
I'm working on figuring out how to profile the lambda function, but I wanted to also ask if you had suggestions for improving performance, e.g. fastest image file format, mosaicJSON setup, post-processing options? I'm getting averages of 12-15 seconds for requests to NAIP imagery (using the `first` pixel selection method), and I'd love to see if I can bring that down a bit.

A big performance boost seems to come from removing `@2x` (unsurprisingly). Removing `@2x` and setting the output format to `jpg` gives me ~2-3 second response times for the Landsat endpoint (with a pregenerated mosaicJSON), which I'm happy with, though NAIP times are still slower.
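For context, those two knobs only change the request URL; the `@2x` suffix typically asks the tiler to render a 512px tile (4x the pixels to read and encode). A tiny illustration (the exact route pattern of the tiler endpoint is an assumption here):

```python
def tile_url(base: str, z: int, x: int, y: int,
             scale: int = 1, ext: str = "png") -> str:
    """Build a tile request URL; scale=2 adds the @2x retina suffix."""
    suffix = f"@{scale}x" if scale > 1 else ""
    return f"{base}/{z}/{x}/{y}{suffix}.{ext}"
```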