Closed kylebarron closed 4 years ago
Thanks @kylebarron this is really (really) appreciated.
Earlier last month I started to run some benchmarks (using pytest-benchmark) and also profiled some functions to see where we might be able to win some seconds, because yes, when using multiple assets for a tile it can be slow.
e.g. I found that using multithreading (even with only one concurrent thread) was way slower than a simple function call. But with more than 3 assets it was faster (though you still rely on the internet connection speed).
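(A toy sketch of that tradeoff, with `fake_tile_read` standing in for a network-bound asset read; none of this is rio-tiler code, and the sleep just simulates network latency:)

```python
import time
from concurrent import futures

def fake_tile_read(asset):
    time.sleep(0.05)  # simulate the network latency of fetching one asset
    return asset

def read_sequential(assets):
    # one plain function call per asset, back to back
    return [fake_tile_read(a) for a in assets]

def read_threaded(assets, max_workers):
    # overlap the simulated network waits across a thread pool
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_tile_read, assets))

assets = [f"asset-{i}" for i in range(6)]

t0 = time.perf_counter()
read_sequential(assets)
sequential_s = time.perf_counter() - t0

t0 = time.perf_counter()
read_threaded(assets, max_workers=4)
threaded_s = time.perf_counter() - t0

print(f"sequential: {sequential_s:.2f}s, threaded: {threaded_s:.2f}s")
```

With 1 or 2 assets the pool's startup and scheduling overhead can dominate, which would match the observation that a single concurrent thread is slower than a plain call.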
About full tile/partial tile, I'm not convinced this will make a difference to be honest (but I'd love to be wrong). In rio-tiler, we added benchmarks where we test multiple data types and nodata values, but also the difference between full and partial tiles (https://github.com/cogeotiff/rio-tiler/blob/v2/tests/benchmarks/__init__.py#L62-L95)
The results (not sure if you can see the numbers in https://app.circleci.com/pipelines/github/cogeotiff/rio-tiler/18/workflows/431270d7-1e33-4681-ac1d-d1f094096b29/jobs/1017) show that a partial read is faster than a full read (while they both return 256x256 arrays).
---------------------- benchmark 'boundless tile': 32 tests ----------------------
Name (time in ms)         Min              Max              Mean             Median
----------------------------------------------------------------------------------
int8-none           10.9991 (1.0)    15.2086 (1.0)    11.6661 (1.0)    11.2810 (1.00)

------------------------- benchmark 'full tile': 32 tests ------------------------
Name (time in ms)         Min              Max              Mean             Median
----------------------------------------------------------------------------------
uint8-none          12.7991 (1.0)    19.5376 (1.12)   13.8260 (1.01)   13.1201 (1.0)
GDAL is smart enough to fetch only what it needs, thus a partial tile read should always be faster than a full tile read. After that, yes, it means we need to deal with multiple 256x256 arrays, but I don't think that slows down the process by a matter of seconds (I could be wrong).
Other note: with rio-tiler v2.0 we are using a new calculate_default_transform (from TerraCotta) which should improve the performance in get_vrt_transform.
I'm also surprised vrt.read takes almost the same time, and I wonder if there is not something else going on (with GDAL or Rasterio).
About full tile/partial tile, I'm not convinced this will make a difference to be honest (but I'd love to be wrong)
Well, given the simple benchmark in https://github.com/cogeotiff/rio-tiler/pull/168, I'm more likely to be wrong.
When I saw that it was taking 1.7s to read every asset, I assumed that it was due to the number of pixels read/generated, but it appears that's a weak correlation.
Most likely, you're right that I underestimated GDAL. It does make sense that GDAL would optimize internally to fill nodata quickly when it knows it's reading outside the bounds of the source image.
we test multiple data types and nodata values, but also the difference between full and partial tiles
Is full full overlap between the mercator tile and the asset, mask no overlap, and boundless partial overlap?
I'm also surprised vrt.read takes almost the same time and I wonder if there is not something else going on (with GDAL or Rasterio)
I'm still stumped about that. It seems to me like 1.7 seconds on a call to vrt.read is an awfully long time, especially when my local tests were ~26.8 ms.
I also tested the same simple read benchmark in https://github.com/cogeotiff/rio-tiler/pull/168 on a small EC2 instance with 1 CPU in us-west-2 -- the same region as the COG images -- and set address to the S3 path, and read times were ~70ms!
In [8]: address = 's3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif'
In [9]: %timeit tile(address, x, y, z, tilesize)
72.1 ms Β± 3.49 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)
So why is that 20x slower on Lambda?
Is full full overlap between the mercator tile and the asset, mask is no overlap, and boundless is partial overlap?
Full means the mercator tile fully overlaps; boundless means a partial read (reading over the bounds of the file)
I'm still stumped about that. It seems to me like 1.7 seconds on a call to vrt.read is an awfully long time. Especially when my local tests were ~ 26.8 ms.
~Network + S3 + requester-pays bucket.~ (edited because you were testing the same file from EC2 and Lambda, which should then be similar)
Also, caching might speed things up on EC2.
Can you wrap the rio-tiler call into a multithreading block on your EC2?
Also, caching might speed things up on EC2.
I think you're right that caching is explaining the difference here, but I don't know where the caching is happening. Is it on the EC2 side or the S3 side?
%timeit runs the provided code many times and averages among them. I came back after ~1 hour and ran it again with a single %time.
In [3]: %time data, mask = tile(address, x, y, z, tilesize)
CPU times: user 77 ms, sys: 8.43 ms, total: 85.4 ms
Wall time: 782 ms
Successive runs were much faster, and after a couple loads, when the COG is presumably fully-cached, down to double-digit ms.
In [24]: %time data, mask = tile(address, x, y, z, tilesize)
CPU times: user 35 ms, sys: 0 ns, total: 35 ms
Wall time: 38 ms
Testing the full tile response time for single threading vs multithreading, multithreading was slower for every tile. These are dev tools network screenshots for the tiles that load in the default view here in SF.
Single threaded, i.e.
for asset in assets:
    t, m = tiler(asset, tile_x=tile_x, tile_y=tile_y, tile_z=tile_z, **kwargs)
    t = numpy.ma.array(t)
    t.mask = m == 0

    pixel_selection.feed(t)
    if pixel_selection.is_done:
        return pixel_selection.data
Multi threaded, i.e.
with futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
    future_tasks = [executor.submit(_tiler, asset) for asset in assets]
    for t, m in _filter_futures(future_tasks):
        t = numpy.ma.array(t)
        t.mask = m == 0

        pixel_selection.feed(t)
        if pixel_selection.is_done:
            return pixel_selection.data
I didn't realize it was possible to do this, but on the latter it still profiled the tile load within the threads (note: page is zoomed out, bars towards the bottom are misaligned). Calling vrt.read on each executor takes 7-8 seconds!
Here's the modified source I'm profiling with... The only call within the vrt.read named segment is vrt.read itself.
xray_recorder.begin_subsegment('vrt.read')
data = vrt.read(
    out_shape=out_shape,
    indexes=indexes,
    window=out_window,
    resampling=Resampling[resampling_method],
)
xray_recorder.end_subsegment()
The only reference within rasterio to multithreading that I could find is in reference to a CPU-bound function. Not sure if it's possible to optimize for a network-bound function. I'll have to read up more on the difference between asyncio and multithreading to see if there could be some optimizations there.
I love all of the benchmarks! I've been writing an asyncio-based COG tiler and this thread inspired me to benchmark its performance on mosaic datasets against the current implementation.
The asyncio implementation is very similar to what currently happens in rio_tiler_mosaic, but rather than implementing concurrency through concurrent.futures.ThreadPoolExecutor, it wraps rio_tiler.main.tile in a dummy coroutine and uses await asyncio.gather(*coroutines).
For test data, I took the NAIP image @kylebarron has been using for his benchmarks (https://github.com/cogeotiff/rio-tiler/pull/168) and uploaded it to an S3 bucket under 50 different keys to replicate the workflow of requesting the same tile over many different images. This also prevents GDAL from caching the file handle during the benchmark.
Because utils.tile_read seems to take up the majority of the request, I only benchmarked the logic to map a tile read across an array of images using both methods. To keep things similar, I used an EC2 instance with a single CPU (t2.small) and used %timeit to record times (7 runs, 1 loop each).
I ran a few benchmarks:
Here are the results in some very poorly formatted matplotlib diagrams:
Because the instance is pretty small, we pretty quickly become IO-bound at higher concurrencies. Here are the network logs from the instance:
A few reasons why the asyncio implementation outperforms:
Wow. I was about to give up after a full day of research.
I'm impressed with the asyncio performance despite being "above" rasterio. I.e. I was thinking that to get performance gains you'd have to rewrite a portion of rasterio or GDAL to support async, and reading Even's thoughts made me pessimistic.
Edit: in particular, I'm curious how asyncio knows when rasterio is making a network request, and thus when to pass control back to the event loop.
I've never worked with asyncio in Python before. Do you have a code snippet you can share that I can test instead of the current threading implementation? I'd try to publish an asyncio version on Lambda and see how the benchmarks go.
I'm still working on trying to set up the coroutines correctly... This is my most recent effort but I don't think it's correctly returning the data
async def tiler_coroutine(*args, **kwargs):
    return _tiler(*args, **kwargs)

async def main():
    return await asyncio.gather(*[tiler_coroutine(asset) for asset in assets])

tasks = asyncio.run(main())
@kylebarron I realized that there might be something else going on. Can you try setting GDAL_DISABLE_READDIR_ON_OPEN=TRUE here: https://github.com/developmentseed/cogeo-mosaic-tiler/blob/master/serverless.yml#L48
It doesn't really seem like it makes a difference
Multithreaded, GDAL_DISABLE_READDIR_ON_OPEN set to YES
Multithreaded, GDAL_DISABLE_READDIR_ON_OPEN set to EMPTY_DIR
Yes, sorry, I've hidden the comment because after another check I figured it won't change anything
I seem to have "working" async code:
async def tiler_coroutine(*args, **kwargs):
    try:
        return _tiler(*args, **kwargs)
    except Exception:
        return None

async def main():
    return await asyncio.gather(*[tiler_coroutine(asset) for asset in assets])

tasks = asyncio.run(main())
tasks = [t for t in tasks if t is not None]
it works in the sense that it computes correctly and I can see the output image, but with the tracing I can see it's not concurrent:
So I'm not sure what I'm doing wrong. @geospatial-jeff any advice?
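(For reference, a sketch of what I think would make this actually concurrent, under the assumption that _tiler stays a plain blocking function: an async def that calls blocking code directly never yields to the event loop, so asyncio.gather still runs the reads back to back. Offloading each call to a thread with loop.run_in_executor lets the loop overlap them. blocking_tiler below is a stand-in for _tiler:)

```python
import asyncio
import time

def blocking_tiler(asset):
    time.sleep(0.05)  # stand-in for a blocking, network-bound vrt.read
    return asset

async def tiler_coroutine(loop, asset):
    # run_in_executor(None, ...) uses the loop's default thread pool;
    # note it takes positional args only (functools.partial for kwargs)
    return await loop.run_in_executor(None, blocking_tiler, asset)

async def main(assets):
    loop = asyncio.get_running_loop()
    return await asyncio.gather(*[tiler_coroutine(loop, a) for a in assets])

assets = [f"asset-{i}" for i in range(6)]
start = time.perf_counter()
results = asyncio.run(main(assets))
elapsed = time.perf_counter() - start
print(f"{len(results)} tiles in {elapsed:.2f}s")  # overlapped, well under 6 * 0.05s
```

This is essentially threading with asyncio syntax, so it shouldn't be faster than ThreadPoolExecutor by itself, but the trace should at least show the reads overlapping.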
tl;dr: Use GDAL 2.4 and all your problems will be solved
cc @geospatial-jeff
vrt.read dropped down to 60-120ms (from 1.7s) just by switching back to GDAL 2.4. This is without using asyncio.
Also, I found that exporting a @2x JPEG takes ~65ms, while a @2x PNG takes ~400ms, and the JPEG is additionally ~100KB instead of ~500KB for the PNG, so it's a win-win.
The following is a description of a possible bottleneck and a proposal which, if I'm right, could shave 9 seconds off of tile merging in my described use case.
I've spent a lot of the evening reading cogeo*/tiler* code, and I think I may have discovered the largest current bottleneck, at least with my dataset: a full mercator tile's data bitmap and mask is created for each asset, regardless of the amount of overlap. I think that's a big performance hit when you have several underlying images. In my mosaic_tiler profiling towards the bottom of this post, you can see that the biggest time hit is vrt.read, i.e. creating the data arrays. It's ~1.8s per asset, regardless of the amount of overlap.

Here's an example of where that really hurts. For a web mercator tile at zoom 12 covering West Hollywood (blue), there are six assets that are required to load (brown). Despite using a simpler pixel selection method, i.e. FirstMethod, all 6 assets need to be loaded and parsed, even though only 1 covers a majority of the tile.

Note, however, that since there's essentially zero overlap between assets, if rasterio windows could be used to read only the subset of the tile with valid source data, and if vrt.read is linear with the amount of pixels generated, then you could potentially save 9 seconds, or 83% on this tile load. (I.e. there are currently 5 extra sets of 512x512 vrt.read operations beyond the one that's needed, and 1.8s * 5 = 9.)

6 underlying assets is the median for my MosaicJSON, so this is a very common occurrence. Here's the distribution of how many assets are in each quadkey:
Proposal
The slowest part of mosaic_tiler is creating data arrays that line up with the mercator tile for each underlying asset. I propose to use rasterio's windows to create a data array and mask using the minimal bounding box of valid source data within the mercator tile of interest.

Take my initial contrived/sketched example above, where an asset's bounds overlap just the top left of the mercator tile. In that case, since the overlap is rectangular in mercator coordinates, the window could read just the overlapping asset data and return an object like the following:

- data: a numpy ndarray of size 3 x overlapping mercator width x overlapping mercator height. Pixel values for each band of only the parts of the mercator tile where source data exists.
- mask: a numpy ndarray of size 1 x overlapping mercator width x overlapping mercator height. In the case where the overlap is rectangular in mercator coordinates, this would be entirely True.
- mercator_bounds: [0, 400, 100, 512]. The lower left and upper right corners in mercator tile coordinates. This would be used in the pixel merging code once each tile's data has loaded.

I believe finding this intersection of valid data would be possible by intersecting the georeferenced bounds of the asset with the bounds of the mercator tile.
In the more general case (e.g. of Landsat data), where the intersection of the image scene and the mercator tile is not rectangular in mercator coordinates, you could still take the minimal intersecting bounding box in mercator coordinates. In this case, the mask is used to filter out the pixels without source data.

In addition to potentially taking much less time, it would take much less memory and could run on a smaller Lambda instance.
Code to update

- _tile_read in rio_tiler.utils: add the option to return a rectangular subset of the desired mercator tile.
- tile in rio_tiler.main: find the intersection between the bounds of the asset and the bounds of the mercator tile, and pass that to _tile_read. Note that cogeo-mosaic-tiler would also need those bounds returned from the function, so for backwards compatibility, creating a separate function might be better.
- rio-tiler-mosaic's pixel selection methods: work with data arrays of subsets of the tile, instead of full arrays.

I'd be happy to submit PRs for these, because if I'm right it could make cogeo-mosaic-tiler really fast.

Profiling
Profiling is done using AWS X-Ray, inserting custom timing sections. I changed mosaic_tiler to run single-threaded for profiling, but it ran about the same speed as when it was using ThreadPoolExecutor.

Here's a profile of mosaic_tiler running for a mercator tile with four assets. Each asset takes a total of 2.5 seconds to load, with ~1.8 seconds of that just in vrt.read here.