kylebarron commented 4 years ago

The following is a description of a possible bottleneck and a proposal which, if I'm right, could shave 9 seconds off of tile merging in my described use case.

I've spent a lot of the evening reading cogeo*/tiler* code, and I think I may have discovered the largest current bottleneck, at least with my dataset: a full mercator tile's data image bitmap and mask is created for each asset, regardless of the amount of overlap.

I think that's a big performance hit when you have several underlying images. In my mosaic_tiler profiling towards the bottom of this post, you can see that the biggest time hit is vrt.read, i.e. creating the data arrays. It's ~1.8s per asset, regardless of the amount of overlap.

Here's an example of where that really hurts. For a web mercator tile at zoom 12 covering West Hollywood (blue), there are six assets that are required to load (brown). Despite using a simpler pixel selection method, i.e. FirstMethod, all 6 assets need to be loaded and parsed, even though only 1 covers a majority of the tile.

Note, however, that since there's essentially zero overlap between assets, if rasterio windows could be used to only read the subset of the tile with valid source , and if vrt.read is linear with the amount of pixels generated, then you could potentially save 9 seconds, or 83% faster on this tile load. (I.e. there are currently 5 extra sets of 512x512 vrt.read operations beyond the one that's needed, and 1.8s * 5 = 9.)

6 underlying assets is the median for my MosaicJSON, so this is a very common occurrence. Here's the distribution of how many assets are in each quadkey:

count    142516.000000
mean          5.508160
min           1.000000
25%           4.000000
50%           6.000000
75%           6.000000
max          18.000000

Proposal

The slowest part of mosaic_tiler is creating data arrays that line up with the mercator tile for each underlying asset. I propose to use rasterio's windows to create a data array and mask using the minimal bounding box of valid source data within the mercator tile of interest.

Take my intial contrived/sketched example above, where an asset's bounds overlap just the top left of the mercator tile. In that case, since the overlap is rectangular in mercator coordinates, the window could read just the overlapping asset data and return an object like the following:

data: a numpy ndarray of size 3 x overlapping mercator width x overlapping mercator height. Pixel values for each band of only the parts of the mercator tile where source data exists.
mask: a numpy ndarray of size 1 x overlapping mercator width x overlapping mercator height. In the case where the overlap is rectangular in mercator coordinates, this would be entirely True.
mercator_bounds: [0, 400, 100, 512]. The lower left and upper right corners in mercator tile coordinates. This would be used in the pixel merging code once each tile's data has loaded.

I believe finding this intersection of valid data would be possible by intersecting the georeferenced bounds of the asset with the bounds of the mercator tile.

In the more general case (e.g. of Landsat data), where the intersection of the image scene and the mercator tile is not rectangular in mercator coordinates, you could still take the minimal intersecting bounding box in mercator coordinates. In this case, the mask is used to filter out the pixels without source data.

In addition to potentially taking much less time, it would take much less memory and could be run on a smaller lambda instance.

Code to update

Update _tile_read in rio_tiler.utils with the option to return a rectangular subset of the desired mercator tile.
Update tile in rio_tiler.main to find the intersection between the bounds of asset and the bounds of the mercator tile, and pass that to _tile_read. Note that cogeo-tiler-mosaic would also need those bounds returned from the function, so for backwards compatibility, creating a separate function might be better.
Update rio-tiler-mosaic's pixel selection methods to work with data arrays of subsets of the tile, instead of full arrays.

I'd be happy to submit PR's for these, because if I'm right it could make cogeo-mosaic-tiler really fast.

Profiling

Profiling is done using AWS X-Ray, inserting custom timing sections. I changed mosaic_tiler to run single-threaded for profiling, but it ran about the same speed as when it was using ThreadPoolExecutor.

Here's a profile of mosaic_tiler running for a mercator tile with four assets. Each asset takes a total of 2.5 seconds to load, with ~1.8 seconds of that just in vrt.read here.

vincentsarago commented 4 years ago

Thanks @kylebarron this is really (really) appreciated.

Earlier last month I started to run some benchmark (using pytest-becnhmark) and also profile some function to see where we will be able to win some seconds because yes when using multiple assets for a tile it can be slow.

e.g. I found that using multithreading (even with only one concurrent thread) was way slower than using a simple function call. But when having more the 3 assets it was then faster (but you still rely on the internet connection speed).

About full tile/partial tile, I'm not convinced this will make a difference to be honest (But I'll love to be wrong). In rio-tiler, we added benchmark, where we test multiple data type, nodata, but also difference between full and partial tile (https://github.com/cogeotiff/rio-tiler/blob/v2/tests/benchmarks/__init__.py#L62-L95)

The results (not sure if you can see the numbers in https://app.circleci.com/pipelines/github/cogeotiff/rio-tiler/18/workflows/431270d7-1e33-4681-ac1d-d1f094096b29/jobs/1017) shows that partial read is faster then full read (while they both return 256x256 arrays.

---------------------------- benchmark 'boundless tile ': 32 tests -----------------------------
Name (time in ms)         Min                Max               Mean             Median          
------------------------------------------------------------------------------------------------
int8-none             10.9991 (1.0)      15.2086 (1.0)      11.6661 (1.0)      11.2810 (1.00)  

------------------------------- benchmark 'full tile ': 32 tests -------------------------------
Name (time in ms)         Min                Max               Mean             Median          
------------------------------------------------------------------------------------------------
uint8-none            12.7991 (1.0)      19.5376 (1.12)     13.8260 (1.01)     13.1201 (1.0)

GDAL is smarter enough to fetch only what it needs thus, partial tile should always be faster then full tile. After yes it means we need to deal with multiple 256x256 arrays but I don't think this slows downs the process is a matter of seconds (I could be wrong).

Other note:

with rio-tiler v2.0 we are using a new calculate_default_transform (from TerraCotta) which should improve the performance in get_vrt_transform.

I'm also surprised vrt.read takes almost the same tile and I wonder if there is not something else going on (with GDAL or Rasterio)

kylebarron commented 4 years ago

About full tile/partial tile, I'm not convinced this will make a difference to be honest (But I'll love to be wrong)

Well given the simple benchmark in https://github.com/cogeotiff/rio-tiler/pull/168, I'm more likely to be wrong. 😞

When I saw that it was taking 1.7s to read every asset, I assumed that it was due to the number of pixels read/generated, but it appears that's a weak correlation.

Most likely, you're right that I underestimated GDAL. It does make sense that GDAL would optimize internally to fill nodatas quickly when it knows it's reading outside the bounds of the source image.

we test multiple data type, nodata, but also difference between full and partial tile

Is full full overlap between the mercator tile and the asset, mask is no overlap, and boundless is partial overlap?

I'm also surprised vrt.read takes almost the same tile and I wonder if there is not something else going on (with GDAL or Rasterio)

I'm still stumped about that. It seems to me like 1.7 seconds on a call to vrt.read is an awfully long time. Especially when my local tests were ~ 26.8 ms.

I also tested the same simple read benchmark in https://github.com/cogeotiff/rio-tiler/pull/168 on a small EC2 instance with 1 CPU in us-west-2 -- the same region as the COG images -- and set address to the S3 path, and read times were ~70ms!

In [8]: address = 's3://naip-visualization/ca/2018/60cm/rgb/34118/m_3411861_ne_11_060_20180723_20190208.tif'

In [9]: %timeit tile(address, x, y, z, tilesize)
72.1 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So why is that 20x slower on Lambda? 🤔

vincentsarago commented 4 years ago

Is full full overlap between the mercator tile and the asset, mask is no overlap, and boundless is partial overlap?

Full means mercator tile full overlapping, boundless means partial read (reading over the bounds of the file)

I'm still stumped about that. It seems to me like 1.7 seconds on a call to vrt.read is an awfully long time. Especially when my local tests were ~ 26.8 ms.

~Network + S3 + requester-pays bucket.~ (edits because you were testing the same file from ec2 and lambda, which should be similar then)

Also Caching might speedup things in EC2.

Can you wrap the rio-tiler call into a multithreading block on your ec2 ?

kylebarron commented 4 years ago

Also Caching might speedup things in EC2.

I think you're right that caching is explaining the difference here, but I don't know where the caching is happening. Is it on the EC2 side or the S3 side?

%timeit runs the provided code many times and averages among them. I came back after ~1 hour and ran it again with a single %time.

In [3]: %time data, mask = tile(address, x, y, z, tilesize)
CPU times: user 77 ms, sys: 8.43 ms, total: 85.4 ms
Wall time: 782 ms

Successive runs were much faster, and after a couple loads, when the COG is presumably fully-cached, down to double-digit ms.

In [24]: %time data, mask = tile(address, x, y, z, tilesize)
CPU times: user 35 ms, sys: 0 ns, total: 35 ms
Wall time: 38 ms

Multithreading

Testing the full tile response time for single threading vs multithreading, multithreading was slower for every tile. These are dev tools network screenshots for the tiles that load in the default view here in SF.

Single threaded, i.e.

    for asset in assets:
        t, m = tiler(asset, tile_x=tile_x, tile_y=tile_y, tile_z=tile_z, **kwargs)
        t = numpy.ma.array(t)
        t.mask = m == 0

        pixel_selection.feed(t)
        if pixel_selection.is_done:
            return pixel_selection.data

Multi threaded, i.e.

    with futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        future_tasks = [executor.submit(_tiler, asset) for asset in assets]

    for t, m in _filter_futures(future_tasks):
        t = numpy.ma.array(t)
        t.mask = m == 0

        pixel_selection.feed(t)
        if pixel_selection.is_done:
            return pixel_selection.data

I didn't realize it was possible to do this, but on the latter it still profiled the tile load within the threads (note: page is zoomed out, bars towards the bottom are misaligned). Calling vrt.read on each executor takes 7-8 seconds!

Here's the modified source I'm profiling with... The only call within the vrt.read named segment is vrt.read itself.

        xray_recorder.begin_subsegment('vrt.read')
        data = vrt.read(
            out_shape=out_shape,
            indexes=indexes,
            window=out_window,
            resampling=Resampling[resampling_method],
        )
        xray_recorder.end_subsegment()

The only reference within rasterio to multithreading that I could find is in reference to a CPU bound function. Not sure if it's possible to optimize for a network-bound function. I'll have to read up more on the difference between asyncio and multithreading to see if there could be some optimizations there.

geospatial-jeff commented 4 years ago

I love all of the benchmarks! I've been writing an asyncio based COG tiler and this thread inspired me to benchmark its performance on mosaic datasets against the current implementation.

The asyncio implementation is very similar to what currently happens in rio_tiler_mosaic, but rather than implementing concurrency through concurrency.futures.ThreadPoolExecutor, it wraps rio_tiler.main.tile in a dummy coroutine and uses await asyncio.gather(*coroutines).

For test data, I took the NAIP image @kylebarron has been using for his benchmarks (https://github.com/cogeotiff/rio-tiler/pull/168) and uploaded it to an S3 bucket using 50 different keys to replicate the workflow of requesting the same tile over many different images. This also prevents GDAL from caching the file handler during the benchmark.

Because utils.tile_read seems to be taking up the majority of the request, I only benchmarked the logic to map a tile read across an array of images using both methods. To keep things similar, I used an EC2 instance with a single cpu (t2 small) and used %timeit to record times (7 runs, 1 loop each).

I ran a few benchmarks:

Reading 5 tiles with concurrency of 1, 2, 3, 4, 5.
Reading 20 tiles with concurrency of 1, 5, 10, 15, 20.
Reading 50 tiles with concurrency of 1, 5, 10, 25, 50.

Here are the results in some very poorly formatted matplotlib diagrams:

n = 5

async_tiler_benchmark_n5

n = 20 (oops the title is wrong)

async_tiler_benchmark_n20

n = 50

async_tiler_benchmark_n50

Because the instance is pretty small, we pretty quickly become IO bound at higher concurrencies, here are the network logs from the instance:

A few reasons why the asyncio implementation outperforms:

Asyncio is really good at managing concurrent tasks which spend most of their time waiting on some precondition (in this case a network connection). By far the most expensive part of reading a tile from a COG is time to first bytes. While a thread is blocked, it will move on to do other work until it is unblocked (ex. read another tile). The performance increase becomes even more apparent across multiple tile requests (i.e. rendering tiles with a frontendlib) because threads will move on to other tile requests when blocked.
- Note: the time to first byte is the same reason why reading a partial tile from a COG is not as fast as you would expect when compared to the entire tile.
Both asyncio and threading have to work around the GIL, but asyncio does so more efficiently because its implemented in the framework (rather than the hardware). The result is less overhead spent fighting over threads and more time doing actual work.

kylebarron commented 4 years ago

Wow. I was about to give up after a full-day of research.

I'm impressed with the asyncio performance despite being "above" rasterio. I.e. I was thinking that to get performance gains you'd have to rewrite a portion of rasterio or GDAL to support async, and reading Even's thoughts made me pessimistic.

Edit: in particular, I'm curious how asyncio knows when rasterio is making a network request, and thus when to pass control back to the event loop.

I've never worked with asyncio in Python before, do you have a code snippet you can share that I can test instead of the current threading implementation?

I'd try to publish an asyncio version on lambda and see how the benchmarks go

kylebarron commented 4 years ago

I'm still working on trying to set up the coroutines correctly... This is my most recent effort but I don't think it's correctly returning the data

    async def tiler_coroutine(*args, **kwargs):
        return _tiler(*args, **kwargs)

    async def main():
        return await asyncio.gather(*[tiler_coroutine(asset) for asset in assets])

    tasks = asyncio.run(main())

vincentsarago commented 4 years ago

@kylebarron I realized that's there might be something else going on. Can you try setting GDAL_DISABLE_READDIR_ON_OPEN=TRUE here https://github.com/developmentseed/cogeo-mosaic-tiler/blob/master/serverless.yml#L48

kylebarron commented 4 years ago

It doesn't really seem like it makes a difference

Multithreaded, GDAL_DISABLE_READDIR_ON_OPEN set to YES

Multithreaded, GDAL_DISABLE_READDIR_ON_OPEN set to EMPTY_DIR

vincentsarago commented 4 years ago

Yes sorry I've hided the comment because after another check I figured it won't change anything 🤦‍♂

kylebarron commented 4 years ago

I seem to have "working" async code:

    async def tiler_coroutine(*args, **kwargs):
        try:
            return _tiler(*args, **kwargs)
        except Exception:
            return None

    async def main():
        return await asyncio.gather(*[tiler_coroutine(asset) for asset in assets])

    tasks = asyncio.run(main())
    tasks = [t for t in tasks if t is not None]

it works in the sense that it computes correctly and I can see the output image, but with the tracing I can see it's not concurrent:

So I'm not sure what I'm doing wrong. @geospatial-jeff any advice?

kylebarron commented 4 years ago

tl;dr: Use GDAL 2.4 and all your problems will be solved 😅

kylebarron commented 4 years ago

cc @geospatial-jeff

vrt.read dropped down to 60-120ms (from 1.7s) just by switching back to GDAL 2.4. This is without using asyncio.

Also, I found that exporting a @2x JPEG is about ~65ms, while a @2x PNG is about ~400ms, and the JPEG is additionally like ~100KB instead of ~500KB for the PNG, so it's a win-win.

developmentseed / cogeo-mosaic-tiler

Proposal for faster tile merging #4

Proposal

Code to update

Profiling

Multithreading

n = 5

n = 20 (oops the title is wrong)

n = 50