asascience-open / restful-grids

Exploring modern RESTful services for gridded data
https://asascience.github.io/restful-grids/
MIT License

Improve Regridding performance #17

Open mpiannucci opened 2 years ago

mpiannucci commented 2 years ago

The /image/tile API uses xESMF to regrid on the fly, which means the weights are recomputed every single time the /tile API is called. That is terrible for performance across the board.

Initial idea is to precompute the weights for every zoom level, then per request only apply the weights, reproject, and clip the tile. We could probably just cache the regridded values, but I'm trying to keep things as idempotent as possible.
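Roughly what I have in mind (a sketch only; `ds_source` and `target_grids` are hypothetical names):

```python
import xesmf as xe

# hypothetical: ds_source is the model dataset, target_grids maps
# tile zoom level -> an xarray Dataset describing the output grid
regridders = {
    level: xe.Regridder(ds_source, grid, "bilinear")  # weights computed once, up front
    for level, grid in target_grids.items()
}

def regrid_for_tile(level, da):
    # per-request work is now just applying the sparse weight matrix
    return regridders[level](da)
```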

abkfenris commented 2 years ago

How about caching the weights on first access for a level?

mpiannucci commented 2 years ago

> How about caching the weights on first access for a level?

Yeah, I think that's a good compromise. I am worried about multiple tiles from a single level being requested concurrently, so we probably need a mutex in there to avoid clashing. Might try that tonight.
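Something like this, maybe (a sketch; per-level locks so concurrent tile requests for one level don't both compute weights):

```python
import threading

_weight_cache: dict[int, object] = {}
_level_locks: dict[int, threading.Lock] = {}
_locks_guard = threading.Lock()

def _lock_for(level: int) -> threading.Lock:
    # one short critical section to hand out per-level locks safely
    with _locks_guard:
        return _level_locks.setdefault(level, threading.Lock())

def get_regridder(level: int, make_regridder):
    """Return the cached regridder for a zoom level, computing weights at most once."""
    with _lock_for(level):
        if level not in _weight_cache:
            _weight_cache[level] = make_regridder(level)  # expensive; runs once per level
        return _weight_cache[level]
```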

mpiannucci commented 2 years ago

So in trying this last night, I found that xESMF is very slow to generate weights for the regridder once you get past zoom level one. It wasn't as bad when we were originally regridding to only the tile extents.

I need to rethink the best way to manage this. ncWMS seems to be able to do this very fast, so I'll use it as inspiration.

abkfenris commented 2 years ago

Is it actually deferring the calculation? It might need an explicitly instantiated dask cluster in scope before it will defer work.
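For reference, explicitly creating a client looks like this (a sketch; the file and variable names are made up):

```python
from dask.distributed import Client
import xarray as xr

client = Client()  # local cluster; becomes the default scheduler while in scope
ds = xr.open_dataset("forecast.nc", chunks={"time": 1})  # chunks => dask-backed arrays

lazy = ds["temperature"].mean("time")  # builds a task graph, no work yet
values = lazy.compute()                # work happens here, on the cluster
```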

mpiannucci commented 2 years ago

Not sure. I have another idea using just reprojection that I am going to test.

jmunroe commented 2 years ago

I think @abkfenris is correct -- without a dask scheduler set up, the calculation will not be lazy.

mpiannucci commented 2 years ago

We can get away with just reprojecting with rioxarray, and it's a lot faster than regridding with xESMF.

[Screenshot: timing results, 2022-04-29]
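A sketch of the rioxarray approach (the CRS codes, file, and variable names are illustrative):

```python
import rioxarray  # noqa: F401 -- registers the .rio accessor on xarray objects
import xarray as xr
from rasterio.enums import Resampling

ds = xr.open_dataset("forecast.nc")  # hypothetical dataset
da = (
    ds["sea_surface_temperature"]
    .rio.set_spatial_dims(x_dim="lon", y_dim="lat")
    .rio.write_crs("EPSG:4326")
)
# reproject to web mercator for tiling; much cheaper than regenerating xESMF weights
tile_ready = da.rio.reproject("EPSG:3857", resampling=Resampling.bilinear)
```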

mpiannucci commented 2 years ago

^^ but that doesn't solve the longer-term issue with lazy loading, because rioxarray currently needs the whole dataset in memory to reproject.

So it's good enough to get data to the zarr pyramid endpoints as a POC, but it's not the long-term solution.

abkfenris commented 2 years ago

For checking if there are any dask clients that have been implicitly created, `distributed.client._global_client_index` should give a dictionary of any dask clients currently in scope (via https://github.com/pangeo-forge/pangeo-forge-recipes/pull/350/files#r861207572).
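e.g.:

```python
import distributed

# private internals, so subject to change between distributed versions
print(distributed.client._global_client_index)
```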

mpiannucci commented 2 years ago

So in testing, the bottleneck for xESMF is generating the weights, which cannot be parallelized with dask or otherwise when using xESMF. The actual regridding is fast enough once the weights are generated.

jmunroe commented 2 years ago

Are grid weights (from the model grid to some "common" coordinate system) something that should be precomputed and stored adjacent to the underlying dataset? Is that possible?
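For example, xESMF can serialize a regridder's weights to NetCDF, so a sidecar pattern like this seems mechanically possible (dataset and file names are illustrative):

```python
import xesmf as xe

# hypothetical: ds_model is the native model grid, ds_common the target grid
regridder = xe.Regridder(ds_model, ds_common, "bilinear")
regridder.to_netcdf("model_to_common_weights.nc")  # sidecar file next to the dataset

# later (or on another machine), skip weight generation entirely
regridder = xe.Regridder(
    ds_model, ds_common, "bilinear",
    reuse_weights=True, filename="model_to_common_weights.nc",
)
```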


abkfenris commented 2 years ago

While we could precompute weights ourselves, it would be nice not to have to define a single specific sidecar file. I'd still lean towards adding them to the cache.

How about adding a datasets/{dataset_id}/tree/cache route that fires off a background task to build up the weights and cache them? The tile endpoints could then either try to generate weights on the fly, or return an error telling the client to hit that endpoint first if the weights aren't cached. Something like the sketch below.
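A rough sketch on top of FastAPI's background tasks (`compute_and_cache_weights` and `weights_cached` are hypothetical helpers):

```python
from fastapi import APIRouter, BackgroundTasks, HTTPException

router = APIRouter()

@router.get("/datasets/{dataset_id}/tree/cache")
async def warm_weight_cache(dataset_id: str, background_tasks: BackgroundTasks):
    # hypothetical helper that builds weights for every zoom level and caches them
    background_tasks.add_task(compute_and_cache_weights, dataset_id)
    return {"status": "building weights", "dataset": dataset_id}

@router.get("/datasets/{dataset_id}/image/tile/{z}/{x}/{y}")
async def get_tile(dataset_id: str, z: int, x: int, y: int):
    if not weights_cached(dataset_id, z):  # hypothetical cache check
        raise HTTPException(
            status_code=409,
            detail=f"Weights not cached; hit /datasets/{dataset_id}/tree/cache first",
        )
    ...  # apply cached weights, reproject, clip, and render the tile
```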

Also on the cache side of things we might want to explore overriding the current xpublish.get_cache(). The current cache store is a dict, so it's confined to a single process (also we might want to explore running it in gunicorn to enable multiple accesses), but cachey.Cache's data store is pluggable. We could try something like redis_collections.Dict or other MutableMappings that could be swapped in place.