Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License

Dask delayed operations and Pint Quantity #996

huard opened this issue 5 years ago (status: Open)

huard commented 5 years ago

I haven't been able to find in the documentation whether MetPy supports delayed operations with dask. The code for unit conversion seems to access _data_array.values, which suggests that the entire array is loaded into memory. We have multi-GB files that require unit conversion, and ideally the converted DataArray would be lazily evaluated.
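A minimal sketch of the situation being described (file and variable names are hypothetical): a dask-chunked dataset stays lazy until something touches `.values`, at which point the whole array is materialized.

```python
import xarray as xr

# Opening with chunks keeps the data lazy (dask-backed)
ds = xr.open_dataset("multi_gb_file.nc", chunks={"time": 100})
tas = ds["air_temperature"]

print(type(tas.data))  # dask.array.core.Array -- still lazy

# Anything that reads .values (as the unit-conversion code appears to do)
# forces a full compute and pulls the entire array into memory
vals = tas.values       # equivalent to tas.data.compute()
```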

dopplershift commented 5 years ago

I haven't taken a look with dask yet, but your initial analysis seems about right unfortunately.

This might be out of our control due to Pint, but it's definitely something on my to-do list to look at. I can also see it being another reason to adjust how we handle the unit problem internally.

tjwixtrom commented 5 years ago

Upon attempting this with RH calculations over a large dataset in xarray (with dask enabled), I can confirm that calling MetPy does load the arrays into memory. Dask still allows parallel computation on chunks, which keeps RAM usage from running away, and performance is acceptable, but the lazy evaluation stops at the point of calling metpy.calc. The temporary workaround is to do all subsetting operations before any MetPy computations (see the sketch below).
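A hedged sketch of that workaround, assuming the file and variable names shown here (they are hypothetical) and that the variables carry units metadata MetPy can interpret: subset while the data are still lazy, and only then hand the much smaller arrays to metpy.calc, which will load what it is given.

```python
import xarray as xr
import metpy.calc as mpcalc

# Lazy, dask-backed open
ds = xr.open_dataset("large_model_output.nc", chunks={"time": 24})

# Subset lazily first -- no data read yet
subset = ds.sel(lat=slice(50, 30), lon=slice(260, 290)).isel(time=slice(0, 24))

# metpy.calc materializes its inputs, so only the subset is loaded into memory
rh = mpcalc.relative_humidity_from_dewpoint(
    subset["temperature"], subset["dewpoint"]
)
```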

huard commented 5 years ago

We've worked around this issue by calling units.convert instead of using the .to() method.

dopplershift commented 5 years ago

So you're saying .to() breaks the parallelism but units.convert() doesn't?

huard commented 5 years ago

It's probably not that straightforward, and it's been a while, but I think that with .to() we had to copy the input array, then take the output of .to() and change the values in place. I don't quite remember why we needed that copy, but the copy was the culprit, not the conversion itself (see https://github.com/Ouranosinc/xclim/pull/156/files).
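A rough sketch of the two approaches being contrasted (this is not the exact xclim code): wrapping the data in a Quantity and calling .to() produces a new array whose values then have to be copied back into the DataArray, while the registry's convert() operates directly on the underlying array without the intermediate Quantity.

```python
import xarray as xr
from metpy.units import units  # a pint UnitRegistry

da = xr.DataArray([280.0, 290.0, 300.0], attrs={"units": "kelvin"})

# Approach 1: Quantity.to() -- build a Quantity, convert, copy the values back
q = units.Quantity(da.values, da.attrs["units"]).to("degC")
out1 = da.copy(data=q.magnitude)
out1.attrs["units"] = "degC"

# Approach 2: registry.convert() -- convert the raw values directly
out2 = da.copy(data=units.convert(da.data, da.attrs["units"], "degC"))
out2.attrs["units"] = "degC"
```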

jthielen commented 4 years ago

Just an update from upstream: Pint v0.10 (to be released in the next week or so) will have preliminary support for wrapping Dask arrays. However, https://github.com/dask/dask/issues/4583 is holding up full compatibility and the ability to put together a robust set of integration tests, so there will likely be issues remaining (such as non-commutativity and Dask mistakenly wrapping Pint).

So, from MetPy's point-of-view, I think it would be good to start some early experiments with Dask support in calculations, but it won't be ready for v1.0?
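A hedged illustration of the wrapping-order problem mentioned above: with the linked Dask issue unresolved, which object ends up "on the outside" can depend on operand order, so the two products below may come back as different types (a Quantity wrapping a Dask array versus a Dask array mistakenly wrapping a Quantity).

```python
import dask.array as da
from pint import UnitRegistry

ureg = UnitRegistry()
darr = da.ones((4, 4), chunks=2)
q = 2.0 * ureg.meter

left = q * darr    # Pint wraps the Dask array (the desired behavior)
right = darr * q   # Dask may end up wrapping the Pint Quantity instead
print(type(left), type(right))
```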

dopplershift commented 4 years ago

That seems about right. I think overall full support will be something we look at beyond the GEMPAK work.

jthielen commented 4 years ago

Leaving a note here for future Dask compatibility work: the window smoother added in #1223 explicitly casts to ndarray, which prevents Dask compatibility for that smoother (and dependent smoothers like circular, rectangular, and n-point) (see https://github.com/Unidata/MetPy/pull/1223#discussion_r366079624).
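A small illustration of why an explicit cast to ndarray blocks Dask compatibility: calling np.asarray() on a dask array forces the whole thing to be computed and copied into memory, so any downstream laziness is lost.

```python
import numpy as np
import dask.array as da

lazy = da.random.random((1000, 1000), chunks=(250, 250))
eager = np.asarray(lazy)   # triggers compute(); result is a plain ndarray
print(type(lazy), type(eager))
```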