Unidata / MetPy

MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.
https://unidata.github.io/MetPy/
BSD 3-Clause "New" or "Revised" License

Support for Dask Arrays #1479


rpmanser commented 4 years ago

As pieces for compatibility between Dask, Xarray, and Pint are coming together (e.g., xarray#4221, pint#1129, pint#1151, https://github.com/dask/dask/pull/6393, among others that I likely missed), we should begin keeping track of progress for Dask Array support in MetPy. This might help address #996, or the previously mentioned PRs may already allow for it.

I started testing support for Dask Arrays in #1388, but that work should probably be separated from that PR. Most functions in the basic.py module just work with Pint-wrapped Dask Arrays without extra effort, but there are some exceptions: wind_direction, heat_index, apparent_temperature (fixing heat_index should fix this one too), and pressure_to_height_std. The first three fail because boolean indexing produces arrays of ambiguous size. For example, in wind_direction:

https://github.com/Unidata/MetPy/blob/071c700aa05b1962b8c598d3b999ae5657944cc8/src/metpy/calc/basic.py#L101-L103

mask = wdir <= 0 produces a boolean array of undetermined size, which in turn gives the indexed Dask Array an axis of unknown length, reported as nan, e.g., (50, nan, 100). This is expected Dask behavior AFAIK. That should be pretty easy to fix with an np.where.
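
For concreteness, the kind of rewrite that would mean (a hypothetical sketch, not the actual MetPy code):

import dask.array as da
import numpy as np

wdir = da.asarray([-45., 0., 90., 270.])

# Boolean-mask assignment (wdir[wdir <= 0] += 360.) yields chunks of
# unknown size; np.where keeps every dimension fully determined and
# dispatches to dask.array.where automatically.
wdir = np.where(wdir <= 0., wdir + 360., wdir)
print(wdir.compute())  # [315. 360.  90. 270.]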

I haven't dug into why pressure_to_height_std doesn't work. I'll see if I can do that soon.

Another problem that is likely to come up, but I haven't dug into yet, involves scipy functions. I'm not sure how these will interact with Dask Arrays but there may need to be some special handling for that. Specifically I'm thinking of any vertical profile calculations (e.g., CAPE)... chunking is going to be very important for those.

Some discussion points:

- To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.
- Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.
- Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible?

cc: @jthielen @tjwixtrom

dopplershift commented 4 years ago

Thanks for starting the issue. A few thoughts I have right now:

> To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.

metpy.calc and metpy.interpolation seem like natural candidates. I don't think plots and io need specific support, but I'm open to being convinced otherwise.

> Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.

I'm okay with making changes to support this. What we should do then is document in the developer's guide the subset of the numpy and/or xarray API we're allowed to use.

> Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible?

Everything is always up for discussion, but my starting point is that I don't think MetPy should ever have a Dask dependency. It should work like what we're aiming for with CartoPy: if you use it, we'll play along (i.e., do operations that work with Dask).

> Another problem that is likely to come up, but I haven't dug into yet, involves scipy functions.

For this and the previous item, we may be able to hang back ever so slightly and surf the wave of changes coming in various NEPs and other things that are trying to help solve these and other issues (JAX/PyTorch/TensorFlow arrays) elsewhere.

I think this is going to require a very pragmatic and use-case-driven approach. I don't want to put off useful things forever (e.g. dask-powered gridded cape); I also don't want to compromise any of our simplicity (for our users) in the name of supporting dask. I also don't want to invest significant effort in certain areas if waiting will solve them for us.

Just the thoughts that came to mind for now. Don't consider any of it set in stone.

jthielen commented 4 years ago

Great summary @rpmanser! I added https://github.com/dask/dask/pull/6393 to your list of upstream PRs, which looks to be merged very soon.

Here are my thoughts on your discussion points at the moment (many of which overlap with @dopplershift's comment that came up right before I posted mine):

I think MetPy should take a progressive approach to Dask Array support, starting with the easy options (e.g., scalar thermo calculations) and gradually moving towards full (or near full) support in the long-term. I would love to eventually be able to do everything in MetPy with xarray > Pint > Dask data structures, but this would have a rather long time horizon and would intersect with issues in the broader community (rechunking; generic non-local calculations like interpolation, smoothing, derivatives, and integration; SciPy support; etc.).

That being said, I don't think Dask should ever become a required dependency of MetPy. I think it should always be possible to avoid unconditional use of Dask functions, albeit at the cost of increased complexity. But, the trade-off (complexity vs. supporting both Dask and non-Dask use) is worth it in my mind.
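
For illustration, the pattern I have in mind is duck typing plus NumPy's dispatch protocol (NEP 18's __array_function__), so dask is never imported unconditionally (a hypothetical sketch, not anything in MetPy):

import numpy as np

def is_duck_dask_array(obj):
    # Duck-typed check for a dask array; works even if dask isn't installed.
    return hasattr(obj, 'dask') and hasattr(obj, 'compute')

def clip_negative(values):
    # A plain NumPy call: NEP 18 dispatch hands this off to
    # dask.array.where automatically when `values` is a dask array,
    # with no dask import anywhere in this module.
    return np.where(values < 0, 0, values)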

jthielen commented 4 years ago

Also, if I had to guess, your issue with pressure_to_height_std is due to type hierarchy issues. See if it is resolved with Dask master after https://github.com/dask/dask/pull/6393 is merged.

dopplershift commented 4 years ago

Just to clarify something I said: complexity is always an option in terms of MetPy's implementation; we just don't want to present any more complexity to the users than is absolutely necessary. This is also why we need a good test suite covering the options. 😉

rpmanser commented 4 years ago

Great, this is in line with what I was thinking. Avoiding Dask as a required dependency, starting simple and working towards more complex issues, and not compromising simplicity are generally good ideas. Once these PRs are in I'll start resolving issues in basic.py, then we can go from there.

jthielen commented 4 years ago

With https://github.com/dask/dask/pull/6393 being merged, it looks like just https://github.com/pydata/xarray/pull/4221 is left for upstream PRs here (and that one is approved and awaiting merge). This is getting close!

Update: not even a half-hour later and https://github.com/pydata/xarray/pull/4221 has been merged!

raybellwaves commented 3 years ago

I'm interested in something similar to the ideas proposed in this thread. I can move this out to a separate issue if desired.

I would like to drop in metpy.calc.wind_speed in place of xarray.ufuncs.sqrt in my workflow. I think this is in line with the ideas discussed here.

import xarray as xr
import xarray.ufuncs as xu
import metpy
import metpy.calc

ds = xr.open_dataset('https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best')
ds = ds.metpy.parse_cf()

u10 = ds['u-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)
v10 = ds['v-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)

ws10 = xu.sqrt(u10 ** 2 + v10 ** 2)     # current approach
ws10 = metpy.calc.wind_speed(u10, v10)  # desired drop-in replacement

rpmanser commented 3 years ago

@raybellwaves I ran your example and it does not seem that your data are represented as dask arrays. Were you intending for them to be? I'm not as familiar with THREDDS as I ought to be, but my understanding is that the data it downloads is loaded into memory. If that is the case, then dask arrays may not be helpful, since part of their purpose is to defer loading data into memory until .compute() is called.

Maybe someone more knowledgeable about THREDDS can help here...

dopplershift commented 3 years ago

Here it's using OPeNDAP, so it should be possible to only download data on demand. However, I'm not really clear on how to engage dask with a single dataset loaded with OPeNDAP. Or maybe it depends on the native chunking in the dataset?

jthielen commented 3 years ago

I unfortunately don't have time to help investigate here right now (have you tried specifying chunks in the open_dataset call?), but chunks via OPeNDAP reminded me of this issue, which may be of interest: https://github.com/pydata/xarray/issues/1440.
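
i.e., something along these lines (untested here, and the chunk spec is purely illustrative):

import xarray as xr

# chunks= asks xarray to back each variable with a dask array, so
# OPeNDAP reads happen lazily, chunk by chunk, only when values are
# actually needed.
ds = xr.open_dataset(
    'https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best',
    chunks={'time': 1},  # dimension name and size are just illustrative
)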

raybellwaves commented 3 years ago

> but my understanding is that the data it downloads is loaded into memory.

Ah, thanks. My question is probably more suited to xarray-metpy integration rather than dask-metpy integration.

rpmanser commented 3 years ago

Despite discussion of using other libraries to speed up MetPy, I would still like to see dask supported. I think most of the points brought up here still stand. Since #1388 is an absolute beast, it may be best to set it aside and create tests only for dask arrays. I'm not sure what that would look like, so I'd like to know if anyone has thoughts on it. A simple approach might be to add a test against dask for each function, module by module, but that would bloat the test suite quite quickly. Is there a better solution?

dopplershift commented 3 years ago

@rpmanser Well, while we might not be looking to use Dask internally as a performance solution, we definitely do want to support its use for others who have gone to the effort of setting up their data to do so.

What I wonder is if we could do something like (completely untested code):

import dask.array as da
import numpy as np
import pytest

parameterize_array_types = pytest.mark.parametrize('array_type', [np.array, da.asarray, np.ma.array])

@parameterize_array_types
def test_calc(array_type):
    # data, truth_data, calc, units, and assert_almost_equal are
    # placeholders for whatever the real test provides
    test_input = units.Quantity(array_type(data), 'myunits')
    truth = units.Quantity(array_type(truth_data), 'newunits')

    result = calc(test_input)
    assert_almost_equal(result, truth)

I haven't thought through what the challenges with that would be, but I think it would allow us to start testing things like Dask without copy-pasting a whole bunch of tests?

rpmanser commented 3 years ago

@dopplershift that seems pretty straightforward and looks like it will work. I'll give it a try on the basic module in calc soon, then we can go from there.

lpilz commented 2 years ago

So I am running into a related issue (or maybe even the same one?). When calculating the wind direction from two 3D dask arrays (U10 and V10, shape (169, 185, 200)), I get an IndexError here:

https://github.com/Unidata/MetPy/blob/026778cccbb8e9fc8e9154bb2b592bf1fff22206/src/metpy/calc/basic.py#L104

Incorrect shape ((169, 185, 200)) of integer indices for dimension with size 169

I am seeing that wdir[mask] looks like dask.array<getitem, shape=(nan,), dtype=float32, chunksize=(nan,), chunktype=numpy.ndarray> degree, with shape (nan,), so maybe this is the problem?

dopplershift commented 2 years ago

@lpilz The problem is that dask doesn't support the following in general:

a[a > 2] += 4

I haven't dug in to figure out why it doesn't or what it would take. I'm open to reworking our implementation of wind_direction, but I'm not sure how you'd efficiently do that operation without boolean masking.

lpilz commented 2 years ago

@dopplershift Thanks for explaining, that clarifies things. Could one maybe turn it on its head and do something like the following, or is the problem that bool arrays don't work?

a += (a > 2)*4
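
A quick toy check suggests the arithmetic form stays lazy and keeps chunk sizes known (untested against MetPy itself):

import dask.array as da

a = da.asarray([1., 3., 5.])
a += (a > 2) * 4  # pure elementwise arithmetic, no boolean indexing
print(a.compute())  # [1. 7. 9.]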

dopplershift commented 2 years ago

@lpilz Honestly, I have no idea what the problem is. If a dask user with a vested interest :wink: were to submit a PR with that change and it didn't break any tests (and fixed dask) then I'd be willing to merge it.

sgdecker commented 1 year ago

Not sure if this should go here or in a new issue (or even if this is an issue), but with this code:

import dask.array as da
import metpy.calc as mpcalc
from metpy.units import units

p = da.asarray([1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]) * units('hPa')
T = 20 * units('degC')
Td = 18 * units('degC')
prof = mpcalc.parcel_profile(p, T, Td)

I get ominous warnings:

/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.copyto` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.place` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
(the same pair of warnings repeats twice more)

As far as I can tell, MetPy does not directly call the functions in question, but it would certainly be nice if this code did not stop working in a future release.

dopplershift commented 1 year ago

@sgdecker I think it's good to capture it here. This particular function is high on our list to make sure it works with Dask for our current funded work on MetPy.

Out of curiosity, what's the scale of the data you're using for your "real" problem? The example above with a single profile isn't really doing much with Dask.

sgdecker commented 1 year ago

@dopplershift Yes, this is scaled down for example purposes. The real data is climate model output, so we are looking at something on the order of 10,000 grid points and thousands of time steps.

dopplershift commented 1 year ago

@sgdecker I'll be sure to reach out when we have something, then. That's literally one of the core workflows/use cases that we're trying to support with the funded work we have.

rbavery commented 1 month ago

Hello! Curious if there is an update on Dask support?

dopplershift commented 1 month ago

Nothing really new to report. I'm curious if you've tried using dask with MetPy and run into any calculations in particular that cause you problems?

rbavery commented 4 weeks ago

We haven't started anything yet, but we might use MetPy to retrieve NEXRAD data, filter for specific variables, convert from polar to geographic coordinates, calculate variables with MetPy, and write the results to Zarr. It seems like one of the more user-friendly packages for reading Level 2 or Level 3 data.

dopplershift commented 4 weeks ago

You might want to consider xradar for that, at least for Level 2, since it reads NEXRAD Level 2 data directly into xarray.

rbavery commented 4 weeks ago

Thank you! Will take a look.