rpmanser opened this issue 4 years ago
Thanks for starting the issue. A few thoughts I have right now:
To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.
metpy.calc and metpy.interpolation seem like natural candidates. I don't think plots and io need specific support, but I'm open to being convinced otherwise.
Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.
I'm OK with making changes to support it. What we should do then is document in the developer's guide the subset of the numpy and/or xarray API we're allowed to use.
Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible?
Everything is always up for discussion, but my starting point is that I don't think MetPy should ever have a dask dependency. It should work like we want to get to with CartoPy: if you use it, we'll play along (i.e., do operations that work with dask).
Another problem that is likely to come up, but I haven't dug into yet, involves scipy functions.
For this and the previous item, we may be able to hang back ever so slightly and surf the wave of changes coming in various NEPs and other things that are trying to help solve these and other issues (JAX/PyTorch/TensorFlow arrays) elsewhere.
I think this is going to require a very pragmatic and use-case-driven approach. I don't want to put off useful things forever (e.g. dask-powered gridded cape); I also don't want to compromise any of our simplicity (for our users) in the name of supporting dask. I also don't want to invest significant effort in certain areas if waiting will solve them for us.
Just the thoughts that came to mind for now. Don't consider any of it set in stone.
Great summary @rpmanser! I added https://github.com/dask/dask/pull/6393 to your list of upstream PRs, which looks to be merged very soon.
Here are my thoughts on your discussion points at the moment (many of which overlap with @dopplershift's comment that came up right before I posted mine):
I think MetPy should take a progressive approach to Dask Array support, starting with the easy options (e.g., scalar thermo calculations) and gradually moving towards full (or near full) support in the long-term. I would love to eventually be able to do everything in MetPy with xarray > Pint > Dask data structures, but this would have a rather long time horizon and would intersect with issues in the broader community (rechunking; generic non-local calculations like interpolation, smoothing, derivatives, and integration; SciPy support; etc.).
That being said, I don't think Dask should ever become a required dependency of MetPy. I think it should always be possible to avoid unconditional use of Dask functions, albeit at the cost of increased complexity. But, the trade-off (complexity vs. supporting both Dask and non-Dask use) is worth it in my mind.
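One way to read "avoid unconditional use of Dask functions" is to lean on NumPy's dispatch protocols. Here is a minimal, hypothetical sketch (wind_speed_like is a made-up name, not MetPy's API) of a calculation that works on plain ndarrays and, when the caller passes them, on dask arrays, without ever importing dask:

```python
import numpy as np

def wind_speed_like(u, v):
    """Hypothetical calculation written only against NumPy's public API.

    np.hypot dispatches through __array_ufunc__, so if u and v are dask
    arrays the result is a lazy dask array; code like this never needs
    to import dask itself.
    """
    return np.hypot(u, v)

print(wind_speed_like(np.array([3.0, 6.0]), np.array([4.0, 8.0])))  # [ 5. 10.]
```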
Also, if I had to guess, your issue with pressure_to_height_std is due to type hierarchy issues. See if it is resolved with Dask master after https://github.com/dask/dask/pull/6393 is merged.
Just to clarify something I said: Complexity is always an option in terms of MetPy implementation; we just don't want to present any more complexity to the users than is absolutely necessary. This is also why we need a good test suite of the options. 😉
Great, this is in line with what I was thinking. Avoiding Dask as a required dependency, starting simple and working towards more complex issues, and not compromising simplicity are generally good ideas. Once these PRs are in I'll start resolving issues in basic.py, then we can go from there.
With https://github.com/dask/dask/pull/6393 being merged, it looks like just https://github.com/pydata/xarray/pull/4221 is left for upstream PRs here (and that one is approved and awaiting merge). This is getting close!
Update: not even a half-hour later and https://github.com/pydata/xarray/pull/4221 has been merged!
I'm interested in something similar to the ideas proposed in this thread. I can move this out to a separate issue if desired.
I would like to drop in metpy.calc.wind_speed instead of xarray.ufuncs.sqrt in my workflow. I think this is in line with the ideas discussed here.
import xarray as xr
import xarray.ufuncs as xu  # note: xarray.ufuncs has since been deprecated
import metpy
import metpy.calc

ds = xr.open_dataset('https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best')
ds = ds.metpy.parse_cf()
u10 = ds['u-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)
v10 = ds['v-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)

# Current approach:
ws10 = xu.sqrt(u10 ** 2 + v10 ** 2)
# Desired drop-in replacement:
ws10 = metpy.calc.wind_speed(u10, v10)
@raybellwaves I ran your example and it does not seem that your data are represented as dask arrays. Were you intending for them to be? I'm not as familiar with THREDDS as I ought to be, but my understanding is that the data it downloads is loaded into memory. If that is the case, then dask arrays may not be helpful, since part of their purpose is to load data into memory only when .compute() is called.
Maybe someone more knowledgeable with THREDDS can help here...
Here it's using OPeNDAP, so it should be possible to only download data on demand. However, I'm not really clear on how to engage dask with a single dataset loaded with OPeNDAP. Or maybe it depends on the native chunking in the dataset?
I unfortunately don't have time to help investigate here right now (have you tried specifying chunks in the open_dataset call?), but chunks via OPeNDAP reminded me of this issue, which may be of interest: https://github.com/pydata/xarray/issues/1440.
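For anyone landing here later, the chunks mechanism itself is easy to demo locally. This is an untested-against-THREDDS sketch using a tiny in-memory Dataset (the variable names are invented), but open_dataset(url, chunks={...}) works the same way:

```python
import numpy as np
import xarray as xr

# Build a small in-memory Dataset so the example is self-contained;
# with a remote file you would pass chunks= to xr.open_dataset instead.
ds = xr.Dataset({'u10': (('time', 'lat'), np.arange(12.0).reshape(4, 3))})

# .chunk() wraps the variable in a lazy dask array, one chunk per time
# step here, mirroring xr.open_dataset(url, chunks={'time': 1}).
ds_lazy = ds.chunk({'time': 1})

print(type(ds_lazy['u10'].data).__name__)  # dask's Array class
```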
but my understanding is that the data it downloads is loaded into memory.
Ah thanks. My question is probably more suited to xarray-MetPy integration rather than dask-MetPy integration.
Despite discussion of using other libraries to speed up MetPy, I would still like to see dask supported. I think most of the points brought up here haven't changed. Since the idea behind #1388 is an absolute beast, it may be best to set that aside and create tests only for dask arrays. I'm not sure what that would look like, so I'd like to know if anyone has thoughts on it. A simple approach might be to just add a test against dask for each function, module by module, but that will bloat up the test suite quite quickly. Is there a better solution?
@rpmanser Well, while we might not be looking to use Dask internally as a performance solution, we definitely want to support its use for others who have gone to the effort of setting up their data to do so.
What I wonder is if we could do something like (completely untested code):
import pytest

parameterize_array_types = pytest.mark.parametrize(
    'array_type', [np.array, dask.array.asarray, np.ma.array])

@parameterize_array_types
def test_calc(array_type):
    test_input = units.Quantity(array_type(data), 'myunits')
    truth = units.Quantity(array_type(truth_data), 'newunits')
    result = calc(test_input)
    assert_almost_equal(result, truth)
I haven't thought through what the challenges with that would be, but I think that would allow us to start testing things like Dask without copy-pasting a whole bunch of tests?
@dopplershift that seems pretty straightforward and looks like it will work. I'll give that a try on the basic module in calc soon then we can go from there.
So I am running into a related issue (or maybe even the same?). When calculating the wind direction from two 3D dask arrays (U10, V10, shape: (169, 185, 200)), I get an IndexError here:
Incorrect shape ((169, 185, 200)) of integer indices for dimension with size 169
I am seeing that wdir[mask] looks like dask.array<getitem, shape=(nan,), dtype=float32, chunksize=(nan,), chunktype=numpy.ndarray> degree with shape (nan,), so maybe this is the problem?
@lpilz The problem is that dask doesn't support the following in general:
a[a > 2] += 4
I haven't dug in to figure out why it doesn't or what it would take. I'm open to reworking our implementation of wind_direction, but I'm not sure how you efficiently do that operation without boolean masking.
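A rework along those lines might use np.where, which builds a new array instead of assigning through a boolean index. This is a plain-NumPy sketch of the idea with made-up values, not MetPy's actual wind_direction implementation:

```python
import numpy as np

wdir = np.array([-90.0, 45.0, 0.0])

# In-place boolean-mask assignment: hard for dask, which cannot know the
# number of selected elements ahead of time.
masked = wdir.copy()
masked[masked <= 0] += 360.0

# Equivalent form that dask handles: np.where builds a new array.
wrapped = np.where(wdir <= 0, wdir + 360.0, wdir)

print(wrapped)  # [270.  45. 360.]
```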
@dopplershift Thanks for explaining, that clarifies things. Could one maybe turn it on its head and do something like this or is the problem that bool arrays don't work?
a += (a > 2)*4
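For what it's worth, the two forms do agree in plain NumPy, since a boolean array acts as 0/1 under arithmetic; a quick check (not run against dask here):

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0])

# Original boolean-indexing form
indexed = a.copy()
indexed[indexed > 2] += 4

# Proposed mask-multiply form: (a > 2) is 0/1 under arithmetic, so this
# adds 4 only where the condition holds
mask_mult = a + (a > 2) * 4

print(np.array_equal(indexed, mask_mult))  # True
```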
@lpilz Honestly, I have no idea what the problem is. If a dask user with a vested interest :wink: were to submit a PR with that change and it didn't break any tests (and fixed dask) then I'd be willing to merge it.
Not sure if this should go here or in a new issue (or even if this is an issue), but with this code:
import dask.array as da
import metpy.calc as mpcalc
from metpy.units import units
p = da.asarray([1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]) * units('hPa')
T = 20 * units('degC')
Td = 18 * units('degC')
prof = mpcalc.parcel_profile(p, T, Td)
I get ominous warnings:
/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.copyto` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
warnings.warn(
/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.place` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
warnings.warn(
As far as I can tell, MetPy does not directly call the functions in question, but it would certainly be nice if this code did not stop working in a future release.
@sgdecker I think it's good to capture it here. This particular function is high on our list to make sure it works with Dask for our current funded work on MetPy.
Out of curiosity, what's the scale of the data you're using for your "real" problem? The example above with a single profile isn't really doing much with Dask.
@dopplershift Yes, this is scaled down for example purposes. The real data is climate model output, so we are looking at something on the order of 10,000 gridpoints and thousands of times.
@sgdecker I'll be sure to reach out when we have something, then. That's literally one of the core workflows/use cases that we're trying to support with the funded work we have.
Hello! Curious if there is an update on Dask support?
Nothing really new to report. I'm curious if you've tried using dask with MetPy and run into any calculations in particular that cause you problems?
We haven't started anything yet, but we might use MetPy to retrieve NEXRAD data, filter for specific variables, convert from polar to geographic coordinates, calculate variables, and write the results to Zarr. It seems like one of the more user-friendly packages for reading Level 2 or Level 3 data.
You might want to consider xradar for that, at least for Level 2, since it reads NEXRAD Level 2 data directly into xarray.
Thank you! Will take a look.
As pieces for compatibility between Dask, Xarray, and Pint are coming together (e.g., xarray#4221, pint#1129, pint#1151, https://github.com/dask/dask/pull/6393, among others that I likely missed) we should begin keeping track of progress for Dask Array support in MetPy. This might help address #996, or the previously mentioned PRs may already allow for it.

I started testing support for Dask Arrays in #1388, but that should probably be separated from that PR. Most functions in the basic.py module just work with Pint-wrapped Dask Arrays without extra effort, but there are some exceptions. These include wind_direction, heat_index, apparent_temperature (working on heat_index should fix this), and pressure_to_height_std. The first two/three are a result of ambiguous array sizes due to boolean indexing. For example, in wind_direction: https://github.com/Unidata/MetPy/blob/071c700aa05b1962b8c598d3b999ae5657944cc8/src/metpy/calc/basic.py#L101-L103

mask = wdir <= 0 produces a boolean array of undetermined size, which leads to an axis of undetermined size in the Dask Array, which is given a nan value, e.g., (50, nan, 100). This is expected Dask behavior AFAIK. That should be pretty easy to fix with an np.where.

I haven't dug into why pressure_to_height_std doesn't work. I'll see if I can do that soon.

Another problem that is likely to come up, but I haven't dug into yet, involves scipy functions. I'm not sure how these will interact with Dask Arrays but there may need to be some special handling for that. Specifically I'm thinking of any vertical profile calculations (e.g., CAPE)... chunking is going to be very important for those.

Some discussion points:

To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.
Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible?
Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.

cc: @jthielen @tjwixtrom