[Feature] Add Dask integration

lpilz commented 3 years ago

I'm currently looking around for a python package which is able to extract some diagnostics from WRF output and do computations in a distributed way, as I will handle very large datasets in the future. As salem is built on pure python and uses xarray in the background, I prefer it over wrf-python (also because development on wrf-python pretty much ceased).

I already tried to get salem to run with dask in the most naive way possible, however I stumbled over the problem that workers die when executing the open_wrf_dataset function with the error distributed.protocol.core - CRITICAL - Failed to deserialize and TypeError: Could not serialize object of type ImplicitToExplicitIndexingAdapter. I had a quick look into the salem code and I think this might be due to the FakeVariable objects not having serializers associated with them. Does this sound sensible to you?

Since I know that you, @fmaussion, don't have a lot of time to spend on this project I thinking about adding dask compatibility myself. However, before jumping the gun I just wanted to ask if there was any prior work I could maybe build on or if you have any advice.

fmaussion commented 3 years ago

Thanks for reaching out!

As far as I was aware, these "netcdf mocks" I wrote used to be supported by dask. All the @requires_dask marked tests here do use dask, although they might be too simple nowadays.

I would love someone to pick up this project! Other voices on this repo have said that the diagnostic variables from salem are useful to them. What I think would be more future-proof would be to use some of the ideas in salem and make a new, dedicated project just for that (WRF diagnostic variables, maybe plotting although metpy might be used for that as well).

Back in the days, I was never 100% sure that mocking netcdf variables was the best way to do it (it feels quite hacky). The other alternative was to use dask to create delayed variables which are computed on demand, but I'm not sure this is possible given the complexity here (unstaggering, etc.)

jthielen commented 3 years ago

I would love someone to pick up this project! Other voices on this repo have said that the diagnostic variables from salem are useful to them. What I think would be more future-proof would be to use some of the ideas in salem and make a new, dedicated project just for that (WRF diagnostic variables, maybe plotting although metpy might be used for that as well).

As someone on the MetPy side who's been very interested in better WRF handling, but haven't been able to prioritize the cycles to work on spinning up something building off of salem on my own, I'd definitely be interested in collaborating on such a new project. Based on the discussion in https://github.com/Unidata/MetPy/issues/1004, @tjwixtrom has some initial implementations of a Dask-centric approach to WRF output handling, so that could be another useful starting point. That being said, based on the success of xgcm in the pangeo ecosystem, I'd also be inclined to directly build-in staggered grid handling using xgcm rather than just support unstaggering.

Would it be worth trying to get interested individuals together to spin up such a community project (perhaps under the pangeo umbrella)?

lpilz commented 3 years ago

@fmaussion Thanks for pointing me to those tests, I completely missed them. Generating dask-powered xarrays using ds = salem.open_wrf_dataset(salem.utils.get_demo_file('wrf_tip_d1.nc')).chunk(chunks={"west_east": 100, "south_north": 100}) works nicely but I tried ds = salem.open_wrf_dataset(salem.utils.get_demo_file('wrf_tip_d1.nc'), chunks={"west_east": 100, "south_north": 100}), which fails. Somehow accessing the raw data fails due to serialization problems.

When I first saw those mock variables, I immediately had to think about using delayed variables instead. Using xgcm for handling staggered grids does look interesting indeed. I'll have a proper look at it soon, however at the moment I'm not convinced that it is able to delay the unstaggering operation.

@jthielen It's good to know that there is some interest in crafting a robust solution for this problem. I will spend the next 3 years of my PhD working on high-resolution WRF modelling, so having a parallel and robust framework available for analysis is a priority for me and I'd be willing to invest some time. However, while I'd definitely be interested in joining the effort I don't have a huge amount of software engineering background apart from the usual university data analysis hacking stuff. So this would definitely be a learning experience for me but I am motivated to do this.

fmaussion commented 3 years ago

Maybe you two can reach out in the Pangeo discourse to gain momentum. My experience however is that you'll need one or two highly motivated people able to work on a working prototype once you have sorted out the design details (e.g. mocks vs delayed).

tjwixtrom commented 3 years ago

Regarding the code mentioned by @jthielen, that is a relatively simple solution based in Xarray/dask. I have been using it as the basis for my current work. It scales relatively well (for my purposes at least) and can handle quite large datasets. The major difference from something like XGCM is that it has been designed to be WRF-specific, which results in a simpler user interaction but is much less flexible in terms of other models/grids. I am happy to contribute this as the starting point to get something going based on existing projects.

lpilz commented 3 years ago

@tjwixtrom I think your code could be a nice starting point indeed. Some ideas on how to build upon it might be: formalizing the CF-compliance by using e.g. cfdm or by just abstracting the cf-vars into a class to reduce code duplication, implementing the remaining grid types (possibly using xgcm grids), supporting multiple datasets, abstracting diagnostic variable definitions and maybe adding additional ones from #18.

I don't think that keeping it WRF-specific is a bad thing per se, however it could be nice to try to keep the possibility of extending the software for models like MPAS or ICON in mind. In the end, I think prioritizing providing the functionality of wrf-python and salem in a scalable way would be a good focus.

jthielen commented 3 years ago

@lpilz Definitely good ideas! I might suggest building CF-related functionality around cf-xarray to best integrate with the pangeo stack, instead of using something separate from xarray. Keeping it WRF-specific helps manage the scope of the project, and if we combine that with a philosophy of contributing any new general-purpose functionality we need to upstream projects (e.g., cf-xarray, xgcm, MetPy), then it should be pretty easy to extend to other models later on. If you're willing to dedicate the time to working on this, I'd be glad to join in and help on the software engineering/package management side.

Taking the ideas brought up so far, here's what I'd envision this package looking like:

Metadata transformation from raw WRF output (that's already been opened with the default lazy-loading in xarray) to CF/COMODO conventions, borrowing what we can from salem
C-grid and vertical coordinate handling and transforms through xgcm
WRF-specific diagnostic variables computable with dask delayed
For general-purpose diagnostic variables, contribute dask-friendly versions/modifications upstream in MetPy
Port salem's geogrid_simulator
Include lots of example notebooks in documentation about how this new package's functionality plays along with the other essential packages

Which, in practice, could look something loosely like (for post-processing to isobaric levels then a quick preview plot of 700 hPa RH):

from metpy.units import units
import numpy as np
import xarray as xr
import xgcm

import newWrfPackage

ds = xr.open_dataset("wrfout_d01_2021-01-01_00:00:00", engine="netcdf").wrf.parse_wrf(
    include_diagnostics=["air_temperature", "geopotential_height"]
)
grid = xgcm.Grid(ds)

variables = ["air_temperature", "humidity_mixing_ratio", "x_wind", "y_wind", "geopotential_height"]
isobaric_levels = np.linspace(1000, 100, 37) * units.hPa
ds_isobaric = grid.transform(
    ds.cf[variables], "Z", isobaric_levels, target_data=ds.cf["air_pressure"], method="logarithmic"
)

rh = ds_isobaric.metpy.calc("relative_humidity")
rh.metpy.sel(vertical=700 * units.hPa, time="2021-01-01 12:00").plot()

Initial real examples would almost certainly be messier, as this sample does leverage some not-yet-ready features like xgcm unit-awareness and MetPy's automated field solver.

What are everyone's thoughts on this?

Also, once we get a good set of ideas together and a list of who's interested in spinning up the package, I'd be glad to write up a post on the pangeo discourse to see if they'd be interested in hosting it.

lpilz commented 3 years ago

@jthielen Thanks for picking this up. I'd definitely be interested in dedicating some time to this. Also, cf-xarray is probably more sensible than using cfdm, didn't know this project existed. Maybe cf-xarray could be added to this website.

Your ideas around this package are very nice, I just have some comments regarding two of them.

Regarding the handling of the diagnostic variables, I would propose a solution similar to @tjwixtrom's code. I'm not quite sure this is feasible, but I really like the idea of exposing all the diagnostics in a delayed fashion just waiting for the .compute() or .persist() call. My idea was: load the WRF dataset and have all diagnostics be ready as delayed variables on an unstaggered grid (grid destaggering also delayed). Then, all interpolations and further operations will be added to the task graph and only shortly before plotting, the fields would be computed. A far-away vision on which variables to implement, could be these ones: https://sundowner.colorado.edu/wrfout_to_cf/variables.html. In the code above, this would change the data loading to ds = xr.open_dataset("wrfout_d01_2021-01-01_00:00:00", engine="netcdf").wrf.parse_wrf() and everything else could stay the same if I'm not completely wrong.

With regards to geogrid_simulator, I would maybe propose bundling multiple of these WRF tools not immediately concerned with the simulation output transformation/handling into a separate repository just to avoid cluttering the library. Another addition could be a tool like https://github.com/lewisjared/wrfconf, however with all projections and a bit more flexibility regarding new namelist parameters.

jthielen commented 3 years ago

Sorry for loosing track of following up on this! As far as the follow-up points raised:

having all (or some standard list of default) diagnostic variables exposed in a delayed fashion could work too! I just would want to make sure to have the option to not always have all (for tidiness and efficiency).
I don't think that helper utilities like the geogrid_simulator would clutter the library, but that preference for more features in a single library rather than being spread out across libraries, is just that, my preference. Definitely wouldn't want something like that to hold up progress.

That being said, I wanted to mention what reminded me of this again!

I just stumbled across NCAR's new xwrf effort in a recent blog post, which takes another approach not mentioned here (a custom IO backend). I'm planning on reaching out to them to see if the functionality sought here would have a home there or if they'd be willing to shift their efforts towards a shared, community-led package (and how it relates wrf-python).

lpilz commented 3 years ago

No worries, I was busy the last weeks too so it actually kinda worked out.

I was thinking about using delayed variables because they essentially have zero overhead (except for the task graph creation). But of course one could also just filter the variables out using a preprocessor (cf. xwrf) if they are not needed.
I don't think that is a major point either, if you prefer it included, I'm fine with it

It's very interesting to see NCAR picking up WRF postprocessing again, good catch. However, I'm not sure if they will add the computation of diagnostic variables and automatic destaggering, which would be the core of this project. But it is definitely a good idea to ask them exactly these kinds of questions before proceeding. If you want, you can put me in CC of the email, I'd also be interested in their take on our points and initiative.

tjwixtrom commented 3 years ago

Tagging @mgrover1 who is involved with the NCAR xWRF effort.

mgrover1 commented 3 years ago

@andersy005 put together a prototype backend to help with reading WRF data into xarray, (xwrf) bypassing the requirement to use wrf-python to handle the file access. I put together that post to help illustrate how powerful this approach can be when integrating with the Pangeo stack (xarray, dask, and hvPlot within that example)! This approach improved IO performance, and opened the door to interfacing with Dask for calculations + plotting routines.

It is not clear yet what should all be included in xwrf... but I think having a more community-based approach would be great!

Perhaps setting up a time to chat about this would be helpful? We may want to include some of the GeoCAT developers (@michaelavs and @erogluorhan), who have been maintaining wrf-python.

erogluorhan commented 3 years ago

Hello, all the discussion here as well as the recent prototype xwrf effort of @andersy005 and blog posting of @mgrover1 are invaluable for Xarray/Dask-compatible WRF diagnostics. We have the objective of making WRF-Python have improved Xarray compatibility, which could potentially leverage xwrf or other discussions here, even though we could not take actions for that yet. We are happy to contribute any community-owned discussion and implementation.

jthielen commented 3 years ago

Thank you all for chiming in (looks like I don't have to reach out in a separate channel)!

Setting up a call as @mgrover1 suggests sounds like a good idea to me! Though, if that doesn't work out, perhaps we could organize asynchronous discussion in another location (perhaps GitHub discussions on the NCAR/xwrf repo)?

michaelavs commented 3 years ago

Hi everyone! As Orhan mentioned, we have talked about the future of wrf-python development (new functionality, new routines, etc) but have not yet had the chance to fully invest time into any major developments. I would definitely love being in the loop with these initiatives when it comes to expanding dask/xarray compatibility with wrf files (something I have seen asked a couple of times wrt wrf-python) and help bring wrf-python up to speed with these packages as well.

@mgrover and @jthielen, I would be open to/interested in a call where these items can be discussed, so please let me know any updates in that area!

lpilz commented 3 years ago

Hi everyone, great to see so much enthusiasm for new developments in wrf data postprocessing. I concur with @jthielen that a more discussion-friendly communications format would certainly be helpful and second @mgrover1's initiative for a call.

What I would be very interested in, is whether NCAR is aiming to continue developing wrf-python towards dask, supersede it with xwrf or is aiming to start over with a new package. This could also decide on where to continue the discussion, as we certainly don't want to overstay our welcome at @fmaussion's ;)

erogluorhan commented 3 years ago

Hi @lpilz , thanks for asking! There are pros & cons of either way (i.e. continuing with WRF-Python, or starting over a new package), and a few of them that immediately come into my mind are:

WRF-Python has a huge user base, e.g. 2500-3000 monthly visitors to the documentation page, highest number of downloads per version was more than 25K through Anaconda (v1.3.2), and all-time downloads of more than 440K even though it doesn't show individual users)
WRF-Python is based on compiled code (i.e. Fortran routines), i.e. cons like not pure Python implementation, no direct control when Dask comes into place, no no-arch releases, etc.
Starting over with xwrf or something else might give more design flexibility than sticking to existing WRF-Python architecture
Starting over a new package would require implementation of several diagnostics and interpolation functions while they are already available via WRF-Python (though I am not well aware of the functionality of tools such as salem).

Out of all of these, I am giving the biggest weight to the first one: user community of WRF-Python. It does not only provides us with a great opportunity of having such a big user base but also brings us (i.e. NCAR's GeoCAT team) the responsibility of maintaining the toolkit's existing functionality and improving it for the current users. That said, our GeoCAT team's preliminary plan would be keeping the existing functionality of WRF-Python for existing users as well as leveraging xwrf or other technologies to improve our Xarray compatibility and introduce Dask scalability.

However, we are open to other thoughts on this as well as to any extent of collaboration to take the best action for the community.

fmaussion / salem

[Feature] Add Dask integration #205