NCAR / cesm-lens-aws

Examples of analysis of CESM LENS data publicly available on Amazon S3 (us-west-2 region) using xarray and dask
https://doi.org/10.26024/wt24-5j82
BSD 3-Clause "New" or "Revised" License
43 stars 23 forks source link

Interesting (poor) dask behavior in #28

Closed jhamman closed 4 years ago

jhamman commented 4 years ago

I'm noticing some odd behavior in the example notebook with regard to dask and some transpose logic. I'm not 100% sure this was here before the move to intake-esm so I'm wondering if we're combining datasets in a sub-optimal way. The offending cell is below. I'm about to give a tutorial so I can't dig deeper right now but I'm putting this up so we can come back to it.

winter_seasons = seasons.sel(
    time=seasons.time.where(seasons.time.dt.month == 12, drop=True)
)
winter_trends = linear_trend(
    winter_seasons.chunk({"lat": 20, "lon": 20, "time": -1})
).load() * len(winter_seasons.time)

# Compute ensemble mean from the first 30 members
winter_trends_mean = winter_trends.isel(member_id=range(30)).mean(dim="member_id")
winter_trends_mean

image

andersy005 commented 4 years ago

I'm not 100% sure this was here before the move to intake-esm so I'm wondering if we're combining datasets in a sub-optimal way.

In this particular case, we are not merging/concatenating the zarr stores. Each zarr store ends up in its own xarray dataset:

Screen Shot 2019-12-10 at 1 49 09 PM

After further investigations, I can confirm that intake-esm is not the culprit here. I am able to reproduce this behavior without intake-esm.

import fsspec
import xarray as xr

options = dict(anon=True)

ds_20C = xr.open_zarr(fsspec.get_mapper('s3://ncar-cesm-lens/atm/daily/cesmLE-20C-TREFHT.zarr', **options), consolidated=True)
ds_RCP85 = xr.open_zarr(fsspec.get_mapper('s3://ncar-cesm-lens/atm/daily/cesmLE-RCP85-TREFHT.zarr', **options), consolidated=True)

Screen Shot 2019-12-10 at 1 38 34 PM

Dependency versions

numpy      1.17.3
fsspec     0.6.1
xarray     0.14.1
intake     0.5.3
intake_esm 2019.10.15.post92
dask       2.9.0

Here's the notebook used to reproduce this behavior.

Unfortunately, I can't really tell whether this behavior has always been there and/or it is caused by recent changes in dask/xarray.

jeffdlb commented 4 years ago

I believe each chunk (at least for TREFHT) contains the entire 288x192 lat/lon range, so does this line mean each chunk ends up getting read (288/20)*(192/20)=150 times, or even 20x20=400 times?

winter_trends = linear_trend( winter_seasons.chunk({"lat": 20, "lon": 20, "time": -1}) ).load() * len(winter_seasons.time)

andersy005 commented 4 years ago

Closing this issue for the time being...My current speculation is that it has to do with xarray's apply_ufunc, but I was unable to determine the exact root cause. I may re-open it in the future :)

dcherian commented 4 years ago

Could use da.polyfit now and see if the same behaviour persists.

andersy005 commented 4 years ago

Could use da.polyfit now and see if the same behaviour persists.

👍 I had forgotten about da.polyfit. I will look into it tomorrow.