NCAR / esmlab

Earth System Model Lab (esmlab). ⚠️⚠️ ESMLab functionality has been moved into <https://github.com/NCAR/geocat-comp>. ⚠️⚠️
https://esmlab.readthedocs.io
Apache License 2.0
24 stars 8 forks source link

Move general functionality upstream #157

Open andersy005 opened 5 years ago

andersy005 commented 5 years ago

While working on #156 and per discussion with @dcherian, I learned that esmlab's codebase is predominantly full of hacks.

I don't know the code that well at all (!) but this stuff looks sane.

I myself have started having a hard time understanding the existing codebase. It's been a while since I looked at esmlab's code base. So, I spent the last four days looking at the existing code base, and my take-away is that esmlab's codebase is full of hacks. And this makes it difficult to debug or even maintain :(

I am personally in favor of helping with pushing most of the general functionality into xarray and only keeping things that are domain-specific. I just found out that there has been recent progress in https://github.com/pydata/xarray/pull/2922, and I am excited about this. Once https://github.com/pydata/xarray/pull/2922 is merged, we can move most if not all existing functionality from https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py into xarray.

Originally posted by @andersy005 in https://github.com/NCAR/esmlab/pull/156#issuecomment-546997976

andersy005 commented 5 years ago

Yeah +1 on moving things upstream.

From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendargpropagation (https://github.com/pydata/xarray/issues/1614).

For the last one, it would be good to document how much of the table in https://github.com/pydata/xarray/issues/1614 is implemented and what's left.

Originally posted by @dcherian in https://github.com/NCAR/esmlab/pull/156#issuecomment-547004796

andersy005 commented 5 years ago

@dcherian

From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendargpropagation

This is correct.

We are relying on the time bounds for all time-aware operations.

1) Decoding Time: Internally, esmlab decodes time it by using the time bounds variable. Xarray doesn't seem to be aware of time-bounds variable when decoding time:

In [1]: import xarray as xr                                                                                                                        

In [2]: import numpy as np                                                                                                                         

   ...:     ds['variable_1'] = xr.DataArray( 
   ...:         np.append( 
   ...:             np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0 
   ...:         ), 
   ...:         dims=['time', 'lat', 'lon'], 
   ...:     ) 
   ...:     ds['variable_2'] = xr.DataArray( 
   ...:         np.append( 
   ...:             np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0 
   ...:         ), 
   ...:         dims=['time', 'lat', 'lon'], 
   ...:     ) 
   ...:     ds.time.attrs['units'] = 'days since 0001-01-01 00:00:00' 
   ...:     ds.time.attrs['calendar'] = 'noleap' 
   ...:     ds.time.attrs['bounds'] = 'time_bound' 
   ...:     return ds.copy(True) 
   ...:                                                                                                                                            

In [4]: ds = create_dataset()                                                                                                                      

In [5]: ds                                                                                                                                         
Out[5]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) float64 31.0 59.0 90.0 120.0 ... 638.0 669.0 699.0 730.0
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (time, d2) float64 0.0 31.0 31.0 59.0 ... 699.0 699.0 730.0
    variable_1  (time, lat, lon) float32 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
    variable_2  (time, lat, lon) float32 1.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0

In [6]: xr.decode_cf(ds)                                                                                                                           
Out[6]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) object 0001-02-01 00:00:00 ... 0003-01-01 00:00:00
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (time, d2) object ...
    variable_1  (time, lat, lon) float32 ...
    variable_2  (time, lat, lon) float32 ...

Notice how the time axis is shifted to the right by one month.

Ideally, you would want the time to be:

In [10]: import cftime            
In [14]: cftime.num2date(ds.time_bound.mean(dim='d2'), units='days since 0001-01-01 00:00:00', calendar='noleap')                                  
Out[14]: 
array([cftime.DatetimeNoLeap(0001-01-16 12:00:00),
       cftime.DatetimeNoLeap(0001-02-15 00:00:00),
       cftime.DatetimeNoLeap(0001-03-16 12:00:00),
       cftime.DatetimeNoLeap(0001-04-16 00:00:00),
       cftime.DatetimeNoLeap(0001-05-16 12:00:00),
       cftime.DatetimeNoLeap(0001-06-16 00:00:00),
       cftime.DatetimeNoLeap(0001-07-16 12:00:00),
       cftime.DatetimeNoLeap(0001-08-16 12:00:00),
       cftime.DatetimeNoLeap(0001-09-16 00:00:00),
       cftime.DatetimeNoLeap(0001-10-16 12:00:00),
       cftime.DatetimeNoLeap(0001-11-16 00:00:00),
       cftime.DatetimeNoLeap(0001-12-16 12:00:00),
       cftime.DatetimeNoLeap(0002-01-16 12:00:00),
       cftime.DatetimeNoLeap(0002-02-15 00:00:00),
       cftime.DatetimeNoLeap(0002-03-16 12:00:00),
       cftime.DatetimeNoLeap(0002-04-16 00:00:00),
       cftime.DatetimeNoLeap(0002-05-16 12:00:00),
       cftime.DatetimeNoLeap(0002-06-16 00:00:00),
       cftime.DatetimeNoLeap(0002-07-16 12:00:00),
       cftime.DatetimeNoLeap(0002-08-16 12:00:00),
       cftime.DatetimeNoLeap(0002-09-16 00:00:00),
       cftime.DatetimeNoLeap(0002-10-16 12:00:00),
       cftime.DatetimeNoLeap(0002-11-16 00:00:00),
       cftime.DatetimeNoLeap(0002-12-16 12:00:00)], dtype=object)

Is this something that is reasonable to implement in xarray or is too domain-specific?

  1. Wrong results when applying ops such as resample() on the time axis. Because of the way xarray decodes the time, the user ends up getting wrong results when they try to do some operations along the time axis:

    • For instance, when applying annual resampling on the above dataset, we end up getting three years instead of two years:
      
      In [19]: ds.resample(time='A').mean()                                                                                                              
      Out[19]: 
      <xarray.Dataset>
      Dimensions:     (d2: 2, lat: 2, lon: 2, time: 3)
      Coordinates:
    • time (time) object 0001-12-31 00:00:00 ... 0003-12-31 00:00:00
    • lat (lat) int64 0 1
    • d2 (d2) int64 0 1
    • lon (lon) int64 0 1 Data variables: variable_1 (time, lat, lon) float32 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 variable_2 (time, lat, lon) float32 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0
  2. xarray discards the time_bounds variable. As a user, I would expect xarray to keep the time_bounds variable, and generate new values for it accordingly. For instance:

In [26]: esmlab.resample(create_dataset(), freq='ann')                                                                                             
Out[26]: 
<xarray.Dataset>
Dimensions:     (d2: 2, lat: 2, lon: 2, time: 2)
Coordinates:
  * lat         (lat) int64 0 1
  * lon         (lon) int64 0 1
  * time        (time) float64 181.7 546.7
  * d2          (d2) int64 0 1
Data variables:
    time_bound  (d2, time) float64 0.0 365.0 365.0 730.0
    variable_1  (time, lat, lon) float64 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
    variable_2  (time, lat, lon) float64 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0

Is it reasonable to have this in xarray or is domain specific (i.e. should we handle this in esmlab?) ?

andersy005 commented 5 years ago

xarray discards the time_bounds variable. As a user, I would expect xarray to keep the time_bounds variable, and generate new values for it accordingly.

https://github.com/NCAR/esmlab/issues/142 proposes a method to generate the new time-coordinate.

dcherian commented 5 years ago
  1. Is this something that is reasonable to implement in xarray or is too domain-specific?

    It is inconvenient but I don't think its wrong. The file is intentionally saved with time at the end of the month. Why wasn't it saved with the midpoint of time bounds in the first place? In any case, this won't fly (https://github.com/pydata/xarray/issues/2231#issuecomment-397167282). xarray basically does not deal with relationships between variables and that is unlikely to change. Our only exception (I think) is making sure bounds variables are encoded and decoded with the same units as the grid variable.

  2. As above.

  3. This looks like a bug to me. As a data variable, time_bounds should not be getting dropped. (coordinate variables get dropped https://github.com/pydata/xarray/issues/2944 but I think that should change). Your code is incomplete (missing def create_dataset...). can you update that; i'll take a look.

Some of this might get solved eventually (https://github.com/pydata/xarray/issues/1475) when xarray adds more fancy indexes but that's a longer term project.

I should be clear: esmlab adds very useful capabilities and a package like it is definitely required. It's just that general things like machinery for weighted operations should be moved upstream.

andersy005 commented 5 years ago

Your code is incomplete (missing def create_dataset...). can you update that; i'll take a look.

My bad. I partially pasted the code. Here's the complete version:

import xarray as xr
import numpy as np
def create_dataset():
    start_date = np.array([0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334], dtype=np.float64)
    start_date = np.append(start_date, start_date + 365)
    end_date = np.array([31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365], dtype=np.float64)
    end_date = np.append(end_date, end_date + 365)
    ds = xr.Dataset(coords={'time': 24, 'lat': 2, 'lon': 2, 'd2': 2})
    ds['time'] = xr.DataArray(end_date, dims='time')
    ds['lat'] = xr.DataArray([0, 1], dims='lat')
    ds['lon'] = xr.DataArray([0, 1], dims='lon')
    ds['d2'] = xr.DataArray([0, 1], dims='d2')
    ds['time_bound'] = xr.DataArray(
        np.array([start_date, end_date]).transpose(), dims=['time', 'd2']
    )
    ds['variable_1'] = xr.DataArray(
        np.append(
            np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    ds['variable_2'] = xr.DataArray(
        np.append(
            np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    ds.time.encoding['units'] = 'days since 0001-01-01 00:00:00'
    ds.time.encoding['calendar'] = 'noleap'
    ds.time.encoding['bounds'] = 'time_bound'
    return ds
andersy005 commented 5 years ago

Why wasn't it saved with the midpoint of time bounds in the first place?

I agree with you. This is a CESM issue and not a general issue, and it makes more sense to address it at the source.

andersy005 commented 5 years ago

Or we could just have a separate minimal package to handle the strange behaviors/inconsistencies of CESM output that are problematic when using xarray.

dcherian commented 4 years ago

https://github.com/pydata/xarray/pull/2922 could use some testing if someone's got the time

andersy005 commented 4 years ago

@dcherian, I took pydata/xarray#2922 for a test drive. From the results I am getting, it looks good to me:

Screen Shot 2019-12-11 at 11 49 29 AM