USGS-R / drb-gw-hw-model-prep

Code repo to prepare groundwater and headwater-related datasets for modeling river temperature in the Delaware River Basin

tidy netcdf with xarray #44

Closed: msleckman closed this issue 1 year ago

msleckman commented 2 years ago

@lekoenig subset_nc_to_comid.py processes slowly for me in our targets pipeline. This may be tied to reticulate, but one idea is to simplify things by replacing ds_to_dataframe_faster() with xarray's built-in xarray.Dataset.to_dataframe(). Have you tried this method? It's more of a "black box" xarray tidying function that does what ds_to_dataframe_faster() was built to do, but it processes the netCDF quite a bit faster (see time tests below; I tested across different time frames to see how it scales).

If the results below look good to you, I can make and test this code change in my branch addressing #37.

# function from subset_nc_to_comid.py
import pandas as pd

def ds_to_dataframe_faster(ds):
    """
    Tidy an xarray Dataset one variable at a time to avoid the
    multi-index joins performed by xarray's .to_dataframe().
    """
    series_list = []
    for var in ds.data_vars:
        # skip the lat/lon coordinate variables
        if "lat" not in var and "lon" not in var:
            # pivot the (COMID, time) array to long format for this variable
            df_var = ds[var].to_pandas().reset_index()
            df_var_tidy = df_var.melt(id_vars='COMID', value_name=var)
            series_list.append(df_var_tidy[var])
    # append the COMID and time columns from the last variable processed
    series_list.append(df_var_tidy['COMID'])
    series_list.append(df_var_tidy['time'])
    return pd.concat(series_list, axis=1)
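# For reference, the candidate replacement (exercised in the timing test below)
# is xarray's built-in method, which flattens the (COMID, time) multi-index in
# a single call; .reset_index() then moves the index levels back out to columns:
# ds_df = ds.to_dataframe().reset_index()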

# timing function
from time import process_time

def time_xr_tidying(ds):
    # time the custom per-variable tidying function
    start = process_time()
    ds_fun = ds_to_dataframe_faster(ds)
    end = process_time()
    print('elapsed time w/function:', end - start)

    # time xarray's built-in .to_dataframe()
    start = process_time()
    ds_df = ds.to_dataframe().reset_index()
    end = process_time()
    print('elapsed time w/ .to_dataframe():', end - start)

# time testing
import xarray as xr

nc_file = 'drb_climate_2022_06_14.nc'
ds = xr.open_dataset(nc_file, decode_times=True)

# subset - 1 yr
ds_1yr = ds.sel(time=slice('1979-01-01', '1980-01-01'))
# subset - 10 yr
ds_10yr = ds.sel(time=slice('1979-01-01', '1989-01-01'))
# subset - 20 yr
ds_20yr = ds.sel(time=slice('1979-01-01', '1999-01-01'))
> time_xr_tidying(ds_1yr)
elapsed time w/function: 8.313794999999999
elapsed time w/ .to_dataframe(): 0.9057740000000081
> time_xr_tidying(ds_10yr)
elapsed time w/function: 112.05785399999999
elapsed time w/ .to_dataframe(): 33.176807
> time_xr_tidying(ds_20yr)
elapsed time w/function: 233.733969
elapsed time w/ .to_dataframe(): 86.80633

Output summary

The outputs are identical (the to_dataframe() approach keeps the hru_lat and hru_lon columns, so the shapes differ only in the number of columns: ds_df.shape = (5904312, 12) vs. ds_fun.shape = (5904312, 10)).
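If matching shapes matter for any downstream checks, the extra coordinate columns can be dropped; a minimal sketch (column names taken from the output above):

# drop the lat/lon coordinate columns carried along by .to_dataframe()
# so both data frames end up with the same shape
ds_df_trimmed = ds_df.drop(columns=['hru_lat', 'hru_lon'])
assert ds_df_trimmed.shape == ds_fun.shape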

Shown below are summary stats plus an example plot for one variable. All other variables gave the same result, with both data frames being identical.

> grouped_ds_df = ds_df.groupby('time').mean()
> grouped_ds_fun = ds_fun.groupby('time').mean()
> print(grouped_ds_fun.drop('COMID', axis = 1).describe())
> print(grouped_ds_df.drop(['COMID','hru_lat','hru_lon'], axis = 1).describe())

             tmmx        tmmn          pr  ...        rmax        rmin         sph
count  366.000000  366.000000  366.000000  ...  366.000000  366.000000  366.000000
mean    60.254989   40.821050    0.150591  ...   85.017617   45.119070    0.006339
std     18.341400   17.193906    0.277693  ...   12.213456    8.979670    0.003999
min     10.863673  -13.021575    0.000000  ...   43.492542   15.085517    0.000502
25%     46.622133   28.374595    0.001501  ...   77.554284   39.408278    0.002895
50%     62.686404   41.535074    0.030448  ...   87.578144   45.064299    0.005557
75%     75.340224   54.036601    0.156937  ...   95.402246   52.055688    0.009510
max     88.576795   70.733331    1.751408  ...  100.000000   67.168022    0.015907
[8 rows x 8 columns]
             tmmx        tmmn          pr  ...        rmax        rmin         sph
count  366.000000  366.000000  366.000000  ...  366.000000  366.000000  366.000000
mean    60.254989   40.821050    0.150591  ...   85.017617   45.119070    0.006339
std     18.341400   17.193906    0.277693  ...   12.213456    8.979670    0.003999
min     10.863673  -13.021575    0.000000  ...   43.492542   15.085517    0.000502
25%     46.622133   28.374595    0.001501  ...   77.554284   39.408278    0.002895
50%     62.686404   41.535074    0.030448  ...   87.578144   45.064299    0.005557
75%     75.340224   54.036601    0.156937  ...   95.402246   52.055688    0.009510
max     88.576795   70.733331    1.751408  ...  100.000000   67.168022    0.015907
[8 rows x 8 columns]
[attached image: example plot for one variable]
lekoenig commented 2 years ago

Thanks, @msleckman! This sounds really useful for speeding up our build time for the met data. I'm kind of confused about where we would implement this swap, though - for example, I tried replacing ds_to_dataframe_faster(ds_comids) with ds.to_dataframe(ds_comids).reset_index() in the subset_nc_to_comids() function (in 2_process/src/subset_nc_to_comid.py), but got an error when I ran it. I'm probably misinterpreting how we'd use this other xarray function, so any tips you have would be great 🙂
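A minimal sketch of what the intended call probably looks like, assuming the method is invoked on the already-subset dataset rather than being passed the dataset as an argument (as far as I can tell, Dataset.to_dataframe() only accepts an optional dimension-order argument):

# sketch: call .to_dataframe() on the subset dataset itself
ds_comids = ds.sel(COMID=comids_in_climate)
ds_comids_df = ds_comids.to_dataframe().reset_index()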

lekoenig commented 2 years ago

I was able to modify the function to incorporate your suggestions and use ds_comids.to_dataframe().reset_index() instead of calling our previously-defined ds_to_dataframe_faster() function.

import numpy as np
import xarray as xr

def subset_nc_to_comids(nc_file, comids):
    comids = [int(c) for c in comids]

    ds = xr.open_dataset(nc_file, decode_times=True)

    # filter out comids that are not in the climate drivers (should only be 4781767)
    comids = np.array(comids)
    comids_in_climate = comids[np.isin(comids, ds.COMID.values)]
    comids_not_in_climate = comids[~np.isin(comids, ds.COMID.values)]
    print(comids_not_in_climate)

    # We know of one COMID that has no catchment and so should be included
    # in `comids_not_in_climate` if passed through in `comids`. Use an assert
    # statement to make sure we are aware of any others. COMIDs within
    # `comids_not_in_climate` will not have matched climate data.
    if len(comids_not_in_climate) > 0:
        assert list(comids_not_in_climate) == [4781767]

    # subset the dataset to the requested COMIDs
    ds_comids = ds.sel(COMID=comids_in_climate)

    # [Lauren] we have been using a function written by Jeff Sadler for the DRB
    # PGDL-DO project to process the xarray object to a ~tidy data frame. Below
    # I've replaced ds_to_dataframe_faster(ds_comids) with a more generic call
    # to speed up the run time. See this issue for further details:
    # https://github.com/USGS-R/drb-gw-hw-model-prep/issues/44.
    ds_comids_df = ds_comids.to_dataframe().reset_index()
    return ds_comids_df
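# Hypothetical usage (a sketch): the COMID values below are placeholders only;
# in the pipeline they would come from the upstream NHD reach targets.
comids = ['4781767', '2588661']
met_df = subset_nc_to_comids('drb_climate_2022_06_14.nc', comids)
print(met_df.shape)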

However, I don't see time improvements as large as the ones you report in your examples above. The build time was previously ~34 min for me:

> tar_meta() %>% filter(name == "p2_met_data_nhd_mainstem_reaches") %>% pull(seconds)/60
[1] 33.42533

And with the changes to subset_nc_to_comids(), it's ~30 min:

> tar_meta() %>% filter(name == "p2_met_data_nhd_mainstem_reaches") %>% pull(seconds)/60
[1] 29.52967
>

Did you have other edits in mind besides what I pasted here?

lekoenig commented 1 year ago

@msleckman I made an attempt to incorporate your suggestions (see above) but didn't see much improvement in the build time. So I've unassigned myself from this issue and added a wontfix label. I'm not sure if my attempt fully captured what you had in mind, so if you're able to implement this please feel free to do so.

lekoenig commented 1 year ago

Closing this issue for now as wontfix.