Closed msleckman closed 1 year ago
Thanks, @msleckman! This sounds really useful for speeding up our build time for the met data. I'm kind of confused where we would implement this swap, though. For example, I tried replacing `ds_to_dataframe_faster(ds_comids)` with `ds.to_dataframe(ds_comids).reset_index()` in the `subset_nc_to_comids()` function (in `2_process/src/subset_nc_to_comid.py`), but I got an error when I tried to run that. I'm probably misinterpreting how we'd use this other xarray function, so any tips you have would be great 🙂
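(For what it's worth, `to_dataframe()` is a method on the `xarray.Dataset` itself and does not take the dataset as an argument; its only optional argument is `dim_order`, which is likely why that call errored. A minimal sketch of the working pattern, reusing the names from the function below:)

```python
# Sketch only: subset the Dataset first, then call to_dataframe() with no
# arguments on the subset (names follow the function shown below).
ds_comids = ds.sel(COMID=comids_in_climate)
ds_comids_df = ds_comids.to_dataframe().reset_index()
```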
I was able to modify the function to incorporate your suggestions and use `ds_comids.to_dataframe().reset_index()` instead of calling our previously defined `ds_to_dataframe_faster()` function.
```python
import numpy as np
import xarray as xr

def subset_nc_to_comids(nc_file, comids):
    comids = [int(c) for c in comids]
    ds = xr.open_dataset(nc_file, decode_times=True)
    # filter out comids that are not in climate drivers (should only be 4781767)
    comids = np.array(comids)
    comids_in_climate = comids[np.isin(comids, ds.COMID.values)]
    comids_not_in_climate = comids[~np.isin(comids, ds.COMID.values)]
    print(comids_not_in_climate)
    # We know of one COMID that has no catchment and so should be included
    # in `comids_not_in_climate` if passed through in `comids`. Use assert
    # statement to make sure we are aware of any others. COMIDs within
    # `comids_not_in_climate` will not have matched climate data.
    if len(comids_not_in_climate) > 0:
        assert list(comids_not_in_climate) == [4781767]
    ds_comids = ds.sel(COMID=comids_in_climate)
    # [Lauren] we have been using a function written by Jeff Sadler for the DRB
    # PGDL-DO project to process the xarray object to a ~tidy data frame. Below
    # I've replaced ds_to_dataframe_faster(ds_comids) with a more generic function
    # to speed up the run time. See this issue for further details:
    # https://github.com/USGS-R/drb-gw-hw-model-prep/issues/44.
    ds_comids_df = ds_comids.to_dataframe().reset_index()
    return ds_comids_df
```
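As a rough illustration (not part of the pipeline), the revised function could be exercised directly in Python like this; the netCDF path and COMIDs below are placeholders:

```python
# Hypothetical inputs for illustration only; the real pipeline supplies these
# through targets/reticulate.
nc_file = "2_process/in/example_met_drivers.nc"   # placeholder path
comids = ["4648450", "4781767"]                   # placeholder COMIDs

met_df = subset_nc_to_comids(nc_file, comids)
print(met_df.shape)
print(met_df.head())
```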
However, I don't see the large time improvements you report in your examples above. The build time was previously ~34 min for me:

```r
> tar_meta() %>% filter(name == "p2_met_data_nhd_mainstem_reaches") %>% pull(seconds)/60
[1] 33.42533
```

And with the changes to `subset_nc_to_comids()`, it's ~30 min:

```r
> tar_meta() %>% filter(name == "p2_met_data_nhd_mainstem_reaches") %>% pull(seconds)/60
[1] 29.52967
```
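Since the target build time only dropped by a few minutes, most of the ~30 min is probably spent outside the dataframe conversion (e.g., opening/subsetting the netCDF or reticulate overhead). One way to check would be to time the two conversion approaches in isolation; a sketch, assuming `ds_comids` and `ds_to_dataframe_faster()` from the function above are in scope:

```python
import time

# Time only the xarray-to-dataframe step, outside the targets pipeline.
t0 = time.perf_counter()
df_fast = ds_to_dataframe_faster(ds_comids)      # existing project helper
t1 = time.perf_counter()
df_xr = ds_comids.to_dataframe().reset_index()   # built-in xarray conversion
t2 = time.perf_counter()

print(f"ds_to_dataframe_faster(): {t1 - t0:.1f} s")
print(f"Dataset.to_dataframe():   {t2 - t1:.1f} s")
```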
Did you have other edits in mind besides what I pasted here?
@msleckman I made an attempt to incorporate your suggestions (see above) but didn't see much improvement in the build time, so I've unassigned myself from this issue and added a `wontfix` label. I'm not sure whether my attempt fully captured what you had in mind, so if you're able to implement this, please feel free to do so.

Closing this issue for now as `wontfix`.
@lekoenig the `subset_nc_to_comid.py` script processes slowly for me in our targets pipeline. This may be tied to reticulate, but one idea I have is to simplify `ds_to_dataframe_faster()` and replace it with `xarray.to_dataframe()`. Have you tried this function? It's more of a "black box" xarray tidying function that does what you have built with `ds_to_dataframe_faster()`, but it processes the netCDF quite a bit faster (see the time tests below; I tested across different time frames to see how it scales). If the results below work for you, I can go ahead and make and test this code modification in my branch addressing #37.
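A sketch of how a scaling test like that could look; the file name, time coordinate, and time windows below are placeholders rather than the exact ones used:

```python
import time
import xarray as xr

ds = xr.open_dataset("met_drivers.nc", decode_times=True)   # placeholder file

# Convert increasingly long time windows to see how run time scales
# (assumes a "time" dimension coordinate with datetime values).
for end_year in ["1990", "2000", "2010"]:                    # placeholder windows
    ds_slice = ds.sel(time=slice("1980-01-01", f"{end_year}-12-31"))
    t0 = time.perf_counter()
    _ = ds_slice.to_dataframe().reset_index()
    elapsed = time.perf_counter() - t0
    print(f"through {end_year}: {ds_slice.sizes['time']} timesteps, {elapsed:.1f} s")
```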
**Output summary**
The outputs are identical. The `to_dataframe()` process keeps the `hru_lat` and `hru_lon` columns, so the shapes differ only in the number of columns (`ds_df.shape = (5904312, 12)` vs. `ds_fun.shape = (5904312, 10)`). Shown below are summary stats plus an example plot for one variable; all other variables gave the same result, with both datasets being identical.
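For completeness, a sketch of how the two outputs could be checked for equality after dropping the extra coordinate columns. It assumes `ds_df` is the `to_dataframe()` result (12 columns, including `hru_lat`/`hru_lon`) and `ds_fun` is the `ds_to_dataframe_faster()` result (10 columns):

```python
import pandas as pd

# Drop the extra coordinate columns that to_dataframe() keeps so the two
# frames have the same columns, then compare values.
ds_df_trimmed = ds_df.drop(columns=["hru_lat", "hru_lon"])

pd.testing.assert_frame_equal(
    ds_df_trimmed.reset_index(drop=True),
    ds_fun.reset_index(drop=True),
    check_like=True,  # ignore column/index ordering
)
```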