220 seasonal shap

jds485 commented 1 year ago

This PR provides functions that split our dataset into:

water year seasons
land cover classes (high urban, high forest cover)
physiographic province
water year seasons by land cover class
water year seasons by physiographic province
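
To illustrate the water year season split, here is a minimal Python sketch (the pipeline itself is written in R, and this is not the PR's code). It relies on the USGS convention that a water year starts on 1 October and is named for the calendar year in which it ends. The three-month season blocks and the `date` key are assumptions made for the example; the PR does not state how seasons are defined.

```python
from datetime import date

# Assumed season labels: three-month blocks starting with the water year (1 October).
SEASON_BY_MONTH = {
    10: "OND", 11: "OND", 12: "OND",
    1: "JFM", 2: "JFM", 3: "JFM",
    4: "AMJ", 5: "AMJ", 6: "AMJ",
    7: "JAS", 8: "JAS", 9: "JAS",
}


def water_year(d: date) -> int:
    """Return the water year (1 Oct to 30 Sep, named for its ending year)."""
    return d.year + 1 if d.month >= 10 else d.year


def water_year_season(d: date) -> str:
    """Return the assumed season label for a date."""
    return SEASON_BY_MONTH[d.month]


def split_by_season(observations):
    """Group observations (dicts with a hypothetical 'date' key) by season."""
    splits = {}
    for obs in observations:
        splits.setdefault(water_year_season(obs["date"]), []).append(obs)
    return splits
```

Crossing a season split with a land cover or physiographic split then amounts to grouping on both keys together.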

1_fetch additions:

Fetches the physiographic ecoregions from a SB item.
Adds the unique region names to each PRMS reach. This was a bit of a pain because reaches overlap multiple regions, and some regions were split into several polygons. I'm curious if you know of a better way to extract the ecoregion data than what I programmed.
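
One possible way to handle the overlap problem, sketched in Python rather than the pipeline's R: assume a spatial intersection has already produced one (reach, region name) record per overlapping polygon, then collapse those records to the unique region names per reach. The input layout, reach IDs, and region names below are hypothetical, and this is not necessarily how the PR does it.

```python
def unique_regions_per_reach(overlaps):
    """Collapse (reach_id, region_name) overlap records to unique names per reach.

    A region split into several polygons yields repeated names for the same
    reach; using a set removes those repeats.
    """
    regions = {}
    for reach_id, region_name in overlaps:
        regions.setdefault(reach_id, set()).add(region_name)
    # Sort names so the output is deterministic.
    return {reach: sorted(names) for reach, names in regions.items()}


# Hypothetical overlap records: reach 101 touches two polygons of the same
# region plus a second region; reach 102 touches one region.
example = [
    (101, "Piedmont"),
    (101, "Piedmont"),
    (101, "Coastal Plain"),
    (102, "Piedmont"),
]
print(unique_regions_per_reach(example))
```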

2_process additions:

Adds a target, p2_TOT_lc_physio_attrs, that contains only the TOT land cover classes and the physiographic region identifiers (one binary indicator per region) for each SC observation.
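
The binary region identifiers can be pictured as one 0/1 column per region. A minimal Python sketch follows, assuming each reach already has a list of region names; the data layout and names are hypothetical, and the actual target is built in R.

```python
def region_indicator_columns(regions_by_reach):
    """Return, for each reach, a dict of region name -> 0/1 membership flag."""
    # Every region seen on any reach becomes a column.
    all_regions = sorted({r for names in regions_by_reach.values() for r in names})
    return {
        reach: {region: int(region in names) for region in all_regions}
        for reach, names in regions_by_reach.items()
    }
```

A reach that overlaps two regions gets a 1 in both columns, so the indicators are not mutually exclusive.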

4_predict additions:

Adds a function, setup_shap_data, that completes the dataset splits and returns a list with one element per split. The target that uses this function is p4_shap_data_splits in 4_predict_plots.R.
Adds a function, get_shap_subset, that filters the full dataframe of SHAP values to only the rows in each data split.
Adds a function, get_shap_dir, that gets the output file directory based on the data split name.
Sets the Boruta and spatial train/test targets to never rebuild. These were being triggered to rebuild by the edits to generate_credentials.
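
As a rough picture of the subsetting and directory steps, here is a Python sketch with stand-in functions. The names `subset_shap_rows` and `shap_output_dir`, the `row_id` key, and the example path are all hypothetical; the PR's get_shap_subset and get_shap_dir are R functions whose signatures are not shown in this thread.

```python
from pathlib import PurePosixPath


def subset_shap_rows(shap_rows, split_row_ids):
    """Keep only the SHAP rows whose (hypothetical) 'row_id' is in the split."""
    keep = set(split_row_ids)
    return [row for row in shap_rows if row["row_id"] in keep]


def shap_output_dir(base_dir, split_name):
    """Build an output directory path from a base directory and a split name."""
    return str(PurePosixPath(base_dir) / split_name)
```

Computing SHAP values once on the full dataset and filtering rows per split avoids rerunning the expensive SHAP step for every split.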

_targets.R additions:

Adds directories for the SHAP analyses corresponding to the data splits.

I also addressed #216, but do not have a solution to the problem. I added a function in generate_credentials.R that checks whether the most recent credentials file was generated in the last 10 minutes. As noted in #216, most of the time is spent loading the dummy_var target that is used to trigger when the AWS credentials should be built. As part of this process, I realized that many of the AWS credentials targets could be deleted because the last line in the dependency target's function calls generate_credentials(). That helped reduce the overall time spent on these AWS credentials steps. I recommend that in the future we regenerate credentials within functions instead of as separate targets.
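
The 10-minute freshness check can be sketched as a comparison between a file's modification time and the current time. This is a Python illustration under assumptions, not the function added to generate_credentials.R: the function name, the `now` parameter, and the idea of using the file's modification time are all hypothetical.

```python
import os
import time


def credentials_are_fresh(path, max_age_seconds=600, now=None):
    """Return True if the file at `path` was modified within the last 10 minutes.

    `now` (seconds since the epoch) can be passed in to make the check testable;
    it defaults to the current time.
    """
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds
```

A caller would regenerate credentials only when this returns False, which skips the regeneration step when a recent file already exists.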

Closes #220, #216

jds485 commented 1 year ago

Here are example results from the Urban and Forest splits. There are clear differences in feature importance rank, and also in the direction of the effect for some attributes (inverse pattern for TOT Clay).

[Figure: Urban (TOT > 75%), SHAP_global_RF_static_dynamic_temporal_full_vars40]

[Figure: Forest (TOT > 75%), SHAP_global_RF_static_dynamic_temporal_full_vars40]

jds485 commented 1 year ago

I'm curious what reaches are in each of the spatial data splits (land cover, physio region). I can make a map target for these splits. Would you prefer as a separate issue?

lekoenig commented 1 year ago

> I'm curious what reaches are in each of the spatial data splits (land cover, physio region). I can make a map target for these splits. Would you prefer as a separate issue?

Unless you have a preference, I think a separate issue would be helpful so that I can focus on the code changes in this PR.

jds485 commented 1 year ago

Okay, I'll make a separate issue to map these.

jds485 commented 1 year ago

Thanks, Lauren! Good call on not running the target computations. I forgot to add time requirements in my initial post. The 3 full-dataset SHAP computations take a total of about 15 hours to run, plus another 4 hours to make plots and summarize results.

> Why are the intermediate credentials targets omitted here? Because you discovered generating these targets is not the time-limiting step?

The credentials targets that I deleted were all redundant because they were dependent on targets whose functions have a generate_credentials() call at the end.

USGS-R / drb-inland-salinity-ml

220 seasonal shap #221