USGS-R / drb-inland-salinity-ml

Code repo for Delaware River Basin machine learning models that predict inland salinity.
Creative Commons Zero v1.0 Universal

220 seasonal shap #221

Closed jds485 closed 1 year ago

jds485 commented 1 year ago

This PR provides functions that split our dataset into:

1_fetch additions:

2_process additions:

4_predict additions:

_targets.R additions:

I also addressed #216, but I do not have a solution to the problem. I added a function in generate_credentials.R that checks whether the most recent credentials file was generated in the last 10 minutes. As noted in #216, most of the time is spent loading the dummy_var target, which is used to trigger when the AWS credentials should be built. While working on this, I realized that many of the AWS credentials targets could be deleted because the last line of the dependency target's function already calls generate_credentials(). Deleting them reduced the overall time spent on these AWS credentials steps. I recommend that in the future we regenerate credentials inside functions instead of as separate targets.
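For reviewers who want a picture of the 10-minute check without opening generate_credentials.R, here is a minimal sketch of the idea. It is not the code in this PR: the function names `credentials_are_fresh()` and `refresh_credentials_if_stale()`, the `cred_dir` argument, and the use of file modification time are all assumptions made for illustration.

```r
# Hypothetical helper (not the PR's function): TRUE when the newest file in
# `cred_dir` was modified within the last `max_age_min` minutes.
credentials_are_fresh <- function(cred_dir, max_age_min = 10, now = Sys.time()) {
  files <- list.files(cred_dir, full.names = TRUE)
  if (length(files) == 0) {
    # No credentials file yet, so they cannot be fresh
    return(FALSE)
  }
  newest <- max(file.info(files)$mtime)
  age_min <- as.numeric(difftime(now, newest, units = "mins"))
  age_min <= max_age_min
}

# Hypothetical wrapper: only call the (slow) generator when credentials are stale.
# `generate_fn` stands in for generate_credentials(); its real signature is not
# shown in this thread.
refresh_credentials_if_stale <- function(cred_dir, generate_fn, max_age_min = 10) {
  if (credentials_are_fresh(cred_dir, max_age_min)) {
    return(invisible(FALSE))
  }
  generate_fn()
  invisible(TRUE)
}
```

The `now` argument is only there so the age check can be exercised without waiting 10 minutes.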

Closes #220, #216

jds485 commented 1 year ago

Here are example results from Urban and Forest. There are clear differences in feature importance rank, and also in direction for some attributes (an inverse pattern for TOT Clay).

Urban (TOT > 75%): [image: SHAP_global_RF_static_dynamic_temporal_full_vars40]

Forest (TOT > 75%): [image: SHAP_global_RF_static_dynamic_temporal_full_vars40]

jds485 commented 1 year ago

I'm curious what reaches are in each of the spatial data splits (land cover, physio region). I can make a map target for these splits. Would you prefer as a separate issue?

lekoenig commented 1 year ago

> I'm curious what reaches are in each of the spatial data splits (land cover, physio region). I can make a map target for these splits. Would you prefer as a separate issue?

Unless you have a preference, I think that'd be helpful so that I can focus on the code changes in this PR.

jds485 commented 1 year ago

Okay, I'll make a separate issue to map these.

jds485 commented 1 year ago

Thanks, Lauren! Good call on not running the target computations. I forgot to add time requirements in my initial post. The 3 full-dataset SHAP computations take about 15 hours in total to run, plus another 4 hours to make plots and summarize results.

> Why are the intermediate credentials targets omitted here? Because you discovered generating these targets is not the time-limiting step?

The credentials targets that I deleted were all redundant because they were dependent on targets whose functions have a generate_credentials() call at the end.
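To illustrate why a separate credentials target becomes redundant in that situation, here is a rough sketch of the pattern. The wrapper `with_credential_refresh()` and the stand-in functions are hypothetical and not taken from this repo; the real functions simply call generate_credentials() on their last line.

```r
# Hypothetical wrapper: run a pipeline step, then refresh credentials before
# returning, so no separate downstream "credentials" target is needed.
with_credential_refresh <- function(step_fn, refresh_fn) {
  function(...) {
    result <- step_fn(...)
    refresh_fn()  # stands in for a generate_credentials() call
    result
  }
}

# Usage with stand-in functions (both are placeholders, not repo code)
refresh_count <- 0
fake_refresh <- function() refresh_count <<- refresh_count + 1
fetch_step <- with_credential_refresh(function(x) x * 2, fake_refresh)
doubled <- fetch_step(21)
```

Because the refresh happens inside the step itself, any target that depends on that step already runs after fresh credentials exist.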