USGS-R / drb-inland-salinity-ml

Code repo for Delaware River Basin machine learning models that predict inland salinity.
Creative Commons Zero v1.0 Universal

Updated functions to handle spatial splits #207

Closed: jds485 closed this 1 year ago

jds485 commented 2 years ago

This PR adds options to split the data spatially (by reach, in this case). The spatial split is applied to the training/testing split and to the CV folds used in parameter tuning. The main functions are in train_models.R: make_spatial_split, make_spatial_split_CVtraining, and assign_spatial_split. The first two are similar to the temporal split function. The last one ensures that the training and testing sets are consistent across all models that will be compared.
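For readers unfamiliar with spatial splitting, here is a minimal sketch of the idea in Python. It is not the code in train_models.R: the function names spatial_train_test_split and spatial_cv_folds, the toy reach IDs, and the test fraction are all hypothetical. The key property is that every observation from a given reach lands on the same side of the split, and that the CV folds are built only from training reaches.

```python
import random


def spatial_train_test_split(reach_ids, test_fraction=0.2, seed=0):
    """Map each unique reach to 'train' or 'test' (hypothetical helper).

    Whole reaches are held out, so no reach contributes rows to both sets.
    """
    unique = sorted(set(reach_ids))      # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test = set(unique[:n_test])
    return {r: ("test" if r in test else "train") for r in unique}


def spatial_cv_folds(train_reach_ids, n_folds=2, seed=0):
    """Map each unique training reach to a CV fold index (hypothetical helper)."""
    unique = sorted(set(train_reach_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    return {r: i % n_folds for i, r in enumerate(unique)}


# Toy data: 5 reaches with 2 observations each, as (reach_id, value) rows.
rows = [(f"reach_{i % 5}", float(i)) for i in range(10)]

# Build the assignment once and reuse it for every model being compared.
assignment = spatial_train_test_split([r for r, _ in rows], test_fraction=0.2, seed=1)
train_rows = [row for row in rows if assignment[row[0]] == "train"]
test_rows = [row for row in rows if assignment[row[0]] == "test"]

# CV folds for parameter tuning are drawn from training reaches only.
folds = spatial_cv_folds([r for r, _ in train_rows], n_folds=2, seed=1)
```

Storing the reach-to-set mapping and reusing it, rather than re-drawing the split per model, is what keeps the comparison across models fair in this sketch.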

I set the random and temporal targets to never rebuild, because function edits in this PR would otherwise trigger them to rebuild.

Additional visualization edits:

Closes #205

jds485 commented 1 year ago

Thanks! Yes, accounting for spatial nestedness is on my list. I like the suggestion to define sub-watersheds or regions to hold out. One challenge I've been thinking about is how to ensure a proportional amount of discrete and continuous sampling data within each sub-region. Maybe we can just accept that some sub-regions will have different amounts (which is likely to be the case when predicting in other basins, anyway).
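As a rough way to see how uneven the mix would be, something like the following Python sketch could tabulate the share of continuous data per candidate sub-region before deciding which ones to hold out. The function name sampling_mix_by_region, the region labels, and the records are all made up for illustration; none of this is in the repo.

```python
from collections import Counter


def sampling_mix_by_region(records):
    """Return {region: fraction of records that are continuous} (hypothetical helper).

    `records` is an iterable of (region, sampling_type) pairs, where
    sampling_type is either 'discrete' or 'continuous'.
    """
    totals = Counter()
    continuous = Counter()
    for region, kind in records:
        totals[region] += 1
        if kind == "continuous":
            continuous[region] += 1
    return {region: continuous[region] / totals[region] for region in totals}


# Made-up records for two sub-regions.
records = [
    ("upper", "continuous"), ("upper", "continuous"), ("upper", "discrete"),
    ("lower", "discrete"), ("lower", "discrete"),
    ("lower", "continuous"), ("lower", "discrete"),
]

mix = sampling_mix_by_region(records)
# Here "upper" is 2/3 continuous and "lower" is 1/4 continuous.
```

If the fractions differ a lot between sub-regions, that could either guide which sub-regions get grouped into a holdout set or simply be reported alongside the holdout results.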