USGS-R / drb-inland-salinity-ml

Code repo for Delaware River Basin machine learning models that predict inland salinity.
Creative Commons Zero v1.0 Universal

Downstream data processing edits #194

Closed jds485 closed 2 years ago

jds485 commented 2 years ago

This PR addresses 2 small issues related to downstream data processing.

  1. adds a filter to p2_all_attr_SC_obs to remove the erroneous observation documented in #190. The filter should still work if we update our pipeline with the WSC-corrected data. I think we could do that update now with minimal effect on processing time, but it may surface new data issues that I'd prefer to avoid at this time.

Closes #190

  2. adds a data_type column to each of the SC data sources that indicates whether the data are from a daily ('d') or continuous ('u') sampler. After aggregating to PRMS segments (taking the max of 'd' and 'u', which returns 'u' if any days are from a continuous sampler), there are only 4209 site-days with continuous labels out of ~190k site-days. That's about equal to the number of site-days in p2_inst_data_daily, but I'm still wondering if some of the NWIS daily data should also receive a continuous label (i.e., because the hourly data were processed to daily within NWIS). I'm not sure if we'd have a simple way to check that.
    length(which(p2_SC_observations$data_type == 'u'))
    [1] 4209

    Closes #191
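The "max of 'd' and 'u'" aggregation described in item 2 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the toy data frame and the column names `PRMS_segid` and `Date` are assumptions made for the example. It relies on R comparing character strings alphabetically, so `max()` returns 'u' whenever a group contains at least one continuous record.

```r
# Toy site-level observations (hypothetical values and column names)
obs <- data.frame(
  PRMS_segid = c("1", "1", "2", "2"),
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02")),
  data_type = c("d", "u", "d", "d"),
  stringsAsFactors = FALSE
)

# Aggregate to one row per segment-day. max() on character values sorts
# alphabetically, so a group containing both 'd' and 'u' is labeled 'u'.
seg_obs <- aggregate(
  x = obs["data_type"],
  by = obs[c("PRMS_segid", "Date")],
  FUN = max
)

# Count segment-days labeled continuous
n_continuous <- sum(seg_obs$data_type == "u")
```

Here `sum(seg_obs$data_type == "u")` gives the same count as the `length(which(...))` call above whenever the column has no missing values.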

I updated all targets before p4 targets.

jds485 commented 2 years ago

Could you remind me of the motivation/use case for labeling the data types?

The labels are for use in spatial-only holdouts. We can try to get a representative sample of reaches with continuous samplers and w/o continuous samplers for each split (train vs. test, and within training, the analysis vs. assessment sets in CV).
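One way such a representative spatial split might look is a holdout stratified by whether a reach has any continuous data. The sketch below is only an illustration under assumptions: the reach table, the `has_continuous` flag, and the 20% test fraction are hypothetical and are not taken from the pipeline.

```r
set.seed(1)

# Hypothetical reach table: 5 reaches with continuous samplers, 15 without
reaches <- data.frame(
  seg = 1:20,
  has_continuous = rep(c(TRUE, FALSE), times = c(5, 15))
)

test_fraction <- 0.2  # assumed holdout fraction

# Within each stratum, hold out roughly test_fraction of the reaches.
# Indexing with sample.int() avoids the sample() pitfall for length-1 vectors.
test_ids <- unlist(lapply(
  split(reaches$seg, reaches$has_continuous),
  function(ids) {
    n_test <- max(1, round(test_fraction * length(ids)))
    ids[sample.int(length(ids), n_test)]
  }
))

reaches$split <- ifelse(reaches$seg %in% test_ids, "test", "train")

# Cross-tabulate to confirm both strata appear in both splits
split_table <- table(reaches$has_continuous, reaches$split)
```

The same stratified draw could be repeated within the training reaches to form the analysis and assessment sets for cross-validation.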

We should add a comment somewhere that describes what these refer to (maybe you already have that somewhere and I've just missed it).

I have comments in the lines above where the labels are assigned.

jds485 commented 2 years ago

Thanks @lekoenig! I think I addressed all of your comments. The data labeled as continuous are now ~85% of the data, and those are located on ~25% of the reaches with observations. There are only 6/82 reaches with continuous data before the 2020 water year, so our prior temporal split did not have many of these reaches to train on. That definitely motivates doing spatial holdouts with these continuous-data reaches.
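Summaries like these (share of observations labeled continuous, and share of reaches with any continuous data) could be computed along the following lines. The data frame and the `seg` column are made up for illustration and do not reproduce the numbers above.

```r
# Toy segment-day observations (hypothetical)
obs <- data.frame(
  seg = c("a", "a", "a", "b", "c", "d"),
  data_type = c("u", "u", "u", "d", "d", "d"),
  stringsAsFactors = FALSE
)

# Fraction of observations labeled continuous
share_obs <- mean(obs$data_type == "u")

# Fraction of reaches with at least one continuous observation
reach_has_continuous <- tapply(obs$data_type == "u", obs$seg, any)
share_reaches <- mean(reach_has_continuous)
```

In this toy case a single reach contributes half of the observations, which is the same kind of imbalance the comment describes.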

lekoenig commented 2 years ago

The data labeled as continuous are now ~85% of the data, and those are located on ~25% of the reaches with observations. There are only 6/82 reaches with continuous data before the 2020 water year, so our prior temporal split did not have many of these reaches to train on. That definitely motivates doing spatial holdouts with these continuous-data reaches.

Fascinating! Thanks for looking into this and for adding these changes.