Downstream data processing edits

jds485 commented 2 years ago

This PR addresses 2 small issues related to downstream data processing.

Adds a filter to p2_all_attr_SC_obs that removes the erroneous observation documented in #190. The filter should still work if we update our pipeline with the WSC-corrected data. I think we could do that update now with minimal effect on processing time, but it might surface new data issues that I'd prefer to avoid at this time.

Closes #190

Adds a data_type column to each of the SC data sources that indicates whether the data are from a daily or continuous sampler. After aggregating to PRMS segments (taking the max of 'd' and 'u', which returns 'u' if any days are from continuous samplers), only 4209 of ~190k site-days have continuous labels. That's about equal to the number of site-days in p2_inst_data_daily, but I'm still wondering whether some of the NWIS daily data should also receive a continuous label (i.e., because the hourly data were processed to daily within NWIS). I'm not sure we have a simple way to check that.
```
length(which(p2_SC_observations$data_type == 'u'))
[1] 4209
```
Closes #191
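
For reference, the max-based label aggregation described above can be sketched in base R. This is only an illustration under assumptions, not the pipeline's code: the data frame `obs` and its `seg_id` and `date` columns are hypothetical, and only the `data_type` column with its 'd' and 'u' labels comes from this PR. It relies on character ordering, where "d" sorts before "u", so `max()` returns 'u' whenever any contributing record is continuous.

```r
# Hypothetical site-level observations (made-up values for illustration);
# only the data_type labels 'd' and 'u' come from the PR.
obs <- data.frame(
  seg_id    = c("A", "A", "B", "B", "C"),
  date      = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01",
                        "2020-01-01", "2020-01-02")),
  data_type = c("d", "u", "d", "d", "u"),
  stringsAsFactors = FALSE
)

# max() on character vectors uses string ordering, and "d" < "u", so a
# segment-day is labeled 'u' if any record contributing to it is 'u'.
seg_obs <- aggregate(
  obs["data_type"],
  by  = list(seg_id = obs$seg_id, date = obs$date),
  FUN = max
)

# Count the segment-days labeled as continuous.
n_continuous <- sum(seg_obs$data_type == "u")
```

In this toy input, segment A has both a daily and a continuous record on the same day, so its aggregated label is 'u'.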

I updated all targets before p4 targets.

jds485 commented 2 years ago

Could you remind me of the motivation/use case for labeling the data types?

The labels are for use in spatial-only holdouts. We can try to get a representative sample of reaches with continuous samplers and w/o continuous samplers for each split (train vs. test, and within training, the analysis vs. assessment sets in CV).
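
A minimal sketch of that kind of stratified spatial split, assuming a hypothetical reach table: the `reaches` data frame, the `has_continuous` flag, the 80/20 fraction, the seed, and the reach counts are all made up for illustration and are not taken from the pipeline.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical reach table: 100 reaches, 25 of which have continuous samplers.
reaches <- data.frame(
  seg_id         = seq_len(100),
  has_continuous = rep(c(TRUE, FALSE), times = c(25, 75))
)

train_frac <- 0.8  # assumed split fraction, not from the PR
reaches$partition <- "test"

# Sample within each stratum (continuous vs. no continuous sampler) so both
# partitions keep roughly the same share of continuous-data reaches.
for (rows in split(seq_len(nrow(reaches)), reaches$has_continuous)) {
  n_train    <- round(train_frac * length(rows))
  train_rows <- rows[sample.int(length(rows), n_train)]
  reaches$partition[train_rows] <- "train"
}

# Cross-tabulate to confirm the strata are balanced across partitions.
split_counts <- table(reaches$has_continuous, reaches$partition)
```

The same per-stratum sampling could be repeated inside the training set to build the analysis vs. assessment sets for CV.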

We should add a comment somewhere that describes what these refer to (maybe you already have that somewhere and I've just missed it).

I have comments in the lines above where the labels are assigned.

jds485 commented 2 years ago

Thanks @lekoenig! I think I addressed all of your comments. The data labeled as continuous are now ~85% of the data, and they are located on ~25% of the reaches with observations. Only 6/82 reaches have continuous data before the 2020 water year, so our prior temporal split did not have many of these reaches to train on. That definitely motivates doing spatial holdouts with these continuous-data reaches.

lekoenig commented 2 years ago

Fascinating! Thanks for looking into this and for adding these changes.

USGS-R / drb-inland-salinity-ml

Downstream data processing edits #194