Closed jds485 closed 2 years ago
Could you remind me of the motivation/use case for labeling the data types?
The labels are for use in spatial-only holdouts. We can try to get a representative sample of reaches with continuous samplers and w/o continuous samplers for each split (train vs. test, and within training, the analysis vs. assessment sets in CV).
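To make the idea concrete, here is a minimal sketch of a spatial holdout stratified by data-type label. It is written in Python for illustration only and is not the pipeline's code; the function name `stratified_reach_split`, the reach IDs, and the 20% test fraction are all assumptions. The same function could be reapplied to the training reaches to form the analysis vs. assessment sets in CV.

```python
import random
from collections import defaultdict


def stratified_reach_split(reach_labels, test_frac=0.2, seed=0):
    """Split reach IDs into train/test sets, stratified by data-type label.

    reach_labels: dict mapping reach ID -> label, e.g. 'u' for reaches with
    continuous-sampler data and 'd' for reaches without (hypothetical inputs).
    """
    rng = random.Random(seed)

    # Group reaches by label so each group is sampled separately
    by_label = defaultdict(list)
    for reach, label in reach_labels.items():
        by_label[label].append(reach)

    train, test = [], []
    for label in sorted(by_label):
        reaches = sorted(by_label[label])
        rng.shuffle(reaches)
        n_test = round(len(reaches) * test_frac)
        # Hold out at least one reach per label when the group has 2+ reaches
        if len(reaches) > 1:
            n_test = max(1, n_test)
        test.extend(reaches[:n_test])
        train.extend(reaches[n_test:])
    return sorted(train), sorted(test)


# Hypothetical example: 5 reaches with continuous data, 15 without
reach_labels = {f"seg_{i}": ("u" if i < 5 else "d") for i in range(20)}
train, test = stratified_reach_split(reach_labels, test_frac=0.2, seed=42)
```

Because whole reaches are assigned to one side of the split, no reach contributes observations to both train and test, and each side keeps roughly the same mix of continuous and non-continuous reaches.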
We should add a comment somewhere that describes what these refer to (maybe you already have that somewhere and I've just missed it).
I have comments in the lines above where the labels are assigned.
Thanks @lekoenig! I think I addressed all of your comments. The data labeled as continuous are now ~85% of the data, and those are located on ~25% of the reaches with observations. There are only 6/82 reaches with continuous data before the 2020 water year, so our prior temporal split did not have many of these reaches to train on. That definitely motivates doing spatial holdouts with these continuous-data reaches.
Fascinating! Thanks for looking into this and for adding these changes.
This PR addresses 2 small issues related to downstream data processing.
Filter p2_all_attr_SC_obs to remove the erroneous observation documented in #190. The filter should work if we update our pipeline with the WSC-corrected data. I think we could do that update now with minimal effect on processing time, but we may discover new data issues that I'd prefer to avoid at this time. Closes #190
Add a data_type column to each of the SC data sources that indicates whether it is from a daily or continuous sampler. After aggregating to PRMS segments (taking the max of 'd' and 'u', which returns 'u' if any of the aggregated data are from a continuous sampler), there are only 4209 site-days with continuous labels out of ~190k site-days. That's about equal to the number of site-days in p2_inst_data_daily, but I'm still wondering if some of the NWIS daily data should also receive a continuous label (i.e., because the hourly data were processed to daily within NWIS). I'm not sure if we'd have a simple way to check that. Closes #191
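The max-of-labels aggregation can be sketched as follows. This is an illustrative Python version, not the pipeline's code; the function name `aggregate_data_type`, the record layout, and the grouping by segment and date are assumptions. It relies on 'u' sorting after 'd' in string comparison, so the max is 'u' whenever any contributing record is continuous.

```python
def aggregate_data_type(records):
    """Collapse data_type labels to one label per (segment, date).

    records: iterable of (segment_id, date, data_type) tuples, where data_type
    is 'd' (daily) or 'u' (continuous). The layout is hypothetical.
    """
    labels = {}
    for seg, date, dtype in records:
        key = (seg, date)
        # 'u' > 'd' as strings, so max() keeps 'u' once any record has it
        labels[key] = max(labels.get(key, dtype), dtype)
    return labels


# Hypothetical example: seg_1 has one continuous site, seg_2 has none
records = [
    ("seg_1", "2020-10-01", "d"),
    ("seg_1", "2020-10-01", "u"),
    ("seg_2", "2020-10-01", "d"),
    ("seg_2", "2020-10-01", "d"),
]
segment_labels = aggregate_data_type(records)
```

In this example, seg_1 is labeled 'u' for the day because one of its records is continuous, while seg_2 stays 'd'.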
I updated all targets before the p4 targets.