DOI-USGS / national-flow-observations

This repository pulls national flow data from NWIS

Choosing flow columns #5

Closed limnoliver closed 4 years ago

limnoliver commented 4 years ago

In some instances, there are multiple flow columns returned for a site. This usually happens when 1) multiple sensors are deployed, or 2) something about the sensor changed through time. Clues to why there are different columns are in the column names. With temperature, for example, if there is a thermistor chain measuring temperature at multiple depths, the column names will have suffixes such as _0m_depth, _2m_depth, and so on.

The way we condensed this data for the stream temperature workflow was to select, for each site, the column that had the most data. This assumes the multiple columns come from sensors deployed at the same time, so that keeping only the fullest one loses little. We could lose data in cases where the different sensors were deployed at different times, and the period of record for that site could then look shorter than it really is.
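To make the trade-off concrete, here is a minimal sketch of the "keep the column with the most data" rule. It is written in Python purely for illustration and is not the stream temperature workflow's code; the record layout, the site and column names, and the `pick_fullest_column` helper are all assumptions.

```python
from collections import Counter

# Hypothetical long-format records: (site, col_name, date, flow).
# Site and column names are made up for illustration.
records = [
    ("site_A", "flow_sensor_1", "2001-01-01", 10.0),
    ("site_A", "flow_sensor_1", "2001-01-02", 11.0),
    ("site_A", "flow_sensor_2", "2005-06-01", 9.0),
    ("site_A", "flow_sensor_2", "2005-06-02", 9.5),
    ("site_A", "flow_sensor_2", "2005-06-03", 9.7),
    ("site_B", "flow", "2010-03-01", 3.2),
]


def pick_fullest_column(records):
    """Return {site: col_name}, choosing the column with the most non-missing values."""
    counts = Counter(
        (site, col) for site, col, _day, flow in records if flow is not None
    )
    best = {}
    # Sorting makes ties deterministic: the alphabetically first column wins.
    for (site, col), n in sorted(counts.items()):
        if site not in best or n > counts[(site, best[site])]:
            best[site] = col
    return best


chosen = pick_fullest_column(records)

# Keep only the rows belonging to each site's chosen column.
kept = [r for r in records if chosen[r[0]] == r[1]]
```

In this made-up example, site_A keeps only flow_sensor_2, so the 2001 observations from flow_sensor_1 are dropped and the site's record appears to start in 2005.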

@lindsayplatt, it seems for your work, you would need an inventory of any flow data at each site. We could keep all data in the initial combine steps, and we could reduce the data later when we get to the modeling step. Thoughts?

lindsayplatt commented 4 years ago

Yes, I think we wouldn't want to pick the column with the most data for our needs, because we could lose earlier data and we really want an accurate depiction of the period of record. This might be where we need to start branching out based on needs.

For our needs, we could make a function that reduces the flow info into a period of record using all of the flow columns. Then, for the main goal of the repo, which is to be able to use the flow data, you could use the approach of keeping the column with the most observations.
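A rough sketch of what such a period-of-record function might look like, assuming long-format records of (site, col_name, date, flow). This is Python for illustration only; `period_of_record` and the sample data are hypothetical and not part of the repo.

```python
from datetime import date


def period_of_record(records):
    """Return {site: (first_date, last_date, n_days_with_data)} using every flow column.

    A day counts as observed if any column has a non-missing value on it.
    """
    days = {}
    for site, _col_name, day, flow in records:
        if flow is not None:
            days.setdefault(site, set()).add(day)
    return {site: (min(d), max(d), len(d)) for site, d in days.items()}


# Hypothetical observations: two sensors with overlapping and disjoint days.
obs = [
    ("site_A", "flow_sensor_1", date(2001, 1, 1), 10.0),
    ("site_A", "flow_sensor_1", date(2001, 1, 2), 11.0),
    ("site_A", "flow_sensor_2", date(2001, 1, 2), 10.8),
    ("site_A", "flow_sensor_2", date(2005, 6, 1), 9.0),
    ("site_A", "flow_sensor_2", date(2005, 6, 2), None),  # missing value, ignored
]

por = period_of_record(obs)
```

Because the days are pooled across columns before taking the minimum and maximum, a sensor swap partway through the record does not shorten the reported period, and days covered by two sensors are counted once.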

limnoliver commented 4 years ago

Solution: the pipeline currently builds a long data frame. If there are multiple flow columns for a site, they get stacked, and the original column name is retained in its own column.
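The stacking step can be pictured roughly as follows. This is an illustrative Python sketch under assumed names (`stack_flow_columns`, the `datetime`/`flow` keys, and the sensor column names are all hypothetical), not the pipeline's actual code.

```python
def stack_flow_columns(site, wide_rows, flow_cols):
    """Turn one site's wide rows into long rows, one per (timestamp, column) value.

    The source column name is kept in "col_name" so the choice of which
    column to use can be deferred to a later step.
    """
    long_rows = []
    for row in wide_rows:
        for col in flow_cols:
            value = row.get(col)
            if value is not None:  # skip gaps rather than stacking missing values
                long_rows.append(
                    {
                        "site": site,
                        "col_name": col,
                        "datetime": row["datetime"],
                        "flow": value,
                    }
                )
    return long_rows


# Hypothetical wide data for one site with two flow columns.
wide = [
    {"datetime": "2001-01-01T00:00", "flow_sensor_1": 10.0, "flow_sensor_2": None},
    {"datetime": "2001-01-01T00:15", "flow_sensor_1": 10.4, "flow_sensor_2": 10.1},
]

stacked = stack_flow_columns("site_A", wide, ["flow_sensor_1", "flow_sensor_2"])
```

Nothing is discarded at this stage other than missing values, which is what lets the period-of-record use case and the modeling use case share the same combined data.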

For modeling purposes, we still need to reduce those data by selecting which column name to retain. Daily values are calculated using group_by(site, col_name, date), so the multiple-column (or multiple-sensor) issue is retained through the daily values.
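For readers less familiar with the grouping, here is a small Python sketch of the same idea: one daily value per (site, col_name, date). It assumes the daily summary is a mean, which this thread does not confirm, and the `daily_means` helper and sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def daily_means(long_rows):
    """Return {(site, col_name, date): mean flow}, keeping columns separate.

    Assumption: the daily statistic is a mean; the real pipeline may differ.
    """
    groups = defaultdict(list)
    for site, col_name, timestamp, flow in long_rows:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        groups[(site, col_name, day)].append(flow)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical sub-daily readings from two sensors at one site.
readings = [
    ("site_A", "flow_sensor_1", "2001-01-01T00:00", 10.0),
    ("site_A", "flow_sensor_1", "2001-01-01T12:00", 12.0),
    ("site_A", "flow_sensor_2", "2001-01-01T00:00", 9.0),
    ("site_A", "flow_sensor_2", "2001-01-02T00:00", 8.0),
]

daily = daily_means(readings)
```

Since col_name is part of the grouping key, a site with two sensors reporting on the same day still yields two daily values for that day, so the column selection has to happen afterward.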