USGS-R / delaware-model-prep

Data and scripts for collecting and formatting data in the Delaware River Basin in prep for ML and DA modeling

Bad flow observations #89

Open jsadler2 opened 3 years ago

jsadler2 commented 3 years ago

I found some really wonky flow observations in obs_flow_full.csv. There are 108 observations with a value of about -14150: [image: MicrosoftTeams-image (12)]

jsadler2 commented 3 years ago

Any ideas why this is happening?

jsadler2 commented 3 years ago

It doesn't seem like these are the values in NWIS 🤔 : image

jzwart commented 3 years ago

Strange. Looks like the negative discharges might be offset?

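# d is assumed to hold the flow observations from obs_flow_full.csv
bad <- dplyr::filter(d, discharge_cms < -1000)
# plot the suspicious values through time
plot(bad$discharge_cms ~ as.Date(bad$date), type = 'o')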

image

limnoliver commented 3 years ago

Hey Jeff - trying to track down more info here. Maybe this is an out-of-date file? I can't find where it's being created in the pipeline. The flow files are now generated from the national flow pull and then subset to the DRB here. This is a good reminder (for myself) to periodically clean up the Google Drive associated with the project.

jsadler2 commented 3 years ago

@limnoliver - obs_flow_full.csv is built here, which depends on what you linked to above. I just built obs_flow_drb.rds and I see the same bad data.

limnoliver commented 3 years ago

Thanks Jeff! No wonder I couldn't find it in 2_observations. Will investigate!

limnoliver commented 3 years ago

Okay, issue partially figured out. My first clue was that the site ID was listed twice, which means there were two unique values on that day, and the data were being aggregated in some way (happening here).
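For anyone wanting to check for this pattern elsewhere, here is a minimal base R sketch of the "site listed twice on one day" clue. The toy data frame and its column names (`site_id`, `date`, `discharge_cfs`) are assumptions for illustration, not the pipeline's actual schema.

```r
# Toy stand-in for the flow observations; values and column names are made up.
obs <- data.frame(
  site_id = c("01465500", "01465500", "01465500", "01463500"),
  date = as.Date(c("2019-01-01", "2019-01-01", "2019-01-02", "2019-01-01")),
  discharge_cfs = c(600, -999999, 610, 12000)
)

# Count rows per site-day; any count above 1 would get aggregated downstream.
counts <- aggregate(discharge_cfs ~ site_id + date, data = obs, FUN = length)
names(counts)[3] <- "n_values"

dup_site_days <- counts[counts$n_values > 1, ]
print(dup_site_days)
```

Any site-day that shows up in `dup_site_days` is a candidate for the kind of silent averaging described above.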

Some site/parameter-code combos return multiple columns when you retrieve them from NWIS. This site, for example, looks like this when you pull it with dataRetrieval:

test <- dataRetrieval::readNWISdv(siteNumbers = '01465500', parameterCd = '00060')

image

...which likely means discharge is being measured at two locations at the site. In the national temperature pipeline pulls, I usually pick the "best" column by choosing the one with the most data when I have to (e.g., when there is more than one observation for a site-day). My guess is that we didn't handle this in the national flow pipeline, so both columns were being passed through and then averaged. In theory, I think this is okay, except that one of those columns had some -999999.0 values, which I assume is an error code.
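A quick plausibility check on the averaging theory, as a sketch only. It assumes the two columns were averaged in cubic feet per second (the unit of parameter 00060) and then converted to cubic meters per second; the 600 cfs "good" reading is a made-up value, and the real pipeline may order these steps differently.

```r
cfs_to_cms <- 0.3048^3     # cubic feet -> cubic meters

good_cfs <- 600            # hypothetical valid reading (made up)
sentinel_cfs <- -999999    # value seen in the national pull

# Averaging a valid reading with the sentinel, then converting to cms.
averaged_cms <- mean(c(good_cfs, sentinel_cfs)) * cfs_to_cms
print(averaged_cms)

# Treating the sentinel as missing before averaging keeps the valid reading.
vals <- c(good_cfs, sentinel_cfs)
vals[vals == sentinel_cfs] <- NA
cleaned_cms <- mean(vals, na.rm = TRUE) * cfs_to_cms
print(cleaned_cms)
```

With these made-up inputs the average lands near -14150 cms, which would be consistent with the bad values in obs_flow_full.csv, though that does not prove this is the exact path the data took.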

The weird part is that these -999999.0 values exist in the national pull data (from 2_observations/in/daily_flow.rds) but I can't recreate them from the above NWIS pull. Maybe they were fixed sometime between the national flow pull (~10 months ago) and now?

limnoliver commented 3 years ago

And just confirming, this appears to be what's happening in the flow pipeline. Note here that the column-selection step is commented out, and col_name is then dropped when the uv and dv data are bound together.
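A rough sketch of what a column-selection step could look like, under stated assumptions: the wide data frame, the column names `flow_a` and `flow_b`, and the -999999 sentinel handling are all hypothetical stand-ins, not the actual pipeline code or the actual dataRetrieval column names.

```r
# Toy wide table with two discharge columns for one site (made-up values).
dv <- data.frame(
  site_no = "01465500",
  Date = as.Date("2019-01-01") + 0:4,
  flow_a = c(600, 610, NA, 590, 580),
  flow_b = c(NA, -999999, -999999, 595, NA)
)
sentinel <- -999999
flow_cols <- c("flow_a", "flow_b")

# Recode the assumed error code to NA so it cannot leak into later summaries.
dv[flow_cols] <- lapply(dv[flow_cols], function(x) {
  x[x %in% sentinel] <- NA
  x
})

# Keep the column with the most non-missing values, and record which one it was.
n_obs <- colSums(!is.na(dv[flow_cols]))
best_col <- names(which.max(n_obs))

dv_long <- data.frame(
  site_no = dv$site_no,
  date = dv$Date,
  flow_cfs = dv[[best_col]],
  col_name = best_col
)
print(dv_long)
```

Carrying `col_name` along keeps a record of which series was chosen, so a later bind of uv and dv data does not lose track of where each value came from.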

jsadler2 commented 3 years ago

> The weird part is that these -999999.0 values exist in the national pull data (from 2_observations/in/daily_flow.rds) but I can't recreate them from the above NWIS pull. Maybe they were fixed sometime between the national flow pull (~10 months ago) and now?

That is weird. It's kind of comforting that those values aren't there anymore, but also not, since now it's a phantom problem.

aappling-usgs commented 3 years ago

For my postdoc on metabolism estimation, we re-pulled input data from NWIS about a year after the initial pull and saw groups of sites where whole sections of data changed - one change I remember seemed to have to do with correcting a timezone issue, and I think there were also cases where data that had initially been available but weird were taken off NWIS entirely. So I'm not surprised that there might be similar cases in the discharge data for our current projects.
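One way to make this kind of drift visible is to diff an archived pull against a fresh one. The sketch below uses two toy data frames with assumed column names (`site_id`, `date`, `flow`); in practice the old table would come from the archived pull and the new one from a current NWIS request.

```r
# Toy archived and fresh pulls for the same site-days (made-up values).
old_pull <- data.frame(
  site_id = "01465500",
  date = as.Date("2019-01-01") + 0:2,
  flow = c(600, -999999, 610)
)
new_pull <- data.frame(
  site_id = "01465500",
  date = as.Date("2019-01-01") + 0:2,
  flow = c(600, 605, 610)
)

# Full join on site-day so rows added or removed between pulls also show up.
both <- merge(old_pull, new_pull, by = c("site_id", "date"),
              all = TRUE, suffixes = c("_old", "_new"))

# Flag rows where a value appeared, disappeared, or changed.
one_missing <- is.na(both$flow_old) != is.na(both$flow_new)
both_differ <- !is.na(both$flow_old) & !is.na(both$flow_new) &
  both$flow_old != both$flow_new
changed <- both[one_missing | both_differ, ]
print(changed)
```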