DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/

New WQP summary function #589

Closed ldecicco-USGS closed 2 years ago

ldecicco-USGS commented 2 years ago

This is a pull request for a new WQP summary service.

I'm still not sure if we want to make this its own separate function, or have it be an optional output of the current whatWQPdata function. I'm also not sure what the name should be. When the package is loaded with this new function, you can now do this:

lake_sites <- whatWQPdata2(siteType = "Lake, Reservoir, Impoundment",
                         countycode = "US:55:025")
names(lake_sites)
 [1] "Provider"                          
 [2] "MonitoringLocationIdentifier"      
 [3] "YearSummarized"                    
 [4] "CharacteristicType"                
 [5] "CharacteristicName"                
 [6] "ActivityCount"                     
 [7] "ResultCount"                       
 [8] "LastResultSubmittedDate"           
 [9] "OrganizationIdentifier"            
[10] "OrganizationFormalName"            
[11] "MonitoringLocationName"            
[12] "MonitoringLocationTypeName"        
[13] "ResolvedMonitoringLocationTypeName"
[14] "HUCEightDigitCode"                 
[15] "MonitoringLocationUrl"             
[16] "CountyName"                        
[17] "StateName"                         
[18] "MonitoringLocationLatitude"        
[19] "MonitoringLocationLongitude" 

The cool thing is we can now get information on the CharacteristicName back without grabbing ALL THE DATA.

unique(lake_sites$CharacteristicName)
  [1] "Depth, Secchi disk depth"                                    
  [2] "Height, gage"                                                
  [3] "Cadmium"                                                     
  [4] "Copper"                                                      
  [5] "Lead"     
...
[275] "Phytoplankton Density"                                       
[276] "Volume, total"                                               
[277] "Weather condition (WMO code 4501) (choice list)"     

So let's say we want to get all the long-term USGS phosphorus stream data in Wisconsin. This is the example set up in the tutorial for NWIS here: http://usgs-r.github.io/dataRetrieval/articles/tutorial.html#wisconsin-example For WQP, that's historically been harder to prescreen. With this new function:

library(dplyr)

# Per-year summary rows for phosphorus at Wisconsin stream sites
phos_data <- whatWQPdata2(siteType = "Stream",
                          statecode = "WI",
                          CharacteristicName = "Phosphorus")

# Collapse to one row per site, then keep sites with more than
# 300 results spanning more than 15 years
phWI.1 <- phos_data |> 
  group_by(MonitoringLocationIdentifier) |> 
  summarize(count_nu = sum(ResultCount, na.rm = TRUE),
            begin_year = min(YearSummarized, na.rm = TRUE),
            end_year = max(YearSummarized, na.rm = TRUE)) |> 
  ungroup() |> 
  filter(count_nu > 300) |> 
  mutate(period = end_year - begin_year) |> 
  filter(period > 15)

nrow(phWI.1)
[1] 40

# Compare to the original:
phos_data_og <- whatWQPdata(siteType = "Stream",
                          statecode = "WI",
                          CharacteristicName = "Phosphorus")

phWI.2 <- phos_data_og |> 
  filter(resultCount > 300)
nrow(phWI.2)
[1] 84

Right away we're able to call 44 fewer sites with confidence that we're only asking for the data that we're interested in. In this particular example, that might not make a big difference, but with bigger data requests, that could be a big time saver.
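
As a rough sketch of how the prescreened site list might then feed the actual data request: the helper below (chunk_ids is a hypothetical name, not part of dataRetrieval) splits the site IDs into batches so that each request stays small. The IDs are made up for illustration, the batch size of 20 is an arbitrary assumption, and the readWQPdata call is commented out because it needs network access.

```r
# Hypothetical helper: split a vector of site IDs into batches
# of at most `size` IDs each.
chunk_ids <- function(ids, size = 20) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Made-up IDs standing in for phWI.1$MonitoringLocationIdentifier
site_ids <- paste0("SITE-", 1:45)
batches <- chunk_ids(site_ids, size = 20)
lengths(batches)  # three batches holding 20, 20, and 5 IDs

# Not run here (needs network access and the dataRetrieval package):
# results <- lapply(batches, function(b) {
#   dataRetrieval::readWQPdata(siteid = b,
#                              characteristicName = "Phosphorus")
# })
# phos_results <- dplyr::bind_rows(results)
```

Whether batching helps at all will depend on the size of the request; for the 40 sites above, a single call may be fine.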

This service can also efficiently focus on the last 1 or 5 years with the summaryYears argument (if not specified, it will default to all years):

dane_county_data <- whatWQPdata2(countycode = "US:55:025",
                                 summaryYears = 5,
                                 siteType = "Stream")

ldecicco-USGS commented 2 years ago

@lindsayplatt, @limnoliver, @jread-usgs, @aappling-usgs, @wdwatkins (and please tag anyone else who might be interested in large WQP data pulls)... Any feedback is appreciated!

Specifically...

  1. Do you think this function should be its own new function? If so, does anyone have a better name than whatWQPdata2? OR
  2. Do you think this functionality should be wrapped up into the original whatWQPdata function? We could add a user argument which would allow the user to specify the output so as to not lose the original output option.

My vote is to keep it as a new separate function (option 1). I'm not too upset with the name because it helps convey that it has VERY similar functionality and that it's a newer service. That being said...I'm terrible at naming things and easily swayed by you all's opinions.

I'm hoping to put together a vignette/blog that uses this function in a targets workflow to make a large-scale data pull. Requests for help with that kind of workflow are becoming more frequent. You all have done more work on that than I have, so if anyone is interested in (a) writing it outright, (b) collaborating as a blog co-author, or (c) offering general feedback or doing an official review, let me know.

jordansread commented 2 years ago

Cool!

I haven't thought about function names, but agree this service makes sense as a separate function.

For the vignette/blog, that sounds great. We do have some funding and direction to create some "example implementations of common data processing workflows" (w/ targets) this year and I figured a WQP pull would be high on that list. We could collaborate on this or discuss how to divide up the roles. If you lead the blog, I'd want to give it a careful review, since it complements our deliverables and would set a pattern for early adopters. I'd probably suggest some changes that might otherwise be unnecessary but in this case would align it with how we're teaching targets and workflows to project staff and new employees.

ldecicco-USGS commented 2 years ago

Awesome! I've got 2 people interested in similar sample workflows in the near term. I would be more than happy to have someone else take the lead on the blog/vignette, and the timeline for that is completely flexible.

How does this sound: I put together a targets-free script using this function to kick things off (in the next couple of days so they can get their work started). I'll share that with the user and you all. Then someone from DS can take that example and spruce it up with a super-charged targets workflow on your own timeline? (Or...the DS team will have their own pipeline example, and the user and I can use your writeup to convert the script to a data pipeline.)

lstanish-usgs commented 2 years ago

I agree that it makes sense to keep it as a separate function. As for the name, what about something like summarizeWQPdata()? I like that it is descriptive.

ldecicco-USGS commented 2 years ago

HAHAHAHAHA 🤣 We already have this function...it's called readWQPsummary: https://github.com/USGS-R/dataRetrieval/blob/master/R/whatWQPsites.R#L111 Looks like it was initially written in April 2019...and then put in this form in Feb. 2021....

I'm finally using it to help users with big pulls and here's what I learned.

It's SLOW. It's faster to ask for all the data in most situations that I've tested. The exception is if you specifically want to know what chemicals were measured in the last 5 years at certain site types. But if you want general information (count, date range) on specific parameters at a large geographic scale, it's slow.
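
For anyone who wants to check this on their own query, here is a minimal timing sketch. The helper name elapsed_seconds is hypothetical, the query arguments are only illustrative assumptions, and the two dataRetrieval calls are commented out because they need network access (and the full pull may be large).

```r
# Hypothetical helper: wall-clock seconds taken to run a function once
elapsed_seconds <- function(f) {
  unname(system.time(f())["elapsed"])
}

# Stand-in call that needs no network, just to show the helper works
t_demo <- elapsed_seconds(function() Sys.sleep(0.2))

# Not run here (needs network access and the dataRetrieval package):
# t_summary <- elapsed_seconds(function() {
#   dataRetrieval::readWQPsummary(statecode = "WI", siteType = "Stream")
# })
# t_full <- elapsed_seconds(function() {
#   dataRetrieval::readWQPdata(statecode = "WI", siteType = "Stream",
#                              characteristicName = "Phosphorus")
# })
# c(summary = t_summary, full = t_full)
```

A single run like this is noisy, since service load and caching vary, so repeating it a few times would give a fairer comparison.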

SO, we've got a ticket in the WQP development system to speed up the summary services by adding specific CharacteristicName or CharacteristicType optional arguments, and to make sure state (and HUC?) are indexed to make this service more useful.