USEPA / EPATADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.
https://usepa.github.io/EPATADA/
Creative Commons Zero v1.0 Universal
40 stars 18 forks source link

Address EPA Use Case via TADA_BigDataRetrieval #362

Open cristinamullin opened 11 months ago

cristinamullin commented 11 months ago

Is your feature request related to a problem? Please describe.

EPA partners may need to download large amounts of WQP data for specific projects. Using contractors, the current process can take a lot of time to learn and implement. Here is an example need:

Thanks so much again for taking the time last Thursday to discuss the TADA tool functions and troubleshooting with our contractor on data downloads from the Water Quality Portal. We really appreciate your insights and recommendations.

Following up on that discussion, I wanted to share the details for how we downloaded the WQP data for the previous cycle of CCL For CCL 6, we are looking to repeat the procedure described below, with a slight modification to the year ranges; we need to acquire NAWQA monitoring program data for 1991 – 2021, and NWIS database results not associated with the NAWQA program for years 2013 – 2022:

Data source: National Water Information System (NWIS) and National Ambient Water Quality Assessment (NAWQA) Programs – Water Quality Portal (WQP), USGS Reference: United States Geological Survey (USGS). n.d. National Water-Quality Assessment (NAWQA) Program. Accessed via the Water Quality Portal (WQP). URL: https://www.waterqualitydata.us/portal/. Accessed January 2018. United States Geological Survey (USGS). n.d. National Water Information System (NWIS). USGS Water Data for the Nation. Accessed via the Water Quality Portal (WQP). URL: https://www.waterqualitydata.us/portal/. Accessed January 2018.

Data description: The Water Quality Portal is a collaborative tool sponsored by EPA, USGS, and the National Water Quality Monitoring Council (NWQMC) that allows access to water quality data collected by state, tribal, local and federal agencies. The Water Quality Portal is used to access the USGS National Water Information Services (NWIS) database. The NWIS relational database houses every piece of data that USGS collects, including information like gauge heights and compound concentration data and results from the National Water-Quality Assessment (NAWQA) program. The goals of the NAWQA program include assessing the condition of the nation’s streams, rivers, and groundwater and identifying how those conditions are changing over time. The NAWQA program is designed to be statistically representative of water conditions in the nation. NAWQA data are considered nationally representative, whereas NWIS results are not expected to be statistically representative of the U.S. These data sources were used as primary data sources for CCL 5.

Data download: In the Water Qualtiy Portal, EPA downloaded all data from the NAWQA monitoring program from 1991 through 2017. The results in the NWIS database that were not associated with the NAWQA program were downloaded for samples collected from 2008 through 2017. Raw data were download using REST API and saved into a SQL Server database. Data excluded from the analysis include non-water data, data from media other than ground water or surface water (e.g., leachate, etc.), and data with non-standard units of measure.

You had mentioned during the call having some contacts on the Water Quality Portal Team; we’re hoping you might be able to reach out on our behalf and describe the data download we need for CCL 6 if it wouldn’t be too much trouble. We’re especially interested to know if there’s a back-channel method we can use to acquire the data outside of the public-facing portal to hopefully speed up the process for our contractor

#install and load TADA and dataRetrieval
if(!"remotes"%in%installed.packages()){
  install.packages("remotes")
}
library(remotes)
remotes::install_github("USEPA/TADA", ref = "develop", dependencies = TRUE)
library(TADA)

# Query WQP for all USGS NAWQA data in WY (all years)
test = TADA_DataRetrieval(
  sampleMedia = c("Water", "water"),
  statecode = "WY", # consider downloading only 1 state at a time
  project = "National Water Quality Assessment Program (NAWQA)", 
  applyautoclean = FALSE
)

# Query WQP for all USGS NAWQA data in CT (all years)
test2 = TADA_DataRetrieval(
  sampleMedia = c("Water", "water"),
  statecode = "CT", # consider downloading only 1 state at a time
  project = "National Water Quality Assessment Program (NAWQA)", 
  applyautoclean = FALSE
)

# query WQP for all data (NWIS and WQX) from last 10 years in CT
test3 = TADA_DataRetrieval(
  startDate = "2013-01-01", 
  endDate = "2022-12-31", 
  sampleMedia = c("Water", "water"),
  statecode = "CT", # consider downloading only 1 state at a time
  applyautoclean = FALSE
)

# query WQP for data (NWIS ONLY) from last 10 years in CT
test4 = TADA_DataRetrieval(
  startDate = "2013-01-01", 
  endDate = "2022-12-31", 
  sampleMedia = c("Water", "water"),
  statecode = "CT", # consider downloading only 1 state at a time
  providers = "NWIS",
  applyautoclean = FALSE
)

remotes::install_github("USGS-R/dataRetrieval")
library(dataRetrieval)
# Summary of the data available in New Jersey:
# last 5 years
summaryservice <- readWQPsummary(statecode="NJ",
                                 summaryYears = 5)
# all years
summaryservice2 <- readWQPsummary(statecode="NJ",
                                  summaryYears = "all")

Describe the solution you'd like

Unfortunately there isn't a good back door method to get large amounts of data. Instead, we need to develop a workflow that would work for this scenario. Within TADA's retrieval function, we should be able to automatically chunk queries by state and also year if needed depending on the total number of sites and/or rows in each chunk using the summary service. It would be great if the function could do that automatically to address this scenario, so the user could simply say "I need all data in the WQP for the last 10 years", and be able to get it via the big data retrieval function (running overnight).

I went ahead and created an example.R file with queries that work on my end. We would to loop through each state (and likely by year as well). Note: USGS also shared these instructions for large data pulls which uses the WQP summary service: https://waterdata.usgs.gov/blog/large_sample_pull/

Describe alternatives you've considered

A different workflow such as targets.

Additional context

Consider combining TADA big and reg data retrieval functions into one.

Reminders for TADA contributors addressing this issue

New features should include all of the following work:

cristinamullin commented 9 months ago

ECHO WQI is another EPA use case for WQP data (need to refresh 2-4 times a year for application): https://echo.epa.gov/maps/wqimap?check_logged_in=1. The solution should automate the chunking download solution for large queries, and should also provide an option to not re-download data if it already exists (has already been downloaded) and does not have a more recent change data from the last download date (only download new or updated data). They currently rely on the site and result profiles.