DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/
Other
260 stars 84 forks source link

Why does readNWISdata return multiple sites when only 1 is inputted into the function? #643

Closed raruggie closed 1 year ago

raruggie commented 1 year ago

What is your question? 'readNWISdata' funciton returns a dataframe with multiple sites when one is given in the 'sites' argument.

To Reproduce

library(dataRetrieval)
questionable_query <-readNWISdata(sites = "01200000", service = "qwdata", startDate = as.Date("2019-10-01"), endDate = as.Date("2019-10-05"))

Expected behavior I expect this to return a dataframe of all the water quality data for this site for the date range inputted into the function.

Screenshots instead it returns a dataframe with multiple sites listed: image

Session Info Please include your session info:

sessionInfo()
#OR preferred:
devtools::session_info()
- Session info -------------------------------------------
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2022-10-29  

Additional context Add any other context about the problem here.

ldecicco-USGS commented 1 year ago

The NWIS "qwdata" service is a very unique web service. The readNWISdata was originally only designed to work with services that used the https://waterservices.usgs.gov service calls (qwdata uses https://nwis.waterdata.usgs.gov/nwis/qwdata).

If you've got a somewhat recent version of dataRetrieval you should be getting a Warning when you run the readNWISdata function with the argument service = "qwdata". The warning is:

Warning message:                                                                                                      
NWIS qw web services are being retired.
Please see vignette('qwdata_changes', package = 'dataRetrieval')
for more information.
https://cran.r-project.org/web/packages/dataRetrieval/vignettes/qwdata_changes.html

Sometime in the not so distant future we won't be able to ask for discrete water quality data from the NWIS web services. Instead, we'll all need to switch to using the Water Quality Portal. So for your query above, you'd want to use:

preferred_query <- readWQPdata(siteid = "USGS-01200000", 
                               startDateLo = as.Date("2019-10-01"), 
                               startDateHi = as.Date("2019-10-05"))

There didn't seem to be any data returned for that date. I checked the available data like this:

whats_available <- whatNWISdata(siteNumber = "01200000")

what_qw <- whats_available |> 
  filter(data_type_cd == "qw")

range(what_qw$begin_date)
[1] "1992-10-19" "1992-10-20"
range(what_qw$end_date)
[1] "1992-10-19" "1992-10-20"

So there doesn't seem to be data in Oct. 2019, only Oct. 1992. For the sake of additional examples, I'll ask for data in October 1992. If you want to use the NWIS function readNWISqw....keep in mind it will literally stop working sometime in 2023 (that's the latest I've heard). So I wouldn't recommend this, but if you need it for a quick fix:

limited_use <- readNWISqw("01200000", 
                          parameterCd = "all", 
                          startDate = as.Date("1992-10-01"),
                          endDate = as.Date("1992-10-31"))

If you are looking for a set of sites for specific dates, you can use this simplified function:

preferred_query_2 <- readWQPqw(siteNumbers = "USGS-01200000", 
                               parameterCd = "", 
                               startDate = as.Date("1992-10-01"), 
                               endDate = as.Date("1992-10-31"))

If you are curious why we never optimized the readNWISdata function for the water quality data, here is the input you would need to get the water quality data from that function.

questionable_query <- readNWISdata(multiple_site_no = "01200000", service = "qwdata", 
                                   sort_key = "site_no",
                                   group_key = "NONE",
                                   inventory_output = 0,
                                   begin_date = as.Date("1992-10-01"), 
                                   end_date = as.Date("1992-10-31"),
                                   radio_parm_cds = "all_parm_cds",
                                   date_format = "YYYY-MM-DD",
                                   list_of_search_criteria = "multiple_site_no",
                                   rdb_compression = "value",
                                   qw_attributes = 0,
                                   qw_sample_wide = 0,
                                   rdb_qw_attributes = "expanded")
raruggie commented 1 year ago

Laura, thank you very much for this. For some reason if I use readNWIS data with service = qwdata, or I use readNWISqw, I don't get an error (I only downloaded dataRetrival package a week or two ago, I believe I used install.packages). Just to clarify, in 2023 readNWISdata and readNWISqw will not work but readWQPqw will?

ldecicco-USGS commented 1 year ago

Sometime in 2023, we'll be losing the qwdata services from NWIS. So, readNWISqw will stop working entirely, and readNWISdata will not be able to use service="qwdata" (which as this issue illustrates...is not a great use for readNWISdata anyway). You will still be able to use readNWISdata for all other data types (dv, uv, gwlevels, etc...)

You should be seeing a "Warning" (not an error) when using either readNWISqw or if you use service="qwdata" in readNWISdata.

Going forward, for discrete water quality data from the USGS, you'll want to use the "WQP" functions: https://rconnect.usgs.gov/dataRetrieval/reference/#water-quality-portal-wqp-

See: https://cran.r-project.org/web/packages/dataRetrieval/vignettes/qwdata_changes.html

raruggie commented 1 year ago

I see that my dataRetrival version 2.7.6 and the help file you provided (https://cran.r-project.org/web/packages/dataRetrieval/vignettes/qwdata_changes.html) says the warning comes starting with version 2.7.9., makes sense I don't get the warning.

Should I also migrate whatNWISdata function over to the whatWQPdata function? They return different dataframes, but can the same info from the NWIS version of this function be extracted from the 'WQP' version of this function? The NWIS version returned multiple rows for each site inputted where each row was a parameter for a site, while WQP returns a single row for every site (I couldn't find the metadata for the columns activity count and result count, but I assume those correspond to the number of unique parameters for those sites?)

raruggie commented 1 year ago

here is my code:

library(tidyverse)
library(dataRetrieval)

# NWIS functions
NY <- whatWQPsites(statecode="US:36",characteristicName = "Phosphorus", siteType="Stream")%>%
  filter(ProviderName == "NWIS")%>% # need to filter NWIS site as whatNWISdata only works with NWIS sites?
  mutate(site_cd = substring(MonitoringLocationIdentifier, 6)) # create new column with just the number part of the site 
site_data_NWIS <- whatNWISdata(siteNumber = NY$site_cd)

# WQP functions
NY <- whatWQPsites(statecode="US:36",characteristicName = "Phosphorus", siteType="Stream") 
site_data_WQP<-whatWQPdata(siteNumber = NY$MonitoringLocationIdentifier)
ldecicco-USGS commented 1 year ago

My understanding is that there will always be some service that will be available from NWIS to get that same info that is currently being supplied in the whatNWISdata function. There's a good chance that the guts of the function will need to be re-written someday, but I confirmed that there is a priority to make sure we can ask we'll be able to find out all the data available for a USGS site from a single NWIS query even after the qwdata service is shut down.

So...you can continue to use the whatNWISdata function (especially if you are absolutely only concerned with USGS sites....if you want to get water quality data from non-USGS data, you'll need to use the whatWQPdata function)

The function in WQP that is most similar to whatNWISdata is a newer function readWQPsummary. There's no equivalent output, but in the summary function, you can find out what data is available each year. This article shows an example for how to use it: https://rconnect.usgs.gov/dataRetrieval/articles/wqp_large_pull_script.html

raruggie commented 1 year ago

I am having trouble with the readWQPsummary function, it is returning an empty dataframe:

library(dataRetrieval)

state_cd_cont <- stateCd[c(2,4:12,14:52),] 
rownames(state_cd_cont) <- seq(length=nrow(state_cd_cont))

i<-30 # this is New York in state_cd_cont dataframe
state_cd <- state_cd_cont$STATE[i]
state_nm <- state_cd_cont$STUSAB[i]

ny <- readWQPsummary(statecode = state_cd,CharacteristicName = "Nitrogen",siteType = "Stream") # returns empty dataframe
ny <- readWQPsummary(statecode="US:36",siteType = "Stream") # still returns empty dataframe
ny <- whatWQPdata(statecode="US:36",siteType = "Stream") # this works but not the same as whatNWISdata dataframe

I've also tried this for different states (i.e. changing 'i') but still returns an empty dataframe

raruggie commented 1 year ago

I suspect the issue could be I am using dataRetrieval version 2.7.6 and the link you sent says 2.7.11.1... I tried to update the package by install.packages('dataRetrieval') in a new R studio session (https://stackoverflow.com/questions/21461649/how-to-update-a-package-in-r), which ran successfully, but when I check sessionInfo() it still says I have 2.7.6

raruggie commented 1 year ago

My R version is 4.0.2, and I see that it does not recognize the new R pipe |> implemented in version 4.1 (https://www.r-bloggers.com/2021/05/the-new-r-pipe/), could this be the issue?

ldecicco-USGS commented 1 year ago

Try updating to the latest github version. I expect to push out some updates to CRAN soon, but not sure exactly when that will happen. So, first you'll need the "remotes" package. Then, you can install like this:

#install.packages("remotes") #run this if you don't have the remotes package already installed
remotes::install_github("USGS-R/dataRetrieval")

when I run the code you posted above, I do get a data frame, so try updating and let me know if that works. I don't see any modern pipes in the code, so I don't think that should matter (unless you are trying to run the full "scripting" article I posted above... you could change those pipes to %>% and it should work).

raruggie commented 1 year ago

readWQPsummary works when updating to 2.7.11.1 thank you!

When I run readWQPsummary and whatNWISdata with the same search criteria (New York, Phosphorus, streams), I am getting unexpected results. I expect that the WQP function will return a more comprehensive dataset than the NWIS function, in terms of earliest sample date on record, but that is not the case:

library(dataRetrieval)
library(dplyr)

# WQP functions
ny_WQP_read <- readWQPsummary(statecode = "36",CharacteristicName = "Phosphorus",siteType = "Stream")
min(ny_WQP_read$YearSummarized) # 1951

# NWIS functions
ny_NWIS_sites <- whatWQPsites(statecode="US:36",characteristicName = "Phosphorus", siteType="Stream")%>% 
  filter(ProviderName == "NWIS")%>% # need to filter NWIS site for whatNWISdata to execute
  mutate(site_cd = substring(MonitoringLocationIdentifier, 6)) # create new column with just the number code of the NWIS site to use in whatNWISdata

ny_NWIS_what <- whatNWISdata(siteNumber = ny_NWIS_sites$site_cd)
min(na.omit(ny_NWIS_what$begin_date)) # 1785...? this year should be equal to or greater than the WQP since this is only querying NWIS sites, where WQP querys all water quality databases?

length(unique(ny_NWIS_what$site_no))
length(unique(ny_WQP_read$MonitoringLocationIdentifier)) # at least the WQP function gives more sites

This may seem trivial but I am just curious, I really do appreciate all of your help, thank you

ldecicco-USGS commented 1 year ago

The function whatNWISdata will give you more information about ALL the USGS data, not just discrete water-quality. So even though you are inputting the distinct USGS sites that came back from the whatWQPsites, you are getting information about all sorts of other data: daily (discharge or other), instantaneous, peak, etc.... So to make a good comparison of the 2 outputs, filter down to just the data_type_cd == "qw" and then the pcode associated with phosphorus:

# NWIS functions
ny_NWIS_sites <- whatWQPsites(statecode="US:36",characteristicName = "Phosphorus", siteType="Stream")%>% 
  filter(ProviderName == "NWIS")%>% # need to filter NWIS site for whatNWISdata to execute
  mutate(site_cd = substring(MonitoringLocationIdentifier, 6)) # create new column with just the number code of the NWIS site to use in whatNWISdata

ny_NWIS_what <- whatNWISdata(siteNumber = ny_NWIS_sites$site_cd)

# Check out all the different kind of returned data!
pcodes <- readNWISpCode(unique(ny_NWIS_what$parm_cd))

# narrow down to phos:
phos_pcodes <- pcodes |> 
  filter(grepl("Phosphorus", parameter_nm))

# probably you want 00665, that's the most common pcode for phosphorus

#the only data that would match in NWIS and WQP is data_type_cd == "qw":
ny_NWIS_what_qw <- ny_NWIS_what |> 
  filter(data_type_cd == "qw",
           parm_cd == "00665")

old_site <- ny_NWIS_what_qw$site_no[which.min(ny_NWIS_what_qw$begin_date)]

old_site <- ny_NWIS_what_qw$site_no[which.min(ny_NWIS_what_qw$begin_date)]
# NWIS:
old_data_qw_NWIS <- readNWISqw(old_site, 
                          parameterCd = "all")
range(old_data_qw_NWIS$sample_dt)
[1] "1969-08-19" "1971-05-08"

# WQP summary:
ny_WQP_read1 <- readWQPsummary(siteid = paste0("USGS-", old_site))
min(ny_WQP_read1$YearSummarized) 
[1] 1969

# WQP data:
old_data_qw_WQP <- readWQPqw(paste0("USGS-", old_site), 
                             "Phosphorus")
range(old_data_qw_WQP$ActivityStartDate)
[1] "1969-08-19" "1971-05-08"

I think everything is consistent....let me know if that doesn't make sense.