DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/

Single site has duplicate data across HUC08s #592

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

Describe the bug

Site 05586300, which is in HUC 07130011 (verified using readNWISsite("05586300")), also has data returned for it when you query HUC 07130007, but only when that HUC is queried in the same call as 07130011.

To Reproduce

When you query ONLY HUC 07130011, you get one row of data back for that site on 2015-06-18. When you query ONLY HUC 07130007, you get no data back for this site on 2015-06-18. BUT if you query both HUCs in one call to readNWISdata(), it returns two rows of data for that one site on 2015-06-18. Neither row is a full duplicate of the row you get back from querying only HUC 07130011; each is a partial duplicate, with some values filled in and others NA.

library(dataRetrieval)
library(dplyr)

# Look at site metadata
readNWISsite("05586300") %>% select(site_no, huc_cd)

   site_no   huc_cd
1 05586300 07130011

# Query data for that site by itself on June 18, 2015
dataRetrieval::readNWISdata(siteNumber = "05586300", 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv")

  agency_cd  site_no   dateTime X_.HACH._63680_00001 X_.HACH._63680_00001_cd X_.YSI._63680_00002 X_.YSI._63680_00002_cd
1      USGS 05586300 2015-06-18                  260                       A                  63                      A
  X_.HACH._63680_00002 X_.HACH._63680_00002_cd X_.HACH._63680_00003 X_.HACH._63680_00003_cd tz_cd
1                   94                       A                  154                       A   UTC

# Now query for data for the HUC that this site is actually in and look only at the data for this site
dataRetrieval::readNWISdata(huc = "07130011", 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
  filter(site_no == "05586300")

  agency_cd  site_no   dateTime X_.HACH._63680_00001 X_.HACH._63680_00001_cd X_.YSI._63680_00002 X_.YSI._63680_00002_cd
1      USGS 05586300 2015-06-18                  260                       A                  63                      A
  X_.HACH._63680_00002 X_.HACH._63680_00002_cd X_.HACH._63680_00003 X_.HACH._63680_00003_cd tz_cd
1                   94                       A                  154                       A   UTC

# Now query for the two HUCs that both seem to contain this site
multi_hucs <- c("07130007", "07130011")
dataRetrieval::readNWISdata(huc = multi_hucs, 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
  filter(site_no == "05586300")

  agency_cd  site_no   dateTime X_.YSI._63680_00002 X_.YSI._63680_00002_cd X_.HACH._63680_00001 X_.HACH._63680_00001_cd
1      USGS 05586300 2015-06-18                  63                      A                   NA                    <NA>
2      USGS 05586300 2015-06-18                  NA                   <NA>                  260                       A
  X_.YSI._63680_00001 X_.YSI._63680_00001_cd X_.YSI._63680_00003 X_.YSI._63680_00003_cd X_.HACH._63680_00002 X_.HACH._63680_00002_cd
1                  NA                   <NA>                  NA                   <NA>                   94                       A
2                  NA                   <NA>                  NA                   <NA>                   NA                    <NA>
  X_.HACH._63680_00003 X_.HACH._63680_00003_cd tz_cd
1                  154                       A   UTC
2                  154                       A   UTC

# Why does this other HUC have data?? And why when you call it by itself does it return nothing (expected)
# but returns values when it was called alongside HUC 07130011 (unexpected)
dataRetrieval::readNWISdata(huc = "07130007", 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
  filter(site_no == "05586300")

 [1] agency_cd               site_no                 dateTime                X_.YSI._63680_00001     X_.YSI._63680_00001_cd
 [6] X_.HACH._63680_00001    X_.HACH._63680_00001_cd X_.YSI._63680_00002     X_.YSI._63680_00002_cd  X_.HACH._63680_00002   
[11] X_.HACH._63680_00002_cd X_.YSI._63680_00003     X_.YSI._63680_00003_cd  X_.HACH._63680_00003    X_.HACH._63680_00003_cd
[16] tz_cd                  
<0 rows> (or 0-length row.names)
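
One way to flag this kind of partial duplicate is to count rows per site and day. The sketch below is not part of dataRetrieval. It rebuilds the two rows from the multi-HUC output above as a toy data frame (value columns only, with the all-NA and `_cd` columns left out), and `first_non_na` is a hypothetical helper name. The collapse step assumes duplicated rows never disagree on a non-NA value, which holds for this example but is not guaranteed in general.

```r
# Toy copy of the two rows returned by the multi-HUC query above
# (value columns only; the *_cd and all-NA columns are omitted).
dup <- data.frame(
  site_no = c("05586300", "05586300"),
  dateTime = as.Date(c("2015-06-18", "2015-06-18")),
  X_.YSI._63680_00002 = c(63, NA),
  X_.HACH._63680_00001 = c(NA, 260),
  X_.HACH._63680_00002 = c(94, NA),
  X_.HACH._63680_00003 = c(154, 154),
  stringsAsFactors = FALSE
)

# Count rows per site + day; any key with more than one row is a duplicate
key <- paste(dup$site_no, dup$dateTime)
rows_per_key <- table(key)
dup_keys <- names(rows_per_key)[as.vector(rows_per_key) > 1]

# Hypothetical helper: first non-missing value in a vector (NA if none)
first_non_na <- function(x) x[!is.na(x)][1]

# Stopgap collapse: one row per site + day, keeping the first non-NA
# value in each column. Assumes the duplicated rows never conflict.
value_cols <- setdiff(names(dup), c("site_no", "dateTime"))
merged <- aggregate(dup[value_cols],
                    by = dup[c("site_no", "dateTime")],
                    FUN = first_non_na)
```

On this toy input, `dup_keys` names the one affected site-day and `merged` holds a single row for it. This only papers over the symptom on the client side; it does not explain why the second row appears.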

Expected behavior

I would expect these two commands to return exactly the same thing for this site, since the only difference is which HUCs are being queried and the site should exist in only one HUC:

# Just the one HUC that contains this site returns one row of data for this site + day
dataRetrieval::readNWISdata(huc = "07130011", 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
  filter(site_no == "05586300")

# Call two HUCs at once and get two rows of data for this site + day instead of just one row
multi_hucs <- c("07130007", "07130011")
dataRetrieval::readNWISdata(huc = multi_hucs, 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
  filter(site_no == "05586300")

Session Info

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dataRetrieval_2.7.10 dplyr_1.0.7 

Additional context

I would prefer that the resolution not be to call one HUC at a time. My workflow is set up to group HUCs so that it makes a smaller number of queries, each covering multiple HUCs at once.
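
A grouping like the one described here might look roughly like the sketch below. The batch size of 2 and every HUC code other than the two from this issue are placeholders chosen for illustration, not values taken from the workflow. The actual request is left as a comment because it needs network access.

```r
# HUCs to query; only the first two come from this issue,
# the rest are illustrative placeholders
hucs <- c("07130007", "07130011", "07130001", "07130002", "07130003")

# Hypothetical batch size (an assumption, not a documented limit)
batch_size <- 2

# Give each HUC a batch index (1, 1, 2, 2, 3, ...) and split on it
batch_id <- ceiling(seq_along(hucs) / batch_size)
huc_batches <- split(hucs, batch_id)

# Each element of huc_batches is a character vector that could be passed
# as the huc argument, in the same way as multi_hucs above, e.g.:
# results <- lapply(huc_batches, function(h) {
#   dataRetrieval::readNWISdata(huc = h,
#                               parameterCd = "63680",
#                               startDate = "2015-06-18",
#                               endDate = "2015-06-18",
#                               service = "dv")
# })
```

Splitting on a computed index keeps the original HUC order within and across batches, so results can be matched back to the input list.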

ldecicco-USGS commented 2 years ago

@lindsayplatt , install dataRetrieval from GitHub and try a few of your workflows. Let me know if anything seems funky.

remotes::install_github("USGS-R/dataRetrieval")

lindsayplatt commented 2 years ago

This command is only giving me one row of data now!

multi_hucs <- c("07130007", "07130011")
dataRetrieval::readNWISdata(huc = multi_hucs, 
                            parameterCd = "63680", 
                            startDate = "2015-06-18", 
                            endDate = "2015-06-18", 
                            service="dv") %>% 
    filter(site_no == "05586300")

  agency_cd  site_no   dateTime X_.YSI._63680_00002 X_.YSI._63680_00002_cd X_.HACH._63680_00001 X_.HACH._63680_00001_cd
1      USGS 05586300 2015-06-18                  63                      A                  260                       A
  X_.YSI._63680_00001 X_.YSI._63680_00001_cd X_.YSI._63680_00003 X_.YSI._63680_00003_cd X_.HACH._63680_00002
1                  NA                   <NA>                  NA                   <NA>                   94
  X_.HACH._63680_00002_cd X_.HACH._63680_00003 X_.HACH._63680_00003_cd tz_cd
1                       A                  154                       A   UTC

lindsayplatt commented 2 years ago

The values do match the ones I get when I query just the one HUC 👍 There are more columns than in the single-HUC result because I queried for more data and then just filtered on the site_no column.
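
A comparison like this can be scripted by restricting both results to the columns they share. The sketch below uses toy data frames rebuilt from the value columns printed above (the `_cd` and tz_cd columns are left out for brevity). The names `single`, `multi`, `shared`, and `extra` are illustrative, not from the thread.

```r
# Row returned by the single-HUC query (value columns only)
single <- data.frame(
  site_no = "05586300",
  dateTime = as.Date("2015-06-18"),
  X_.HACH._63680_00001 = 260,
  X_.YSI._63680_00002 = 63,
  X_.HACH._63680_00002 = 94,
  X_.HACH._63680_00003 = 154,
  stringsAsFactors = FALSE
)

# Row returned by the multi-HUC query after the fix; it carries two
# extra all-NA columns that come from sites in the other HUC
multi <- data.frame(
  site_no = "05586300",
  dateTime = as.Date("2015-06-18"),
  X_.YSI._63680_00002 = 63,
  X_.HACH._63680_00001 = 260,
  X_.YSI._63680_00001 = NA_real_,
  X_.YSI._63680_00003 = NA_real_,
  X_.HACH._63680_00002 = 94,
  X_.HACH._63680_00003 = 154,
  stringsAsFactors = FALSE
)

# Compare only the columns present in both results, in the same order
shared <- intersect(names(single), names(multi))
extra <- setdiff(names(multi), names(single))
same_values <- isTRUE(all.equal(single[shared], multi[shared]))
```

Here `extra` lists the columns that exist only in the multi-HUC result, and `same_values` reports whether the shared columns agree.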