DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/

Max number of stations #304

Closed: SSSCharmCity closed this issue 7 years ago

SSSCharmCity commented 7 years ago

Hi, I am using readNWISsite to retrieve gauge info by HUC02, but I run into a variety of errors associated with upper bounds on the number of gauges retrieved. For example, for HUC 01 I have a list of 739 site numbers, but if I try to retrieve more than 708 of them (regardless of the order) I get a [414] Request-URI Too Long message.

Are maximum retrieval limits documented? Is the upper limit on the number of stations or the number of bytes?

ldecicco-USGS commented 7 years ago

There is a maximum number only because there is a maximum limit to how long a URL can be. When you specify getting the data with a list of sites, dataRetrieval just chains all the sites together into a single URL call. I can't tell you the maximum number of sites because sites have different numbers of digits.

You have a couple of options, though. You could chunk up the sites in groups of 100 (that's a pretty safe bet). I'd use the bind_rows function in dplyr to just append the new retrievals into a single data frame.
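
A minimal sketch of that chunking approach, assuming site_ids is a vector holding your full list of site numbers:

library(dataRetrieval)
library(dplyr)

# Split the (hypothetical) site_ids vector into chunks of at most 100 sites,
# retrieve each chunk separately, and append the results with bind_rows.
chunks <- split(site_ids, ceiling(seq_along(site_ids) / 100))
siteInfo <- bind_rows(lapply(chunks, readNWISsite))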

Another option is to use the function readNWISdata, and simply ask for the full set of HUC02 data. If you post your readNWISsite call, I can help you figure out the readNWISdata call you would need.

SSSCharmCity commented 7 years ago

I appreciate the limits you describe (and why the number of gauges is indeterminate).

I first encountered the problem with the following initial call, specifying huc instead of stateCd:

siteData <- whatNWISsites(stateCd = "NC",
                          # huc = "01",
                          agencyCd = "USGS",
                          parameterCd = "00060",
                          hasDataTypeCd = "dv")

If I may, I ran into a few other surprises with dataRetrieval calls. For example, different calls return stations in different orders; this was easily handled once I realized what was happening, but I kept appending drainage area (DA) from one retrieval to station info from another retrieval without realizing they were in different station orders.

Curious if this too sounds familiar, or if I am just "detecting" symptoms of a more basic problem in my use of the dataRetrieval calls?

Just wondering if those who have preceded me in doing regional/national analysis may have documented useful screens or other QA/QC checks for assembling large consistent sets of gauges?

FYI, as an example,

I use whatNWISsites to get an initial list of USGS sites in NC with daily average discharge (returns a total of 546 station numbers, not strictly sorted).

I extract the station IDs and use readNWISsite to get info for these sites, including drainage area (returns 546 stations in decreasing station_no order).

I subset the data to eliminate DA = NA and non-USGS stations, and now have a list of 535 station IDs.

Finally, I use whatNWISdata to retrieve information for my revised station list, and I get 537 stations. The two extra station records correspond to identical station_nos: one for published data and the other for DCP data.
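
For concreteness, the sequence looks roughly like this (from memory, so exact argument names may be off):

library(dataRetrieval)

# Step 1: initial list of NC sites with daily discharge (546 stations).
nc_sites <- whatNWISsites(stateCd = "NC",
                          parameterCd = "00060",
                          hasDataTypeCd = "dv")

# Step 2: expanded site info, including drainage area (546 stations).
site_info <- readNWISsite(nc_sites$site_no)

# Step 3: drop sites with missing drainage area or non-USGS agency codes
# (535 stations).
keep <- subset(site_info, !is.na(drain_area_va) & agency_cd == "USGS")

# Step 4: data availability for the reduced list; duplicate station_nos
# appear here (537 records, e.g. published vs. DCP data).
data_avail <- whatNWISdata(siteNumber = keep$site_no,
                           parameterCd = "00060",
                           service = "dv")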

I'm not grousing ;-). It's just that I encounter one or two unanticipated "special situations" every time I extend my analysis to a new state (or new HUC, etc.). After manually figuring out the cause of each unexpected glitch (which disrupts my crude, fledgling code), I am building up an ad hoc set of "special situation rules" that will, I hope, eventually serve collectively as improved QA/QC screening for assembling a large set of gauges for analysis. Nothing new about that either! ;-)

Just wondering if those who have gone through this same exercise before me, might have documented the oddball records in NWIS (like a gauge that was moved downstream due to construction, that now has two entries under the same station_no with same period of record, but data in different periods).

Sorry for the verbose, "piggy-backed" questions.

Thanks again for any help or suggestions you can offer.

Best, stu

ldecicco-USGS commented 7 years ago

I'll try to answer a few questions one at a time.

If you are using this query:

siteData <- whatNWISsites(huc = "02",
                          agencyCd = "USGS",
                          parameterCd = "00060",
                          hasDataTypeCd = "dv")

You could get all the data with this request:

discharge <- readNWISdata(service = "dv",
                          parameterCd = "00060",
                          huc = "02",
                          startDate = "2017-01-01",
                          endDate = "2017-02-05")

I added a start and end date to make the result not so enormous. If you are only interested in a particular time period, adding those dates will help a lot. (Note: the above query took a while to retrieve for me, but ended up not being too big.)

You can use the readNWISdata function with the argument service="dv" and any query parameter described here:

https://waterservices.usgs.gov/rest/DV-Test-Tool.html

If there's any filtering that can be done ahead of your query, I'd recommend using a bit of dplyr to narrow down the sites. You can use readNWISdata with service="site" and seriesCatalogOutput=TRUE (that call produces the siteData used below; the full script is in the next comment). Now we have a lot more information about the data. Say we only want sites whose records extend at least to the end of 2016, with at least 10000 data points and no breaks in the data:

library(dplyr)

subSites <- siteData %>%
  filter(parm_cd == "00060") %>%                # daily discharge parameter
  filter(data_type_cd == "dv") %>%              # daily-values service
  filter(count_nu >= 10000) %>%                 # at least 10000 daily values
  mutate(end_date = as.Date(end_date),
         begin_date = as.Date(begin_date)) %>%
  filter(end_date >= as.Date("2016-12-31")) %>% # record extends to end of 2016
  filter(as.integer(end_date) - as.integer(begin_date) == count_nu) # no gaps

nrow(subSites)
#[1] 30

So, now we only need to ask for 30 sites, a pretty reasonable query.

ldecicco-USGS commented 7 years ago

It is true that it is "par for the course" for queries to come back with different station orders. Also, I would not assume the columns will always come back in the same order. My recommendation is to use the left_join function from dplyr:

library(dataRetrieval)
library(dplyr)

siteData <- readNWISdata(service="site",
                         huc = "02",
                         agencyCd = "USGS",
                         parameterCd="00060",
                         hasDataTypeCd="dv",
                         seriesCatalogOutput=TRUE)

subSites <- siteData %>%
  filter(parm_cd == "00060") %>%
  filter(data_type_cd == "dv") %>%
  filter(count_nu >= 10000) %>%
  mutate(end_date = as.Date(end_date),
         begin_date = as.Date(begin_date)) %>%
  filter(end_date >= as.Date("2016-12-31")) %>%
  filter(as.integer(end_date) - as.integer(begin_date) == count_nu)

discharge <- readNWISdv(subSites$site_no, "00060", "2000-01-01", "2016-12-31")

expanded_sites <- readNWISsite(subSites$site_no)

discharge <- discharge %>%
  left_join(select(expanded_sites, site_no, drain_area_va), by="site_no")

(So the mutate function is just dplyr's way of adding a new column, and select is just grabbing a subset of columns from the expanded_sites data frame. Now the discharge data frame has a drainage area column.)

ldecicco-USGS commented 7 years ago

Finally (for now...)....I hear you, there is a TON of "special snowflake" data in NWIS. Unfortunately (or fortunately?) that's what you get with such an immense set of data that has been collected for over 100 years.

ldecicco-USGS commented 7 years ago

Oh, and I could add a site_no sort to whatNWISsites. But I'd still always recommend left_join over cbind. So much safer, and very efficient. (There are even more efficient methods if you want to start using the data.table package; dplyr's just my go-to package.)
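
A toy illustration (made-up data) of why left_join is safer than cbind when two retrievals come back in different row orders:

library(dplyr)

# Three hypothetical sites, returned in different orders by two retrievals.
flows <- data.frame(site_no = c("01", "02", "03"),
                    flow = c(10, 20, 30))
areas <- data.frame(site_no = c("03", "01", "02"),
                    drain_area_va = c(300, 100, 200))

cbind(flows, areas)                       # silently pairs mismatched rows
left_join(flows, areas, by = "site_no")   # matches rows by site_no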

SSSCharmCity commented 7 years ago

Hi Laura:

Many thanks!

These are "golden nuggets" for me!

Much appreciated!

Best, stu
