DOI-USGS / dataRetrieval

This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.
https://doi-usgs.github.io/dataRetrieval/
Other
260 stars 84 forks source link

Query for UV between date ranges is shifted slightly from full day in local time #634

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

Describe the bug I am using the UV service to query sites from across the country. I supply a date range to the query expecting to get data back between those dates in local time. The data is returned as UTC but once you convert back to local time, I would expect to see all instantaneous data between 00:00:00 on the first date of the query and 23:59:59 on the last day. I was just digging into an issue I was seeing where I unexpectedly had data which spilled into the day after my specified dates (after converting to local time). I also noticed that I am missing data from the beginning of the query, too.

To Reproduce Steps to reproduce the behavior:

library(dataRetrieval)
data_query_all <- readNWISuv(siteNumbers = '373904118570701',
                             parameterCd = '72019',
                             startDate = '2022-09-06',
                             endDate = '2022-09-07') %>% 
  mutate(local_dateTime = format(dateTime, "%Y-%m-%d %H:%M:%S", tz="America/Los_Angeles", usetz=TRUE)) %>% 
  arrange(local_dateTime)

data_query_all %>% head(10)
data_query_all %>% tail(15)

Console output:

> data_query_all %>% head(10)
   agency_cd         site_no            dateTime X_72019_00000 X_72019_00000_cd tz_cd          local_dateTime
1       USGS 373904118570701 2022-09-06 08:00:00        333.45                P   UTC 2022-09-06 01:00:00 PDT
2       USGS 373904118570701 2022-09-06 08:05:00        333.45                P   UTC 2022-09-06 01:05:00 PDT
3       USGS 373904118570701 2022-09-06 08:10:00        333.45                P   UTC 2022-09-06 01:10:00 PDT
4       USGS 373904118570701 2022-09-06 08:15:00        333.45                P   UTC 2022-09-06 01:15:00 PDT
5       USGS 373904118570701 2022-09-06 08:20:00        333.45                P   UTC 2022-09-06 01:20:00 PDT
6       USGS 373904118570701 2022-09-06 08:25:00        333.45                P   UTC 2022-09-06 01:25:00 PDT
7       USGS 373904118570701 2022-09-06 08:30:00        333.44                P   UTC 2022-09-06 01:30:00 PDT
8       USGS 373904118570701 2022-09-06 08:35:00        333.44                P   UTC 2022-09-06 01:35:00 PDT
9       USGS 373904118570701 2022-09-06 08:40:00        333.44                P   UTC 2022-09-06 01:40:00 PDT
10      USGS 373904118570701 2022-09-06 08:45:00        333.44                P   UTC 2022-09-06 01:45:00 PDT

> data_query_all %>% tail(15)
    agency_cd         site_no            dateTime X_72019_00000 X_72019_00000_cd tz_cd          local_dateTime
562      USGS 373904118570701 2022-09-08 06:45:00        333.50                P   UTC 2022-09-07 23:45:00 PDT
563      USGS 373904118570701 2022-09-08 06:50:00        333.50                P   UTC 2022-09-07 23:50:00 PDT
564      USGS 373904118570701 2022-09-08 06:55:00        333.50                P   UTC 2022-09-07 23:55:00 PDT
565      USGS 373904118570701 2022-09-08 07:00:00        333.50                P   UTC 2022-09-08 00:00:00 PDT
566      USGS 373904118570701 2022-09-08 07:05:00        333.50                P   UTC 2022-09-08 00:05:00 PDT
567      USGS 373904118570701 2022-09-08 07:10:00        333.50                P   UTC 2022-09-08 00:10:00 PDT
568      USGS 373904118570701 2022-09-08 07:15:00        333.50                P   UTC 2022-09-08 00:15:00 PDT
569      USGS 373904118570701 2022-09-08 07:20:00        333.50                P   UTC 2022-09-08 00:20:00 PDT
570      USGS 373904118570701 2022-09-08 07:25:00        333.49                P   UTC 2022-09-08 00:25:00 PDT
571      USGS 373904118570701 2022-09-08 07:30:00        333.49                P   UTC 2022-09-08 00:30:00 PDT
572      USGS 373904118570701 2022-09-08 07:35:00        333.49                P   UTC 2022-09-08 00:35:00 PDT
573      USGS 373904118570701 2022-09-08 07:40:00        333.49                P   UTC 2022-09-08 00:40:00 PDT
574      USGS 373904118570701 2022-09-08 07:45:00        333.49                P   UTC 2022-09-08 00:45:00 PDT
575      USGS 373904118570701 2022-09-08 07:50:00        333.49                P   UTC 2022-09-08 00:50:00 PDT
576      USGS 373904118570701 2022-09-08 07:55:00        333.49                P   UTC 2022-09-08 00:55:00 PDT

Expected behavior I would expect to have data ranging from 2022-09-06 00:00:00 to 2022-09-07 23:59:59 after converting from UTC to the local time of the gage using America/Los Angeles. Not ranging from 2022-09-06 01:00:00 to 2022-09-08 00:55:00, as I do above.

Screenshots If applicable, add screenshots to help explain your problem.

Session Info Please include your session info:

> devtools::session_info()
- Session info  -----------------------------------------------------------------------------------------------
 hash: busts in silhouette, mage: medium-light skin tone, hand with fingers splayed

 setting  value
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64 (build 19044)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/Chicago
 date     2022-09-12
 rstudio  2022.07.0+548 Spotted Wakerobin (desktop)
 pandoc   NA

- Packages ----------------------------------------------------------------------------------------------------
 package       * version date (UTC) lib source
 assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.1.1)
 cachem          1.0.6   2021-08-19 [1] CRAN (R 4.1.1)
 callr           3.7.0   2021-04-20 [1] CRAN (R 4.1.1)
 class           7.3-19  2021-05-03 [2] CRAN (R 4.1.1)
 classInt        0.4-3   2020-04-07 [1] CRAN (R 4.1.1)
 cli             3.1.0   2021-10-27 [1] CRAN (R 4.1.2)
 crayon          1.4.2   2021-10-29 [1] CRAN (R 4.1.2)
 dataRetrieval * 2.7.11  2022-02-18 [1] CRAN (R 4.1.3)
 DBI             1.1.1   2021-01-15 [1] CRAN (R 4.1.1)
 desc            1.4.0   2021-09-28 [1] CRAN (R 4.1.2)
 devtools        2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
 dplyr         * 1.0.7   2021-06-18 [1] CRAN (R 4.1.1)
 e1071           1.7-9   2021-09-16 [1] CRAN (R 4.1.1)
 ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.1.1)
 fansi           0.5.0   2021-05-25 [1] CRAN (R 4.1.1)
 fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
 fs              1.5.1   2021-11-30 [1] CRAN (R 4.1.1)
 generics        0.1.1   2021-10-25 [1] CRAN (R 4.1.2)
 glue            1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
 httr            1.4.2   2020-07-20 [1] CRAN (R 4.1.1)
 jsonlite        1.7.2   2020-12-09 [1] CRAN (R 4.1.1)
 KernSmooth      2.23-20 2021-05-03 [2] CRAN (R 4.1.1)
 lifecycle       1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
 magrittr        2.0.1   2020-11-17 [1] CRAN (R 4.1.1)
 memoise         2.0.1   2021-11-26 [1] CRAN (R 4.1.2)
 pillar          1.6.4   2021-10-18 [1] CRAN (R 4.1.2)
 pkgbuild        1.2.1   2021-11-30 [1] CRAN (R 4.1.1)
 pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.1.1)
 pkgload         1.2.4   2021-11-30 [1] CRAN (R 4.1.1)
 prettyunits     1.1.1   2020-01-24 [1] CRAN (R 4.1.1)
 processx        3.5.2   2021-04-30 [1] CRAN (R 4.1.1)
 proxy           0.4-26  2021-06-07 [1] CRAN (R 4.1.1)
 ps              1.6.0   2021-02-28 [1] CRAN (R 4.1.1)
 purrr           0.3.4   2020-04-17 [1] CRAN (R 4.1.1)
 R6              2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
 Rcpp            1.0.7   2021-07-07 [1] CRAN (R 4.1.1)
 remotes         2.4.2   2021-11-30 [1] CRAN (R 4.1.1)
 rlang           0.4.12  2021-10-18 [1] CRAN (R 4.1.2)
 rprojroot       2.0.2   2020-11-15 [1] CRAN (R 4.1.1)
 rstudioapi      0.13    2020-11-12 [1] CRAN (R 4.1.1)
 sessioninfo     1.2.1   2021-11-02 [1] CRAN (R 4.1.2)
 sf              1.0-8   2022-07-14 [1] CRAN (R 4.1.3)
 testthat        3.1.0   2021-10-04 [1] CRAN (R 4.1.2)
 tibble          3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
 tidyselect      1.1.1   2021-04-30 [1] CRAN (R 4.1.1)
 units           0.7-2   2021-06-08 [1] CRAN (R 4.1.1)
 usethis         2.1.3   2021-10-27 [1] CRAN (R 4.1.2)
 utf8            1.2.2   2021-07-24 [1] CRAN (R 4.1.1)
 vctrs           0.3.8   2021-04-29 [1] CRAN (R 4.1.1)
 withr           2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
 xml2            1.3.3   2021-11-30 [1] CRAN (R 4.1.1)

Additional context It almost looks like the data returned is shifted by an hour, which makes me think this is actually an issue with daylight savings - something with PST vs PDT? However, when I look at my values (console printouts above) from 1 AM to 1:30 AM on Sep 6th with NWISWeb (see screenshots below), they line up (at 1:25 AM the values equal 333.45 and at 1:30 AM they switch to 333.44; if the values were just shifted an hour, that switch in value would be the opposite direction and between :20 and :25, not :25 and :30). So, I am not sure if we are both wrong or what could be done about the query.

image

image

lindsayplatt commented 2 years ago

^ Noticing now that NWISWeb is also reporting in PDT (which is what my fxn, format(..., tz = 'America/Los_Angeles') converted to). However, the tz_cd associated with this gage is PST. So, does that mean the dateTimes returned from NWISWeb will always be PST no matter what time of year it is? If so, this is not really a bug but I may suggest we write a tweet or blog post sharing methods for these tricky timezone things. Sounds like I need a crosswalk function between the daylight savings tz codes and standard time. An argument to readNWISuv() to request that data be returned in the local timezone would be neat, too, but definitely a bigger effort.

ldecicco-USGS commented 2 years ago

Yeah, this is a weird site.

If you look at the raw data: https://waterservices.usgs.gov/nwis/iv/?site=373904118570701&format=waterml,1.1&ParameterCd=72019&startDT=2022-09-06&endDT=2022-09-07 Hidden in the metadata is:

<ns1:timeZoneInfo siteUsesDaylightSavingsTime="false">

Which means this site doesn't acknowledge daylight savings time. The first raw data to show up is: 2022-09-06T00:00:00.000-08:00

On Sept. 6 at midnight with an offset of 8 hours from UTC, that is 1am PDT (but since the sight doesn't use daylight savings, it's displaying it in PST).

Going to my trusty Wikipedia sites on timezones: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones It seems like there's not actually any official timezone that is only PST (they all include PDT).

Anyway...more importantly, there IS a timezone argument in readNWISuv:

data_query_all <- readNWISuv(siteNumbers = '373904118570701',
                             parameterCd = '72019',
                             startDate = '2022-09-06',
                             endDate = '2022-09-07', 
                             tz = "America/Los_Angeles") 
data_query_all$dateTime[1]
[1] "2022-09-06 01:00:00 PDT"

Normally I'd tell you to pick a timezone from that wikipedia page that doesn't have daylight savings (like in the east, "America/Jamaica", in the midwest "America/Guatemala") ...but the Pacific doesn't seem to have one!

There's a bit of discussion on time zones here: https://rconnect.usgs.gov/dataRetrieval/articles/tutorial.html#timetime-zone-discussion I also try to describe it a bit in the "tz" argument in the help file: https://rconnect.usgs.gov/dataRetrieval/reference/readNWISuv.html

I've run into a few sites over the years that don't use daylight savings even though they are in regions that use daylight savings....it's kind of always a pain.

Let me know what kind of info would be more helpful.

lindsayplatt commented 2 years ago

Thanks for all of this. I think it will take me a bit to dig through and see what I should use in my application. Definitely not a dataRetrieval bug, so feel free to close! I might just document any changes I make based on this here for others (and myself) to reference in the future.

There were a bunch of other sites that had this issue, too, so I need a pretty programmatic way to address it. I am testing out using this function to convert from a daylight time into savings time (and vice versa if necessary) based on the timezone abbreviation that is returned by readNWISsite() as the value for tz_desired. This is used after they are converted from UTC.

# Note that the output from this fxn will say 'PDT' but mean 'PST'
# because you can't have a timezone of 'PST' (it will convert to GMT, 
# even when using `lubridate::force_tz(., 'PST')`). This is used
# internally before dropping time and going to a day, so I am 
# accepting the risk.
adjust_for_daylight_savings <- function(posix_dates, tz_desired) {

  # To go from daylight time (DT) to standard time
  #  (ST), subtract an hour and vice versa.
  tz_conversion_xwalk <- tibble(
    from = c('DT', 'ST'),
    to = c('ST', 'DT'),
    conversion_sec = c(-3600, 3600)
  )

  tz_current <- unique(format(posix_dates, "%Z"))

  if(tz_current == tz_desired) {
    # Don't do anything if they already have the desired tz
    return(posix_dates)
  } else {

    # Use the last two characters in both the current and desired
    # timezones for matching with the conversion xwalk
    tz_conversion_now <- tibble(
      from = stringr::str_sub(tz_current, -2, -1),
      to = stringr::str_sub(tz_desired, -2, -1)
    )

    conversion_to_use <- tz_conversion_xwalk %>% 
      semi_join(tz_conversion_now, by = c('from', 'to')) %>% 
      pull(conversion_sec)

    return(posix_dates + conversion_to_use)
  }

}

# Using the function
tz_locale <- 'America/Los_Angeles'
tz_abbr <- readNWISsite('373904118570701') %>% pull(tz_cd)

data_query_all <- readNWISuv(siteNumbers = '373904118570701',
                             parameterCd = '72019',
                             startDate = '2022-09-06',
                             endDate = '2022-09-07')

format(data_query_all$dateTime, "%Y-%m-%d %H:%M:%S", 
       tz=tz_locale, usetz=TRUE) %>% 
  as.POSIXct(tz=tz_locale) %>% 
  adjust_for_daylight_savings(tz_desired = tz_abbr)
ldecicco-USGS commented 2 years ago

I think you probably need to test that code when the daylight savings times switch:

data_query_all <- readNWISuv(siteNumbers = '373904118570701',
                             parameterCd = '72019',
                             startDate = '2022-03-11',
                             endDate = '2022-03-13')

format(data_query_all$dateTime, "%Y-%m-%d %H:%M:%S", 
       tz=tz_locale, usetz=TRUE) %>% 
  as.POSIXct(tz=tz_locale) %>% 
  adjust_for_daylight_savings(tz_desired = tz_abbr)
Error in if (tz_current == tz_desired) { : the condition has length > 1
lindsayplatt commented 2 years ago

I found a solution where I changed adjust_for_daylight_savings() slightly that will work for any number of timezones at once. Again, this works for me because in the end I am summarizing to daily values and want my days to line up with the timezone that NWIS reports.

# Note that the output from this fxn will say 'PDT' but mean 'PST'
# because you can't have a timezone of 'PST' (it will convert to GMT, 
# even when using `lubridate::force_tz(., 'PST')`). This is used
# internally before dropping time and going to a day, so I am 
# accepting the risk.
adjust_for_daylight_savings <- function(posix_dates, tz_desired) {

  # To go from daylight time (DT) to standard time
  #  (ST), subtract an hour and vice versa. If the
  #  `from` and `to` values are the same, don't 
  #  change anything about the dates.
  tz_conversion_xwalk <- tibble(
    from = c('DT', 'ST', 'ST', 'DT'),
    to = c('ST', 'DT', 'ST', 'DT'),
    conversion_sec = c(-3600, 3600, 0, 0)
  )

  # There could be more than one timezone if the date range spans across
  # the standard to daylight savings switch. Thus, we should be able to 
  # convert each date independently (which happens in this piped sequence)
  tibble(
    in_dates = posix_dates,
    in_tz = format(posix_dates, "%Z")
  ) %>% 
    mutate(
      # Use the last two characters in both the current and desired
      # timezones for matching with the conversion xwalk
      from = stringr::str_sub(in_tz, -2, -1),
      to = stringr::str_sub(tz_desired, -2, -1)
    ) %>% 
    # Join in conversion xwalk
    left_join(tz_conversion_xwalk) %>%
    # Alter the date values to match the desired timezone.
    mutate(out_dates = in_dates + conversion_sec) %>% 
    # Pull out just the dates to return
    pull(out_dates)
}

Example where first date is missing one hour and last date spills over to the following day

library(dplyr)
library(dataRetrieval)

tz_locale <- 'America/Los_Angeles'
tz_abbr <- readNWISsite('373904118570701') %>% pull(tz_cd)

# Example when the query range is during daylight time and not
# savings time - the output times appear to be missing the
# first hour (see `head()` below) and spill over to the next
# day by an hour at the end (see `tail()` below).

data_query <- readNWISuv(siteNumbers = '373904118570701',
                         parameterCd = '72019',
                         startDate = '2022-09-06',
                         endDate = '2022-09-07')

data_query_fmt <- format(data_query$dateTime, "%Y-%m-%d %H:%M:%S", 
                         tz=tz_locale, usetz=TRUE) %>% 
  as.POSIXct(tz=tz_locale)

head(data_query_fmt, 3) # These values need to get switched to match the query range
tail(data_query_fmt, 3) # These values need to get switched to match the query range

Initial output, which is not going to work:

> head(data_query_fmt, 3) # These values need to get switched to match the query range
[1] "2022-09-06 01:00:00 PDT" "2022-09-06 01:05:00 PDT" "2022-09-06 01:10:00 PDT"
> tail(data_query_fmt, 3) # These values need to get switched to match the query range
[1] "2022-09-08 00:45:00 PDT" "2022-09-08 00:50:00 PDT" "2022-09-08 00:55:00 PDT"

Apply the adjust_for_daylight_savings() function.

# We can adjust these dates to appear as `PST` values, so that they line up
# with the queried days. Then, when we summarize per day, the outputs will
# match our intended date range.

data_query_adj <- data_query_fmt %>% 
  adjust_for_daylight_savings(tz_desired = tz_abbr)

head(data_query_adj, 3)
tail(data_query_adj, 3)

Output following the adjustment, where data falls in the appropriate date range:

> head(data_query_adj, 3)
[1] "2022-09-06 00:00:00 PDT" "2022-09-06 00:05:00 PDT" "2022-09-06 00:10:00 PDT"
> tail(data_query_adj, 3)
[1] "2022-09-07 23:45:00 PDT" "2022-09-07 23:50:00 PDT" "2022-09-07 23:55:00 PDT"

Example where the dates span daylight savings time

# This approach still works during daylight savings time, where PST to PDT switch:

data_query_switch <- readNWISuv(siteNumbers = '373904118570701',
                                parameterCd = '72019',
                                startDate = '2022-03-12',
                                endDate = '2022-03-13')

data_query_switch_fmt <- format(data_query_switch$dateTime, "%Y-%m-%d %H:%M:%S", 
                                tz=tz_locale, usetz=TRUE) %>% 
  as.POSIXct(tz=tz_locale)

head(data_query_switch_fmt, 3) # These are fine and don't need to get switched
tail(data_query_switch_fmt, 3) # These values need to switch to PST to match the query range

Initial, problematic data output:

> head(data_query_switch_fmt, 3) # These are fine and don't need to get switched
[1] "2022-03-12 00:00:00 PST" "2022-03-12 00:05:00 PST" "2022-03-12 00:10:00 PST"
> tail(data_query_switch_fmt, 3) # These values need to switch to PST to match the query range
[1] "2022-03-14 00:45:00 PDT" "2022-03-14 00:50:00 PDT" "2022-03-14 00:55:00 PDT"

Applying the adjust_for_daylight_savings() function:

data_query_switch_adj <- data_query_switch_fmt %>% 
  adjust_for_daylight_savings(tz_desired = tz_abbr)

head(data_query_switch_adj, 3)
tail(data_query_switch_adj, 3)

Appropriately adjusted dates:

> head(data_query_switch_adj, 3)
[1] "2022-03-12 00:00:00 PST" "2022-03-12 00:05:00 PST" "2022-03-12 00:10:00 PST"
> tail(data_query_switch_adj, 3)
[1] "2022-03-13 23:45:00 PDT" "2022-03-13 23:50:00 PDT" "2022-03-13 23:55:00 PDT"