EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

rounding options in thicken #65

Closed. stribstrib closed this issue 5 years ago

stribstrib commented 5 years ago

I've recently run into an issue rounding up with thicken(), and am wondering if another rounding option could be added, or perhaps a solution exists elsewhere within padr.

Some sample data with a base time interval of 15 minutes:

library(dplyr)
library(padr)

flow.data <- 
  tibble::tribble(
    ~DATETIME, ~RECORD, ~Flow_1, ~Flow_2, ~Flow_3,
    "2016-05-30 16:00:00",      0L,   235.3,   257.7,       0,
    "2016-05-30 16:15:00",      1L,   219.6,   266.4,       0,
    "2016-05-30 16:30:00",      2L,   236.3,   274.4,   588.6,
    "2016-05-30 16:45:00",      3L,   233.8,   165.1,   387.6,
    "2016-05-30 17:00:00",      4L,   190.3,     118,   223.5,
    "2016-05-30 17:15:00",      5L,   157.5,   114.5,       0,
    "2016-05-30 17:30:00",      6L,   150.7,   144.5,       0,
    "2016-05-30 17:45:00",      7L,     138,   153.7,   686.5,
    "2016-05-30 18:00:00",      8L,   127.3,   128.8,   331.1,
    "2016-05-30 18:15:00",      9L,   119.6,   103.7,   263.5
  ) %>% 
  # thicken() needs a datetime column, so parse the character timestamps first
  mutate(DATETIME = as.POSIXct(DATETIME))

When rounding down to the nearest hour, thicken maps the input variable to the closest value of the spanned variable that is less than or equal to itself. Records that land exactly on the hour keep their original timestamp. This makes intuitive sense to me.

#Rounding down
flow.data %>% 
  thicken(interval = "hour", rounding = "down") %>% 
  select(DATETIME, DATETIME_hour, everything())
# A tibble: 10 x 6
   DATETIME            DATETIME_hour       RECORD Flow_1 Flow_2 Flow_3
   <dttm>              <dttm>               <int>  <dbl>  <dbl>  <dbl>
 1 2016-05-30 16:00:00 2016-05-30 16:00:00      0   235.   258.     0 
 2 2016-05-30 16:15:00 2016-05-30 16:00:00      1   220.   266.     0 
 3 2016-05-30 16:30:00 2016-05-30 16:00:00      2   236.   274.   589.
 4 2016-05-30 16:45:00 2016-05-30 16:00:00      3   234.   165.   388.
 5 2016-05-30 17:00:00 2016-05-30 17:00:00      4   190.   118    224.
 6 2016-05-30 17:15:00 2016-05-30 17:00:00      5   158.   114.     0 
 7 2016-05-30 17:30:00 2016-05-30 17:00:00      6   151.   144.     0 
 8 2016-05-30 17:45:00 2016-05-30 17:00:00      7   138    154.   686.
 9 2016-05-30 18:00:00 2016-05-30 18:00:00      8   127.   129.   331.
10 2016-05-30 18:15:00 2016-05-30 18:00:00      9   120.   104.   264.

When rounding up to the nearest hour, records that land exactly on the hour are mapped one full interval ahead of their original timestamp. This surprised me and seems counter-intuitive. I did see that this is explained in the padr implementation vignette, which specifies that rounding up maps the input variable to the closest value greater than (but not equal to) itself.

#Rounding up
flow.data %>% 
  thicken(interval = "hour", rounding = "up") %>% 
  select(DATETIME, DATETIME_hour, everything())
# A tibble: 10 x 6
   DATETIME            DATETIME_hour       RECORD Flow_1 Flow_2 Flow_3
   <dttm>              <dttm>               <int>  <dbl>  <dbl>  <dbl>
 1 2016-05-30 16:00:00 2016-05-30 17:00:00      0   235.   258.     0 
 2 2016-05-30 16:15:00 2016-05-30 17:00:00      1   220.   266.     0 
 3 2016-05-30 16:30:00 2016-05-30 17:00:00      2   236.   274.   589.
 4 2016-05-30 16:45:00 2016-05-30 17:00:00      3   234.   165.   388.
 5 2016-05-30 17:00:00 2016-05-30 18:00:00      4   190.   118    224.
 6 2016-05-30 17:15:00 2016-05-30 18:00:00      5   158.   114.     0 
 7 2016-05-30 17:30:00 2016-05-30 18:00:00      6   151.   144.     0 
 8 2016-05-30 17:45:00 2016-05-30 18:00:00      7   138    154.   686.
 9 2016-05-30 18:00:00 2016-05-30 19:00:00      8   127.   129.   331.
10 2016-05-30 18:15:00 2016-05-30 19:00:00      9   120.   104.   264.

Instead, I'm hoping for a way to use thicken() and get results like this:

# A tibble: 10 x 6
   DATETIME            DATETIME_hour       RECORD Flow_1 Flow_2 Flow_3
   <dttm>              <dttm>               <int>  <dbl>  <dbl>  <dbl>
 1 2016-05-30 16:00:00 2016-05-30 16:00:00      0   235.   258.     0 
 2 2016-05-30 16:15:00 2016-05-30 17:00:00      1   220.   266.     0 
 3 2016-05-30 16:30:00 2016-05-30 17:00:00      2   236.   274.   589.
 4 2016-05-30 16:45:00 2016-05-30 17:00:00      3   234.   165.   388.
 5 2016-05-30 17:00:00 2016-05-30 17:00:00      4   190.   118    224.
 6 2016-05-30 17:15:00 2016-05-30 18:00:00      5   158.   114.     0 
 7 2016-05-30 17:30:00 2016-05-30 18:00:00      6   151.   144.     0 
 8 2016-05-30 17:45:00 2016-05-30 18:00:00      7   138    154.   686.
 9 2016-05-30 18:00:00 2016-05-30 18:00:00      8   127.   129.   331.
10 2016-05-30 18:15:00 2016-05-30 19:00:00      9   120.   104.   264.

The rounding-up behaviour I'm looking for is in keeping with the way dataloggers (at least the ones I'm familiar with) summarize their time series, where a timestamp represents the end of the period being summarized. In this sample data, for instance, the record with an original timestamp of 17:00:00 (row 5) summarizes data collected between 16:45:01 and 17:00:00. Linking these data to a timestamp of 18:00:00 would significantly misrepresent when they were actually collected.

Would it be possible to create a rounding option that would map the input value to the nearest interval greater than or equal to itself?
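
For what it's worth, a possible workaround outside of padr seems to be lubridate's ceiling_date(), which with change_on_boundary = FALSE leaves timestamps that already sit on a boundary unchanged. A rough sketch (assuming lubridate is an acceptable extra dependency), which should reproduce the mapping in the table above:

# Workaround sketch, not a padr feature: ceiling_date() rounds up to the next
# hour but keeps on-the-hour timestamps as they are.
library(lubridate)

flow.data %>% 
  mutate(DATETIME_hour = ceiling_date(DATETIME, unit = "hour",
                                      change_on_boundary = FALSE)) %>% 
  select(DATETIME, DATETIME_hour, everything())

Still, it would of course be nicer to stay within thicken().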

(Brand new to GitHub, so apologies if I over-explained or committed some other gaffe.)

EdwinTh commented 5 years ago

Thank you for this detailed and very well-explained issue.

I cannot remember exactly why I chose to implement it this way; I think I wanted rounding down and rounding up to be complementary. But your example shows that it indeed also makes sense to round to the equal-or-larger value when rounding up. I will think about how to include that option while keeping the function backwards compatible (meaning it would be available, but not the default).
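
To make sure I understand the requested mapping, here is a rough standalone sketch (plain R, not padr code, and only for whole-hour offsets from UTC) of rounding up to the equal-or-larger hour:

# Sketch only: map each POSIXct value to the next hour boundary, unless it
# already falls exactly on one.
round_up_or_equal_hour <- function(x) {
  secs <- as.numeric(x)
  as.POSIXct(ceiling(secs / 3600) * 3600,
             origin = "1970-01-01", tz = attr(x, "tzone"))
}

round_up_or_equal_hour(as.POSIXct("2016-05-30 17:00:00", tz = "UTC"))  # stays 17:00:00
round_up_or_equal_hour(as.POSIXct("2016-05-30 17:15:00", tz = "UTC"))  # becomes 18:00:00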

Thanks again.