EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/
Other
132 stars 12 forks source link

Add an interval counting function #57

Closed EdwinTh closed 5 years ago

EdwinTh commented 6 years ago

Oftentimes, data is stored with a start datetime and an end datetime. When we want to integrate this to a number of active count, a strategy might be

gather(d, day, -id) %>%
pad(interval = "day", break_above = 10^10, group = "id") %>% 
count(day)

However, this grinds to a halt when the number of records gets large.

An alternative approach would be:

 all_vals   <- as.numeric(min(day), max(day)) %>% as.numeric()
 days_present <- matrix(0, nrow(df), length(all_vals))
 for (i in 1:nrow(df)) {
     days_present[i, ] <- all_vals >= set[i, ]$first_day & all_vals <= set[i, ]$last_day
 }

 day_count <- colSums(days_present)
 day_set <- data_frame(day = seq(min(day), max(day), by = "day"),
                                      cnt = day_count)

Some rudimentary tests already indicated this is more promising. Maybe there is even a faster way, moving the loop over to Rcpp is probably the best way to go.

EdwinTh commented 6 years ago

Some code I wrote to solve a problem at work. Not general enough yet, but pretty performant already

day_count <- function(df, 
                      first_day, 
                      last_day,
                      group_var) {
  df %>% split(pull(df, !!enquo(group_var))) %>% 
    map(day_count_single_group, enquo(first_day), enquo(last_day)) %>% 
    map2_df(.x = ., .y = names(.), function(x, nm) x %>% mutate(group = nm)) %>% 
    tidyr::complete(day, group) %>% 
    padr::fill_by_value() %>% 
    arrange(group)
}

day_count_single_group <- function(df, first_day, last_day) {
  first_day_vec <- select(df, !!first_day) %>% pull() %>% 
    anytime::anydate() %>% as.numeric()
  last_day_vec  <- select(df, !!last_day)  %>% pull() %>% 
    anytime::anydate() %>% as.numeric()
  range <- min(first_day_vec):max(last_day_vec)
  counts <- map_int(range, 
                    ~sum(first_day_vec <= .x & last_day_vec >= .x))
  data_frame(day   = anytime::anydate(range),
             count = counts)
}
EdwinTh commented 5 years ago

So I completely forgot about difftime, which does this. However, it does so for whole units only and not for year, quarter and month. So we could do a general purpose API, that is partially a wrapper around difftime.

EdwinTh commented 5 years ago

Further research, lubridate does not seem to accommodate for it either, so there is value in creating a wrapper API. Here a quick sketch for diff_year

library(lubridate)
diff_year <- function(x1 = ymd(20180715), 
                      x2 = ymd(20090811)) {
  ## nr of full years
  get_full_years(x1, x2) + get_remainder_years(x1, x2)
}

get_full_years <- function(x1,
                           x2) {
  length(x1 - seq(x2, x1, by = "year")) - 1
}

get_remainder_years <- function(x1, 
                                x2) {
  days_left_last_year <- tail(x1 - seq(x2, x1, by = "year"), 1)
  # need this to compensate for leap years
  x1_year_earlier <- x1
  lubridate::year(x1_year_earlier) <- lubridate::year(x1) - 1
  as.numeric(days_left_last_year / 
               length(seq(x1_year_earlier, x1, by = "day")))
}
EdwinTh commented 5 years ago

The other cases (year, month) appear to be covered in length_time. Although I would have done it differently it is undesired to create something alike.