Closed EdwinTh closed 5 years ago
Some code I wrote to solve a problem at work. Not general enough yet, but pretty performant already
day_count <- function(df,
first_day,
last_day,
group_var) {
df %>% split(pull(df, !!enquo(group_var))) %>%
map(day_count_single_group, enquo(first_day), enquo(last_day)) %>%
map2_df(.x = ., .y = names(.), function(x, nm) x %>% mutate(group = nm)) %>%
tidyr::complete(day, group) %>%
padr::fill_by_value() %>%
arrange(group)
}
day_count_single_group <- function(df, first_day, last_day) {
first_day_vec <- select(df, !!first_day) %>% pull() %>%
anytime::anydate() %>% as.numeric()
last_day_vec <- select(df, !!last_day) %>% pull() %>%
anytime::anydate() %>% as.numeric()
range <- min(first_day_vec):max(last_day_vec)
counts <- map_int(range,
~sum(first_day_vec <= .x & last_day_vec >= .x))
data_frame(day = anytime::anydate(range),
count = counts)
}
So I completely forgot about difftime
, which does this. However, it does so for whole units only and not for year, quarter and month. So we could do a general purpose API, that is partially a wrapper around difftime
.
Further research, lubridate
does not seem to accommodate for it either, so there is value in creating a wrapper API. Here a quick sketch for diff_year
library(lubridate)
diff_year <- function(x1 = ymd(20180715),
x2 = ymd(20090811)) {
## nr of full years
get_full_years(x1, x2) + get_remainder_years(x1, x2)
}
get_full_years <- function(x1,
x2) {
length(x1 - seq(x2, x1, by = "year")) - 1
}
get_remainder_years <- function(x1,
x2) {
days_left_last_year <- tail(x1 - seq(x2, x1, by = "year"), 1)
# need this to compensate for leap years
x1_year_earlier <- x1
lubridate::year(x1_year_earlier) <- lubridate::year(x1) - 1
as.numeric(days_left_last_year /
length(seq(x1_year_earlier, x1, by = "day")))
}
The other cases (year, month) appear to be covered in length_time
. Although I would have done it differently it is undesired to create something alike.
Oftentimes, data is stored with a start datetime and an end datetime. When we want to integrate this to a number of active count, a strategy might be
However, this grinds to a halt when the number of records gets large.
An alternative approach would be:
Some rudimentary tests already indicated this is more promising. Maybe there is even a faster way, moving the loop over to Rcpp is probably the best way to go.