MichaelChirico / funchir

R package of convenient functions
9 stars 2 forks source link

Feature/Compatibility Request: Arrow and/or DuckDB support for `get_age` #18

Open TPDeramus opened 1 month ago

TPDeramus commented 1 month ago

Hi Michael.

Thanks for sharing your utility functions, get_age in particular has come in extremely handy for some of the data I've been working with where age needs to be more precise than the year rounded down.

However, a lot of the data I'm working with happens to be part of VERY large datasets that need to be loaded then mutated in arrow or duckdb tables to work even remotely on a decent scale.

I believe it would be possible to do so using arrow by defining it as a function using register_scalar_function: https://arrow.apache.org/docs/dev/r/reference/register_scalar_function.html

And I had some success with getting it to run this way, but I think the data.table and foverlaps requirements are inadvertently pulling it into R or running all columns at once in the dplyr call and slowing it down in some way.

Would you happen to have some experience with the arrow package to the degree that you might be able to provide some suggestions on to how this could be done for get_age()?

This is hanging just like converting the data to a tibble() and running it in R because there's so much of it:

register_scalar_function(
  "get_age_arrow",
  function(context, birthdays, ref_dates) {
  x <- data.table(bday <- unclass(birthdays),
                  #rem: how many days has it been since the lapse of the
                  #  most recent quadrennium since your birth?
                  rem = ((ref <- unclass(ref_dates)) - bday) %% 1461)
  #cycle_type: which of the four years following your birthday
  #  was the one that had 366 days? 
  x[ , cycle_type := 
       foverlaps(data.table(start = bdr <- bday %% 1461L, end = bdr),
                 #these intervals were calculated by hand;
                 #  e.g., 59 is Feb. 28, 1970. I made the judgment
                 #  call to say that those born on Feb. 29 don't
                 #  have their "birthday" until the following March 1st.
                 data.table(start = c(0L, 59L, 424L, 790L, 1155L), 
                            end = c(58L, 423L, 789L, 1154L, 1460L), 
                            val = c(3L, 2L, 1L, 4L, 3L),
                            key = "start,end"))$val]
  I4 <- diag(4L)[ , -4L] #for conciseness below
  #The `by` approach might seem a little abstruse for those
  #  not familiar with `data.table`; see the edit history
  #  for a more palatable version (which is also slightly slower)
  x[ , extra := 
       foverlaps(data.table(start = rem, end = rem),
                 data.table(start = st <- cumsum(c(0L, rep(365L, 3L) +
                                                     I4[.BY[[1L]],])),
                            end = c(st[-1L] - 1L, 1461L),
                            int_yrs = 0:3, key = "start,end")
       )[ , int_yrs + (i.start - start) / (end + 1L - start)], by = cycle_type]
  #grand finale -- 4 years for every quadrennium, plus the fraction:
  4L * ((ref - bday) %/% 1461L) + x$extra
},
  in_type = schema(birthdays = date32(), ref_dates = date32()),
  out_type = float64(),
  auto_convert = TRUE
)

Thanks in advance!