Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland
https://public-health-scotland.github.io/phsmethods/

Improve `create_age_groups` #135

Open · Nic-Chr opened this issue 3 days ago

Nic-Chr commented 3 days ago

Hi, I think we can improve the speed of `create_age_groups` quite a bit and also remove the dependency on the 'utils' package.

We can avoid `cut()`, which is inefficient at creating factors because it goes through unnecessary `unique()` + `match()` steps internally. Since we already have cleaned age breaks that are unique and sorted, we can skip `cut()` and call `.bincode()` directly. `.bincode()` is essentially a low-level factor-code constructor, and it is what `cut()` uses under the hood anyway. To get a character vector, all that's needed is to subset our age breaks by the bin codes.
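
To make the idea concrete, here is a minimal sketch using only base R (toy breaks, not the package's defaults):

# .bincode() returns the integer codes that cut() computes internally,
# without the factor-building overhead
breaks <- c(0, 5, 10, 15)
x <- c(3, 7, 12)

codes <- .bincode(x, breaks = breaks, right = FALSE)
codes
#> [1] 1 2 3

# Subsetting the band labels by the codes gives the character output directly
bands <- paste0(breaks[-length(breaks)], "-", breaks[-1] - 1L)
bands[codes]
#> [1] "0-4"   "5-9"   "10-14"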

On the topic of `cut()` inefficiency, there is a Stack Overflow thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved

Proposed function and benchmark:

create_age_groups <- function(x, from = 0, to = 90, by = 5, as_factor = FALSE) {

  if (!is.numeric(x)) {
    cli::cli_abort("{.arg x} must be a {.cls numeric} vector, not a {.cls {class(x)}} vector.")
  }

  # Cleaned breaks: unique and sorted, as .bincode() requires
  breaks <- seq(from, to, by)
  breaks <- sort(unique(breaks))
  n_breaks <- length(breaks)
  n <- max(n_breaks - 1L, 0L)

  # Band labels, e.g. "0-4", "5-9", ..., "85-89"
  bands <- paste0(
    breaks[seq_len(n)], "-",
    breaks[seq.int(to = n_breaks, length.out = n)] - 1L
  )

  # Open-ended final band, e.g. "90+"
  rightmost_band <- paste0(breaks[n_breaks], "+")
  bands[n_breaks] <- rightmost_band

  # Integer bin codes; the Inf upper break catches everything >= max(breaks)
  codes <- .bincode(x, breaks = c(breaks, Inf), right = FALSE)

  if (as_factor) {
    # Construct the factor directly from the codes, skipping factor()'s
    # unique() + match() overhead
    out <- codes
    levels(out) <- bands
    class(out) <- c("ordered", "factor")
  } else {
    out <- bands[codes]
  }
  out
}
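
For reference, the edge-case behaviour follows directly from this definition: NAs and ages below `from` fall outside every bin and come back as NA, while ages at or above `to` land in the open-ended band.

# NA and -1 have no bin; 90 and 120 both fall in "90+"
create_age_groups(c(NA, -1, 0, 89, 90, 120))
#> [1] NA      NA      "0-4"   "85-89" "90+"   "90+"
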
library(bench)

x <- 20
create_age_groups(x)
#> [1] "20-24"
phsmethods::create_age_groups(x)
#> [1] "20-24"

mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)               98.6µs   103µs     9202.    2.41KB    10.4 
#> 2 phsmethods::create_age_groups(x)  155.5µs   168µs     5697.    3.84KB     8.30
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 101µs  106µs     8998.    2.71KB    12.7 
#> 2 phsmethods::create_age_groups(x, as… 153µs  162µs     5868.    3.84KB     8.30

x <- sample(0:100, 10^7, replace = TRUE)
mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)                414ms   416ms      2.40     191MB     2.40
#> 2 phsmethods::create_age_groups(x)    701ms   701ms      1.43     420MB     4.28
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 361ms  361ms      2.77     153MB     2.77
#> 2 phsmethods::create_age_groups(x, as… 654ms  654ms      1.53     343MB     1.53

Created on 2024-09-19 with reprex v2.0.2

This obviously relates to issues #93 and #54, which I also think are worthwhile, but as a subsequent step.

Moohan commented 3 days ago

I'm always keen on speed/memory improvements (especially ones that hold up on larger datasets). When I've made changes like this to other functions in the past, my usual approach has been to start by expanding the tests for the function(s) so I'm 100% sure there are no unintended regressions or behaviour changes.
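
As a rough sketch of the kind of equivalence test I mean (assuming testthat, with the proposed function from the comment above loaded alongside the released one):

library(testthat)

# Run the proposed version against the released one over awkward inputs
test_that("proposed create_age_groups matches current behaviour", {
  x <- c(NA, -1, 0, 4, 5, 89, 90, 120)

  expect_identical(
    create_age_groups(x),
    phsmethods::create_age_groups(x)
  )

  new_f <- create_age_groups(x, as_factor = TRUE)
  old_f <- phsmethods::create_age_groups(x, as_factor = TRUE)

  # Compare labels and level order separately; whether the result should be
  # an ordered factor is worth deciding explicitly if the versions differ
  expect_identical(as.character(new_f), as.character(old_f))
  expect_identical(levels(new_f), levels(old_f))
})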

I've also been interested in https://lorenzwalthert.github.io/touchstone/ for a while, which is designed for exactly this type of improvement: it involves a bit of setup, but then you get a benchmark comment added to each PR.
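
If I remember the docs correctly, the one-time setup is a single scaffolding call (worth double-checking against the link above):

# Scaffolds the touchstone/ directory and the GitHub Actions workflow that
# posts benchmark results as a PR comment (per my reading of the docs)
touchstone::use_touchstone()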