Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland
https://public-health-scotland.github.io/phsmethods/

Improve `create_age_groups` #135

Open · Nic-Chr opened this issue 3 days ago

Nic-Chr commented 3 days ago

Hi, I think we can improve the speed of `create_age_groups` quite a bit and also remove the dependency on the 'utils' package.

We can avoid `cut()`, which is inefficient at creating factors because it goes through unnecessary `unique()` + `match()` steps internally. Since we already have cleaned age breaks that are unique and sorted, we can skip `cut()` and call `.bincode()` directly. `.bincode()` is essentially a low-level factor-code constructor, and it is what `cut()` uses under the hood anyway. To get a character vector, all that's needed is to subset our age breaks by the bin codes.
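
To make the idea concrete, here is a minimal sketch using only base R (toy breaks, not the package's defaults):

# .bincode() returns the integer codes that cut() computes internally,
# without the factor-building overhead
breaks <- c(0, 5, 10, 15)
x <- c(3, 7, 12)

codes <- .bincode(x, breaks = breaks, right = FALSE)
codes
#> [1] 1 2 3

# Subsetting the band labels by the codes gives the character output directly
bands <- paste0(breaks[-length(breaks)], "-", breaks[-1] - 1L)
bands[codes]
#> [1] "0-4"   "5-9"   "10-14"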

On the topic of `cut()` inefficiency, there is a Stack Overflow thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved

Proposed function and benchmark:

create_age_groups <- function(x, from = 0, to = 90, by = 5, as_factor = FALSE) {

  if (!is.numeric(x)) {
    cli::cli_abort("{.arg x} must be a {.cls numeric} vector, not a {.cls {class(x)}} vector.")
  }

  # Cleaned breaks: unique and sorted, as .bincode() requires
  breaks <- seq(from, to, by)
  breaks <- sort(unique(breaks))
  n_breaks <- length(breaks)
  n <- max(n_breaks - 1L, 0L)

  # Band labels, e.g. "0-4", "5-9", ..., "85-89"
  bands <- paste0(
    breaks[seq_len(n)], "-",
    breaks[seq.int(to = n_breaks, length.out = n)] - 1L
  )

  # Open-ended final band, e.g. "90+"
  rightmost_band <- paste0(breaks[n_breaks], "+")
  bands[n_breaks] <- rightmost_band

  # Integer bin codes; the Inf upper break catches everything >= max(breaks)
  codes <- .bincode(x, breaks = c(breaks, Inf), right = FALSE)

  if (as_factor) {
    # Construct the factor directly from the codes, skipping factor()'s
    # unique() + match() overhead
    out <- codes
    levels(out) <- bands
    class(out) <- c("ordered", "factor")
  } else {
    out <- bands[codes]
  }
  out
}
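
For reference, the edge-case behaviour follows directly from this definition: NAs and ages below `from` fall outside every bin and come back as NA, while ages at or above `to` land in the open-ended band.

# NA and -1 have no bin; 90 and 120 both fall in "90+"
create_age_groups(c(NA, -1, 0, 89, 90, 120))
#> [1] NA      NA      "0-4"   "85-89" "90+"   "90+"
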
library(bench)

x <- 20
create_age_groups(x)
#> [1] "20-24"
phsmethods::create_age_groups(x)
#> [1] "20-24"

mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)               98.6µs   103µs     9202.    2.41KB    10.4 
#> 2 phsmethods::create_age_groups(x)  155.5µs   168µs     5697.    3.84KB     8.30
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 101µs  106µs     8998.    2.71KB    12.7 
#> 2 phsmethods::create_age_groups(x, as… 153µs  162µs     5868.    3.84KB     8.30

x <- sample(0:100, 10^7, replace = TRUE)
mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)                414ms   416ms      2.40     191MB     2.40
#> 2 phsmethods::create_age_groups(x)    701ms   701ms      1.43     420MB     4.28
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 361ms  361ms      2.77     153MB     2.77
#> 2 phsmethods::create_age_groups(x, as… 654ms  654ms      1.53     343MB     1.53

Created on 2024-09-19 with reprex v2.0.2

This obviously relates to issues #93 and #54, which I also think are worthwhile, but as a subsequent step.

Moohan commented 3 days ago

I'm always keen on speed/memory improvements (especially ones that hold up on larger datasets). When I've made changes like this to other functions in the past, my usual approach has been to start by expanding the tests for the function(s) so I'm 100% sure there are no unintended regressions or behaviour changes.
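
As a rough sketch of the kind of equivalence test I mean (assuming testthat, with the proposed function from the comment above loaded alongside the released one):

library(testthat)

# Run the proposed version against the released one over awkward inputs
test_that("proposed create_age_groups matches current behaviour", {
  x <- c(NA, -1, 0, 4, 5, 89, 90, 120)

  expect_identical(
    create_age_groups(x),
    phsmethods::create_age_groups(x)
  )

  new_f <- create_age_groups(x, as_factor = TRUE)
  old_f <- phsmethods::create_age_groups(x, as_factor = TRUE)

  # Compare labels and level order separately; whether the result should be
  # an ordered factor is worth deciding explicitly if the versions differ
  expect_identical(as.character(new_f), as.character(old_f))
  expect_identical(levels(new_f), levels(old_f))
})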

I've also been interested in https://lorenzwalthert.github.io/touchstone/ for a while, which is designed for exactly this type of improvement: it involves a bit of setup, but then you get a benchmark comment added to each PR.
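
If I remember the docs correctly, the one-time setup is a single scaffolding call (worth double-checking against the link above):

# Scaffolds the touchstone/ directory and the GitHub Actions workflow that
# posts benchmark results as a PR comment (per my reading of the docs)
touchstone::use_touchstone()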