Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)
https://public-health-scotland.github.io/phsmethods/
54 stars 13 forks source link

extract_fin_year not matching PHS format. #120

Closed Nic-Chr closed 7 months ago

Nic-Chr commented 7 months ago

Hello, the documentation for extract_fin_year states that the "format for financial year is YYYY/YY" though for certain years this format is not applied. Please see the below reprex.

packageVersion("phsmethods")
#> [1] '1.0.0'

library(tidyverse)
extract_fin_year <- function(date){
  if (!inherits(date, c("Date", "POSIXt"))) {
    cli::cli_abort("{.arg date} must be a {.cls Date} or {.cls POSIXt} vector,\n not a {.cls {class(date)}} vector.")
  }
  y <- lubridate::year(date)
  m <- lubridate::month(date)
  fy <- y - ( (m - 3) %/% 1  <= 0)
  next_fy <- (fy + 1) %% 100
  out <- sprintf("%.4d/%02d", fy, next_fy)
  out[is.na(date)] <- NA_character_
  out
}

x <- dmy("01/April/1999")

cat("phsmethods: ", phsmethods::extract_fin_year(x))
#> phsmethods:  1999/100
cat("expected: ", extract_fin_year(x))
#> expected:  1999/00

x <- dmy("01/December/2000")

cat("phsmethods: ", phsmethods::extract_fin_year(x))
#> phsmethods:  2000/1
cat("expected: ", extract_fin_year(x))
#> expected:  2000/01

x <- dmy("01/December/0001")

cat("phsmethods: ", phsmethods::extract_fin_year(x))
#> phsmethods:  1/2
cat("expected: ", extract_fin_year(x))
#> expected:  0001/02

Created on 2023-11-16 with reprex v2.0.2

Nic-Chr commented 7 months ago

Below I include a benchmark of the above function compared to phsmethods::extract_fin_year() as well as tests to ensure they produce the same financial years.

library(tidyverse)
library(lubridate)
library(bench)

extract_fin_year <- function(date){
  if (!inherits(date, c("Date", "POSIXt"))) {
    cli::cli_abort("{.arg date} must be a {.cls Date} or {.cls POSIXt} vector,\n not a {.cls {class(date)}} vector.")
  }
  y <- lubridate::year(date)
  m <- lubridate::month(date)
  fy <- y - ( (m - 3) %/% 1  <= 0)
  next_fy <- (fy + 1) %% 100
  out <- sprintf("%.4d/%02d", fy, next_fy)
  out[is.na(date)] <- NA_character_
  out
}

# Single date
x <- dmy(01012023)

mark(extract_fin_year(x),
     phsmethods::extract_fin_year(x))
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 extract_fin_year(x)                 52µs     58µs    15339.    5.88KB     17.7
#> 2 phsmethods::extract_fin_year(x)    150µs    164µs     5663.   47.83KB     19.1

# Single date-time
x <- as_datetime(x)

mark(extract_fin_year(x),
     phsmethods::extract_fin_year(x))
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 extract_fin_year(x)               42.3µs   46.3µs    19650.        0B     19.3
#> 2 phsmethods::extract_fin_year(x)  118.8µs  131.3µs     6957.        0B     21.4

# Lots of dates
x <- dmy(01012000) + days(0:10^6)

mark(extract_fin_year(x),
     phsmethods::extract_fin_year(x), 
     check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 extract_fin_year(x)                1.59s    1.59s     0.627     202MB     2.51
#> 2 phsmethods::extract_fin_year(x)    6.62s    6.62s     0.151     549MB     1.06

# Helper to conert phs::extract_fin_year to newer format 

### This is ONLY FOR TESTING and not to be used in any production code.
phs_convert_fin_year <- function(x){
  # All digits before slash
  digits1 <- str_extract(x, regex("^[:digit:]*(?=\\/)"))
  # All digits after slash
  digits2 <- str_extract(x, regex("(?<=\\/)[:digit:]*"))
  # Use YYYY/YY format
  out <- sprintf("%.4d/%02d", as.integer(digits1), as.integer(digits2) %% 100L)
  out[is.na(x)] <- NA_character_
  out
}

# Generate random dates

x <- seq(dmy(01011900), today(), by = "day")
x <- sample(x, size = 10^5, replace = TRUE)
x[sample.int(length(x), 10^2, TRUE)] <- NA

# Here we're making sure the new function produces correct financial years.

identical(extract_fin_year(x), 
          phs_convert_fin_year(
            phsmethods::extract_fin_year(x)
          ))
#> [1] TRUE

Created on 2023-11-16 with reprex v2.0.2

Moohan commented 7 months ago

Thanks for spotting and reporting this issue @Nic-Chr

If you have time, would you be able to put this together into a PR? Other than the code you've quoted above I think we would need to add some more tests (clearly some cases are being missed from the current set of tests).

Give me a message if you have any questions about the PR or testthat.

If you won't have time, just let us know and (@Tina815 or) I will implement it at some point!

Nic-Chr commented 7 months ago

Thanks @Moohan, sounds good. I'll go ahead and put this into a PR with some additional tests.