ctsit / rcc.billing

Automated, data-driven service billing implemented on REDCap Custodian
https://ctsit.github.io/rcc.billing/
Apache License 2.0
0 stars 3 forks source link

Refactor get orphaned projects, go faster, and add reasons #171

Closed pbchase closed 1 year ago

pbchase commented 1 year ago

This PR does a few different things. Yes, it's a little overloaded. Whatever!

Refactor

It starts with two refactor commits. The first, _Refactor SQLite out of get_orphanedprojects and its tests, was to switch the testing DB to Duck DB so we can refactor the date hacks out of get_orphaned_projects.

I did that so I could add dbplyr-centered performance enhancements to get_orphaned_projects in _Add speed improvements to get_orphanedprojects It nows runs against prod in about 2 seconds instead of 30+. The enhancements were multifold. Adding filters to project_ownership before joining it was a huge speed boost. The second thing I did was to combine filter statements wherever possible. In some cases this provides very large gains, you will see this appear more often if I need a boost to dbplyr speed.

FYI, I tested the shit out of the refactored code. not only does it pass the units tests, I iterated both the old and new code across 5 months of Prod data and got a perfect match of the output. The testing harness was interesting. Thsi is what I did:

# steal preamble from sequester_orphans
library(rcc.billing)
library(redcapcustodian)
library(RMariaDB)
library(DBI)
library(tidyverse)
library(lubridate)
library(dotenv)
library(sendmailR)
library(tableHTML)

init_etl("my_awesome_test")

rc_conn <- connect_to_redcap_db()
rcc_billing_conn <- connect_to_rcc_billing_db()

# Wrap a copy of the new code (not included here) with a function that reorders the parameter list for Purrr
# and appends the months, previous so I can tell which call (month) found what problems
get_orphaned_projects_for_iteration_new <- function(months_previous, rc_conn, rcc_billing_conn) {
  get_orphaned_projects_new(
    rc_conn = rc_conn,
    rcc_billing_conn = rcc_billing_conn,
    months_previous = months_previous) %>%
    mutate(months_previous = months_previous)
}

# Do as above but for the old code (also not included here)
get_orphaned_projects_for_iteration_old <- function(months_previous, rc_conn, rcc_billing_conn) {
  get_orphaned_projects_old(
    rc_conn = rc_conn,
    rcc_billing_conn = rcc_billing_conn,
    months_previous = months_previous) %>%
    mutate(months_previous = months_previous)
}

# make a vector of months to iterate over
months_previous <- c(
  0,1,2,3,4
)  

# Run the new and the old cold over the same vector of months
new <- purrr::map_dfr(months_previous, get_orphaned_projects_for_iteration_new, rc_conn, rcc_billing_conn)
old <- purrr::map_dfr(months_previous, get_orphaned_projects_for_iteration_old, rc_conn, rcc_billing_conn)

# verify the awesome
all_equal(new, old)
# Be happy :-)

Feature, not bug

I worked my tail off to get the speed improvements above so I could chase what I expected to be a nasty bug. It turns out the the bug was OK code not communicating things clearly to Taryn or me. it confused us both. Thus I created Add reason to project sequestration messages. It does these things:

It also does this one completely unrelated thing that Sandra reminded me of just as I was committing this code: