Closed Trong-SMH closed 2 months ago
Hey @Trong-SMH, I implemented similar changes in n_imaging()
, n_routine_bloodwork()
, and n_rbc_transfusions()
. As with the other functions, the speed improvement is mainly noticeable when running these functions on the whole DB (when subsetting the cohort, the old version of the code is faster - maybe we can dig into that at a later point, but for now I optimized things to be run on the whole DB). Would you be able to take a final quick look at the changes I made (mostly in this commit: https://github.com/GEMINI-Medicine/Rgemini/pull/145/commits/09c10f1ce6b675b80b0fe966bfd3f6f3276592d5)? Once everything looks good to you, I can go ahead and merge the branch, update the package version, and notify the team. Thanks!
Here is the testing script I used:
library(Rgemini)
library(GEMINIpkg)
library(RPostgreSQL)
library(lubridate)
library(data.table)
library(getPass)
library(dplyr)
library(stringr)
source("~/GitHub/Rgemini/R/utils.R")
# Load DB con
drv <- dbDriver("PostgreSQL")
db <-DBI::dbConnect(drv,
dbname="report_db_v3", #report_db_v3 drm_cleandb_v2_1_3
host="prime.smh.gemini-hpc.ca",
port=5432,
user = "lofflera",
password = getPass("Enter Password:"))
adm<-dbGetQuery(db, "Select genc_id, hospital_id from admdad;") %>% data.table()
##### n imaging #####
source("~/GitHub/Rgemini/R/n_imaging.R")
tictoc::tic()
A<-evalq(n_imaging(db, cohort = adm, exclude_ed = FALSE), envir = .GlobalEnv)
tictoc::toc()
# 20 sec elapsed (20 with exclude_ed = TRUE)
tictoc::tic()
B<-Rgemini::n_imaging(db, cohort = adm, exclude_ed = FALSE)
tictoc::toc()
# 102 sec elapsed (71 with exclude_ed = TRUE)
#test_diff
delta<-anti_join(A, B)
nrow(delta)
# [1] 0
##### n routine bloodwork #####
source("~/GitHub/Rgemini/R/n_routine_bloodwork.R")
tictoc::tic()
A<-evalq(n_routine_bloodwork(db, cohort = adm, exclude_ed = FALSE), envir = .GlobalEnv)
tictoc::toc()
# 80 sec elapsed (100 with exclude_ed = TRUE)
tictoc::tic()
B<-Rgemini::n_routine_bloodwork(db, cohort = adm, exclude_ed = FALSE)
tictoc::toc()
# 1120 sec elapsed (152 with exclude_ed = TRUE)
#test_diff
delta<-anti_join(A,B)
nrow(delta)
# [1] 0
##### n RBC transfusions #####
adm$hospital_num <- adm$hospital_id
source("~/GitHub/Rgemini/R/n_rbc_transfusions.R")
tictoc::tic()
A<-evalq(n_rbc_transfusions(db, cohort = adm, exclude_ed = FALSE), envir = .GlobalEnv)
tictoc::toc()
# 110 sec elapsed (130 with exclude_ed = TRUE)
tictoc::tic()
B<-Rgemini::n_rbc_transfusions(db, cohort = adm, exclude_ed = FALSE)
tictoc::toc()
# 280 sec elapsed (410 with exclude_ed = TRUE)
#test_diff
delta<-anti_join(A,B)
nrow(delta)
# [1] 0
I made a test script here to check the difference in output and speed of the where exists implementation and current implementation.
I found there was no change in the output but there is indeed a speed difference. Generally with less number of hospitals the current pasting genc_ids seem to take less time to compute but more than 2-3 hospitals the where exists implementation is faster!
My test script here for reference: