Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)
https://public-health-scotland.github.io/phsmethods/
54 stars 13 forks source link

Some SMR / ICD10 related functions #79

Closed Moohan closed 2 years ago

Moohan commented 2 years ago

At the start of COVID a few of us (Charmaine Walker, @jvillacampa, Cecilia Peuch and Emma Nefaite), worked on a project to churn out lists of people (UPI numbers) who were alive and had had an admission for certain conditions in the previous x years. We developed a pretty comprehensive function to simplify some of the intricacies of using ICD 10 to look for conditions (main vs all conditions, dagger codes etc.) and do an SMR extract. I think it's too specific as is but contains lots of good stuff nonetheless! I wonder if it would be appropriate for phsmethods or if (even generalised) it's too specific a task?

function code ```r get_upi_for_disease <- function(disease_name, icd10_codes = ".", opcs4_codes = ".", lookback_years = NULL, lookback_weeks = NULL, mult_diag = FALSE, mult_proc = FALSE, dagger_codes = ".") { if (icd10_codes == "." && opcs4_codes == ".") { stop("Supply ICD10 or OPCS4 codes") } icd10_codes <- purrr::flatten_chr(purrr::map(icd10_codes, clean_icd10)) opcs4_codes <- purrr::flatten_chr(purrr::map(opcs4_codes, clean_icd10)) dagger_codes <- purrr::flatten_chr(purrr::map(dagger_codes, clean_icd10)) if (is.null(lookback_weeks)) { # Can deal with fractional years e.g. 6 months is 0.5 years date_of_interest <- as.character(lubridate::dmy("01-Feb-2020") - lubridate::months(lookback_years * 12)) } else if (is.null(lookback_years)) { date_of_interest <- as.character(lubridate::dmy("01-Feb-2020") - lubridate::weeks(lookback_weeks)) } else { stop("Only use one of weeks or years") } # Set up the SMRA connection SMRA_connection <- odbc::dbConnect( drv = odbc::odbc(), dsn = "SMRA", uid = rstudioapi::showPrompt(title = "Username", message = "Username:"), pwd = rstudioapi::askForPassword("SMRA Password:"), encoding = "ASCII" ) # Use tbl() from dbplyr to do as much processing as possible in SQL # This all happens in the background and we can just use dplyr commands deaths <- dplyr::tbl( SMRA_connection, dbplyr::in_schema("ANALYSIS", "GRO_DEATHS_C") ) %>% dplyr::select(UPI_NUMBER, DATE_OF_DEATH) %>% dplyr::filter(!is.na(UPI_NUMBER)) %>% dplyr::filter(DATE_OF_DEATH >= To_Date(date_of_interest, "YYYY-MM-DD")) %>% dplyr::distinct(UPI_NUMBER) # Set up a simple SMR01 extract (same as in the previous example) SMR01 <- dplyr::tbl(SMRA_connection, "SMR01_PI") %>% dplyr::select( UPI_NUMBER, ADMISSION_DATE, MAIN_CONDITION, MAIN_OPERATION, OTHER_CONDITION_1, OTHER_CONDITION_2, OTHER_CONDITION_3, OTHER_CONDITION_4, OTHER_CONDITION_5, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 ) %>% dplyr::filter(!is.na(UPI_NUMBER)) %>% dplyr::filter(ADMISSION_DATE >= To_Date(date_of_interest, "YYYY-MM-DD")) if (mult_diag == F) { SMR01 <- SMR01 %>% dplyr::filter(MAIN_CONDITION %in% icd10_codes | substr(MAIN_OPERATION, 1, 4) %in% opcs4_codes) %>% dplyr::distinct(UPI_NUMBER) } else if (mult_diag == T) { SMR01 <- SMR01 %>% dplyr::filter(MAIN_CONDITION %in% icd10_codes | OTHER_CONDITION_1 %in% icd10_codes | OTHER_CONDITION_2 %in% icd10_codes | OTHER_CONDITION_3 %in% icd10_codes | OTHER_CONDITION_4 %in% icd10_codes | OTHER_CONDITION_5 %in% icd10_codes | substr(MAIN_OPERATION, 1, 4) %in% opcs4_codes) %>% dplyr::distinct(UPI_NUMBER) } else if (mult_proc == T) { SMR01 <- SMR01 %>% dplyr::filter(substr(MAIN_OPERATION, 1, 4) %in% opcs4_codes, substr(OTHER_OPERATION_1, 1, 4) %in% opcs4_codes, substr(OTHER_OPERATION_2, 1, 4) %in% opcs4_codes, substr(OTHER_OPERATION_3, 1, 4) %in% opcs4_codes) %>% dplyr::distinct(UPI_NUMBER) } else if (dagger_codes != ".") { SMR01 <- SMR01 %>% dplyr::filter((MAIN_CONDITION %in% icd10_codes | OTHER_CONDITION_1 %in% icd10_codes | OTHER_CONDITION_2 %in% icd10_codes | OTHER_CONDITION_3 %in% icd10_codes | OTHER_CONDITION_4 %in% icd10_codes | OTHER_CONDITION_5 %in% icd10_codes) | (MAIN_CONDITION %in% dagger_codes & OTHER_CONDITION_1 %in% dagger_codes) ) %>% dplyr::distinct(UPI_NUMBER) } # Remove the dead people from the SMR01 extract alive <- SMR01 %>% dplyr::anti_join(deaths, by = "UPI_NUMBER") # Collect the data alive <- dplyr::collect(alive) %>% dplyr::mutate(disease = disease_name) message( paste( "For", disease_name, "there are", nrow(alive), "people since", date_of_interest, "- excluding deaths and missing CHI numbers" ) ) return(alive) } ```

Usage (in the current form) would be:

data <- extract_fuction(
  disease_name = "Autoimmune hepatitis",
  icd10_codes = c("K754"),
  lookback_years = 10
)
For Autoimmune hepatitis there are 781 people since 2010-02-01 - excluding deaths and missing CHI numbers

The main benefit of this is that it returns the query super quickly as it relies heavily on dbplyr/SQL whilst being pretty simple to use!

There's a helper function needed for the above which takes an ICD 10 code and cleans it up and also returns a vector containing all possible 'lower-level' codes, which makes doing exact searches simpler and readable i.e. MAIN_CONDITION %in% icd_10s rather than substr(MAIN_CONDITION , 1, 4) | ... etc. I think the helper function is more obviously appropriate for phsmethods.

clean_icd10 <- function(icd10) {
  if (icd10 == ".") {
    return(icd10)
  } else if (nchar(icd10) == 6) {
    return(icd10)
  } else if (nchar(icd10) == 5) {
    return(stringr::str_c(icd10, c("D", "A")))
  } else if (nchar(icd10) == 4) {
    return(stringr::str_c(icd10, c("", as.character(0:9), " D", " A")))
  } else if (nchar(icd10) == 3) {
    # Fill out with all possible 4th characters
    return(stringr::str_c(icd10, c(as.character(0:9), "X", ".")))
  } else {
    stop("ICD-10 code is wrong length")
  }
}
> clean_icd10("K12")
 [1] "K120" "K121" "K122" "K123" "K124" "K125" "K126" "K127" "K128" "K129" "K12X" "K12."