darwin-eu-dev / PatientProfiles

https://darwin-eu-dev.github.io/PatientProfiles/
Apache License 2.0

Strange and possibly subtle error with summarizeLargeScaleCharacteristics that only shows up on Iqvia drug exposure data #600

Closed ablack3 closed 5 months ago

ablack3 commented 5 months ago

During the MDD study we ran large-scale characteristics. The output contains drug exposure results within the various age and sex strata, but no results for the "overall" stratum. Perhaps we are misunderstanding how this function should work: we expect any record captured by a specific age or sex stratum to also be captured by the "overall" stratum.

In all databases except Iqvia we have drug results in the overall stratum. In Iqvia we have condition results in the overall stratum but no drug results.

Here is the code we ran.

#' Get the patient characteristics.
#'
#' @param cdm cdm connector reference created with `CDMConnector::cdm_from_con()`
#' @param logger logger object
#' @param minCellCount minimum cell count to report in results
#'
#' @return a data frame of patient characteristics with a leading `cdm_name` column
#'
#' @export
runPatientCharacterisation <- function(cdm, 
                                       logger = log4r::logger(), 
                                       minCellCount = 0) {

  checkmate::assert_class(cdm, "cdm_reference")
  checkmate::assert_class(logger, "logger")
  checkmate::assert_integerish(minCellCount, lower = 0, any.missing = FALSE)

  targetTableName <- "mdd_large_char"

  log4r::info(logger, "Creating the MDD cohort for large scale characterization cohorts")
  mddCohortSet <- CDMConnector::readCohortSet(system.file("cohorts", "first_mdd_dx", package = "P2C1008DUSMDD", mustWork = TRUE))
  stopifnot(nrow(mddCohortSet) == 1)

  cdm <- CDMConnector::generateCohortSet(cdm = cdm,
                                         cohortSet = mddCohortSet,
                                         name = targetTableName,
                                         overwrite = TRUE)

  log4r::info(logger, "Getting patient characteristics")

  cdm[[targetTableName]] <- cdm[[targetTableName]] %>%
    PatientProfiles::addAge(
      ageGroup = list(c(12, 17),
                      c(18, 44),
                      c(45, 64),
                      c(65, 150)),
      ageDefaultMonth = 1,
      ageDefaultDay = 1,
      ageImposeMonth = TRUE,
      ageImposeDay = TRUE) %>%
    PatientProfiles::addSex() %>%
    dplyr::filter(age >= 12) %>% 
    CDMConnector::computeQuery()

  log4r::info(logger, "Running drug characterisation")
  suppressWarnings({
    drugResult <- cdm[[targetTableName]] %>%
      PatientProfiles::summariseLargeScaleCharacteristics(cdm = cdm,
                                                          strata = list("sex", "age_group"),
                                                          window = list(c(-Inf,-1),c(-365, -31), c(-31, -1)),
                                                          eventInWindow = "drug_exposure",
                                                          minCellCount = minCellCount)
  })

  log4r::info(logger, "Running condition characterisation")
  suppressWarnings({
    conditionResult <- cdm[[targetTableName]] %>%
      PatientProfiles::summariseLargeScaleCharacteristics(cdm = cdm,
                                                          strata = list("sex", "age_group"),
                                                          window = list(c(-Inf,-1),c(-365, -31), c(-31, -1)),
                                                          eventInWindow = "condition_occurrence",
                                                          minCellCount = minCellCount)
  })

  result <- dplyr::bind_rows(drugResult, conditionResult) %>% 
    dplyr::mutate(cdm_name = CDMConnector::cdmName(cdm)) %>% 
    dplyr::select("cdm_name", dplyr::everything())

  return(result)
}

Here are the Iqvia results: patient_characterisation.csv

When you filter for the "overall" stratum, you will find no drug exposures.

This may take some investigation to figure out, and it could be an issue with the database rather than the software. So at this point let's first establish that the result is indeed unexpected.

@catalamarti would you expect that the overall strata would include any records captured by the age and sex strata? (i.e. "overall" includes all strata?)

Also tagging @mderidder95 who identified this issue.

ablack3 commented 5 months ago

Using PatientProfiles v0.5.1.

mderidder95 commented 5 months ago

Correction: there are 'overall' records for drug exposure for windows (-Inf, -1) and (-365, -31), but NOT for window (-31, -1).

ablack3 commented 5 months ago

One question: would we expect the overall counts to be the sum of the strata counts? For example, in a specific time window for a specific drug, if we have 10 records for males and 5 records for females, would we expect 15 records for overall?

catalamarti commented 5 months ago

Hi @mderidder95 @ablack3, I think your unexpected results are due to the minimumFrequency argument (see https://darwin-eu-dev.github.io/PatientProfiles/reference/summariseLargeScaleCharacteristics.html ). Only counts with a percentage higher than 0.5% are reported. This limit is there to reduce the large amount of data produced by this function, but it can be turned off by setting minimumFrequency to 0.
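
To illustrate how a frequency threshold could hide the overall row while a stratum row remains, here is a minimal sketch in base R. The counts, denominators, and stratum labels are made up for illustration, and the filtering logic (compare each row's count with that row's own denominator) is an assumption about how a threshold like minimumFrequency behaves, not the package's actual implementation.

```r
# Hypothetical counts for one drug concept in one window (not real data).
counts <- data.frame(
  strata_level = c("overall", "12 to 17", "65 to 150"),
  n_with_drug  = c(40, 2, 30),
  n_subjects   = c(10000, 3000, 1000)
)

# Assumed threshold, matching the 0.5% mentioned above.
minimum_frequency <- 0.005

# Frequency is computed within each row's own denominator.
counts$frequency <- counts$n_with_drug / counts$n_subjects

# Keep only rows whose frequency is higher than the threshold.
reported <- counts[counts$frequency > minimum_frequency, ]
print(reported)
```

In this toy case the concept is concentrated in the oldest age group, so that stratum passes the threshold (30 / 1000 = 3%) while overall does not (40 / 10000 = 0.4%).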

As for the non-matching counts: I know that some individuals in Iqvia have missing sex, and the None stratum will not be displayed if it is suppressed due to minCellCount or minimumFrequency. Note that when None is reported, e.g. concept = 41042861, the numbers do match.
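
A small base R sketch of why the visible strata may not add up to overall. The data are invented (10 males and 5 females as in the question above, plus 3 subjects with missing sex), and the suppression rule with a minimum cell count of 5 is an assumption chosen only for illustration.

```r
# Hypothetical subjects with a given drug in a given window (not real data).
# "None" stands for subjects whose sex is missing.
subjects <- data.frame(
  sex = c(rep("Male", 10), rep("Female", 5), rep("None", 3))
)

by_sex  <- table(subjects$sex)
overall <- nrow(subjects)

# With every stratum present, the strata add up to overall.
all_strata_total <- sum(by_sex)

# Assumed suppression: strata with fewer than 5 subjects are not shown.
min_cell_count <- 5
visible <- by_sex[by_sex >= min_cell_count]
visible_total <- sum(visible)

print(c(overall = overall, all_strata = all_strata_total, visible = visible_total))
```

Here Male + Female gives 15 while overall is 18; the difference is the hidden None stratum.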

ablack3 commented 5 months ago

Thanks for the explanation, Marti. We could try rerunning just this step of the study with minimumFrequency = 0 and see if the results change as expected on Iqvia.
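
A rough sketch of what that rerun might look like, reusing the arguments from the drug step in the function above and adding minimumFrequency = 0 as suggested. The helper name rerunDrugCharacterisation is hypothetical, and the sketch assumes the cohort table already carries the sex and age_group columns and that the argument names match PatientProfiles v0.5.1.

```r
# Hypothetical helper: same call as the drug step above, with the
# frequency filter disabled via minimumFrequency = 0.
rerunDrugCharacterisation <- function(cdm,
                                      targetTableName = "mdd_large_char",
                                      minCellCount = 0) {
  PatientProfiles::summariseLargeScaleCharacteristics(
    cdm[[targetTableName]],
    cdm = cdm,
    strata = list("sex", "age_group"),
    window = list(c(-Inf, -1), c(-365, -31), c(-31, -1)),
    eventInWindow = "drug_exposure",
    minCellCount = minCellCount,
    minimumFrequency = 0  # assumption: report all concepts regardless of frequency
  )
}
```

Calling rerunDrugCharacterisation(cdm) on the Iqvia cdm and filtering for the overall stratum in window (-31, -1) would then show whether the missing rows come back.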

ablack3 commented 5 months ago

Ok I think this is solved. Thank you @catalamarti !!