darwin-eu-dev / PatientProfiles

https://darwin-eu-dev.github.io/PatientProfiles/
Apache License 2.0

Possibly a double-counting bug in addTableIntersectCount #671

Closed OskarGauffin closed 2 months ago

OskarGauffin commented 2 months ago

Hi,

I found this in the exercises of the Oxford RWE summer school. It looks like a bug to me, but I may be wrong.

It's the exercise where we're supposed to find the average number of prescriptions using PatientProfiles.

//Oskar

`###############################
library(CodelistGenerator)
library(CDMConnector)
library(duckdb)
library(PatientProfiles)
library(dplyr)

con <- dbConnect(duckdb(), eunomia_dir())
cdm <- cdmFromCon(con = con, cdmSchema = "main", writeSchema = "main")

cdm <- generateConceptCohortSet(
  cdm = cdm,
  name = "sinusitis",
  conceptSet = list(
    "bacterial_sinusitis" = 4294548,
    "viral_sinusitis" = 40481087,
    "chronic_sinusitis" = 257012,
    "any_sinusitis" = c(4294548, 40481087, 257012)
  ),
  limit = "all",
  end = 0
)

##############

solution:

cdm$sinusitis |>
  addTableIntersectCount(
    tableName = "drug_exposure",
    window = c(-Inf, Inf),
    targetEndDate = NULL,
    nameStyle = "number_prescriptions"
  ) |>
  filter(cohort_definition_id == 2) |> # Filter on cohort after intersection.
  summarise(mean_prescription = mean(number_prescriptions))

gives you 50.

cdm$sinusitis |>
  filter(cohort_definition_id == 2) |> # Filter on cohort before intersection.
  addTableIntersectCount(
    tableName = "drug_exposure",
    window = c(-Inf, Inf),
    targetEndDate = NULL,
    nameStyle = "number_prescriptions"
  ) |>
  summarise(mean_prescription = mean(number_prescriptions))

gives you 25.

Which one is correct?

Check number of prescriptions for subject_id = 806

This subject belongs to all four sinusitis cohorts:

cdm$sinusitis |>
  filter(subject_id == 806) |>
  distinct(subject_id, cohort_definition_id)

And there are 21 records in the drug_exposure table for person_id = 806.

cdm$drug_exposure |>
  filter(person_id == 806) |>
  count()

##################################################

cdm$sinusitis |>
  filter(subject_id == 806) |>
  filter(cohort_definition_id == 2) |> ################ FILTER on cohort before intersection
  addTableIntersectCount(
    tableName = "drug_exposure",
    window = list(c(-Inf, Inf)),
    nameStyle = "number_prescriptions"
  ) |>
  pull("number_prescriptions") |>
  mean()

21 - correct.

##################

cdm$sinusitis |>
  filter(subject_id == 806) |>
  addTableIntersectCount(
    tableName = "drug_exposure",
    window = list(c(-Inf, Inf)),
    nameStyle = "number_prescriptions"
  ) |>
  filter(cohort_definition_id == 2) |> ################ FILTER on cohort after intersection
  pull("number_prescriptions") |>
  mean()

42, double the correct answer. I find it surprising (possibly a bug) that the count is doubled when the cohort filter is applied after the intersection rather than before it.

`
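For illustration only, here is a minimal base R sketch of one way a count like this can get inflated. It assumes the count is computed per subject and date without cohort_definition_id in the key, so rows that share a subject and date across cohorts are counted together. This is a hypothetical mechanism on made-up toy data, not the confirmed cause in PatientProfiles and not its actual implementation; the names `naive_result` and `keyed` are invented for the example.

```r
# Toy data (hypothetical, not the Eunomia records): one subject whose entry
# appears in two cohorts with identical dates.
cohort <- data.frame(
  cohort_definition_id = c(2, 4),
  subject_id = c(806, 806),
  cohort_start_date = c("2000-01-01", "2000-01-01")
)
drug_exposure <- data.frame(
  person_id = rep(806, 3),
  drug_exposure_id = 1:3
)

# Join cohort rows to drug records on the subject only.
joined <- merge(cohort, drug_exposure, by.x = "subject_id", by.y = "person_id")

# Assumed faulty variant: count per subject and date, ignoring the cohort id,
# then attach the count back to every cohort row.
naive <- aggregate(
  drug_exposure_id ~ subject_id + cohort_start_date,
  data = joined, FUN = length
)
names(naive)[3] <- "number_prescriptions"
naive_result <- merge(cohort, naive, by = c("subject_id", "cohort_start_date"))

# Variant that keeps cohort_definition_id in the grouping key.
keyed <- aggregate(
  drug_exposure_id ~ cohort_definition_id + subject_id + cohort_start_date,
  data = joined, FUN = length
)
names(keyed)[4] <- "number_prescriptions"

print(naive_result$number_prescriptions) # 6 6: each cohort row gets the inflated count
print(keyed$number_prescriptions)        # 3 3: matches the 3 drug records
```

In this toy case the inflated count is the true count multiplied by the number of cohort rows sharing the same subject and date, which is why filtering to a single cohort first avoids it.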

catalamarti commented 2 months ago

Thanks for reporting, @OskarGauffin, and thanks to @ilovemane for fixing it.