darwin-eu-dev / PatientProfiles

https://darwin-eu-dev.github.io/PatientProfiles/
Apache License 2.0

User feedback: variables without matching functions are silently ignored in summariseResult #649

Closed ablack3 closed 3 months ago

ablack3 commented 3 months ago

If a variable has no matching functions, it is silently left out of the results, which is somewhat surprising.

To know what results I will get I need to know the types of the variables in the variables argument and which functions go with which types.

In the example below I need to know that sex is a categorical variable, and since none of the functions passed to the estimates argument apply to categorical variables, sex will not be included in the results.

I think the previous interface was more intuitive: the types of the variables and the functions that apply to each type were explicit in the code rather than handled implicitly inside the function.

numericVariables = .... numericFunction = .... categoricalVariables = ... categoricalFunctions =... etc

I'm not suggesting it gets changed back (although we could discuss it), just that more feedback is collected before interface changes are made.

In this case it might be good to print a message if the user asks to summarise a variable but none of the requested functions apply to that variable.

The thing that connects these two arguments is the type of the variable which is completely implicit now. You can't know which functions apply to which variables just by looking at the arguments anymore.

    variables = c("some_variable", "some_other_variable"), 
    estimates = c("mean", "sd", "count"),

library(duckdb)
#> Loading required package: DBI
library(CDMConnector)
library(PatientProfiles)
library(dplyr)
library(CodelistGenerator)

cdm <- cdmFromCon(
  con = dbConnect(duckdb(), eunomia_dir()), 
  cdmSchema = "main", 
  writeSchema = "main"
)

cdm <- generateConceptCohortSet(
  cdm = cdm, 
  conceptSet = list("sinusitis" = c(4294548, 4283893, 40481087, 257012)), 
  limit = "first",
  name = "my_cohort"
)

x <- cdm$my_cohort |>
  # add demographics variables
  addDemographics() |>
  # add a flag regarding if they had a prior occurrence of pharyngitis
  addConceptIntersectFlag(
    conceptSet = list(pharyngitis = 4112343),
    window = c(-Inf, -1),
    nameStyle = "pharyngitis_before"
  ) 

result <- x |>
  addCohortName() |>
  summariseResult(
    group = "cohort_name",
    includeOverallGroup = FALSE,
    includeOverallStrata = TRUE,
    variables = c("age", "sex", "pharyngitis_before"), 
    estimates = c("mean", "sd"),
    counts = FALSE
  ) 
#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> • pharyngitis_before: mean, sd
#> → Start summary of data, at 2024-04-30 07:27:25.609186
#> ✔ Summary finished, at 2024-04-30 07:27:25.780886

result |>
  count(variable_name, estimate_name)
#> # A tibble: 4 × 3
#>   variable_name      estimate_name     n
#>   <chr>              <chr>         <int>
#> 1 age                mean              1
#> 2 age                sd                1
#> 3 pharyngitis_before mean              1
#> 4 pharyngitis_before sd                1

Created on 2024-04-30 with reprex v2.0.2
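One way to spot the dropped variable after the fact is to compare the requested variables against the `variable_name` column of the result. The snippet below is only a sketch: `result` is a hand-built data frame standing in for the summarised result above (just the two columns used in the `count()` call), and `requested` and `missing_vars` are illustrative names, not part of PatientProfiles.

```r
# Stand-in for the relevant columns of the summarised result shown above.
# In a real session this would be the object returned by summariseResult().
result <- data.frame(
  variable_name = c("age", "age", "pharyngitis_before", "pharyngitis_before"),
  estimate_name = c("mean", "sd", "mean", "sd")
)

# The variables that were passed to the `variables` argument.
requested <- c("age", "sex", "pharyngitis_before")

# Variables that were requested but produced no rows in the result.
missing_vars <- setdiff(requested, unique(result$variable_name))

if (length(missing_vars) > 0) {
  message(
    "Requested but not summarised: ",
    paste(missing_vars, collapse = ", ")
  )
}
```

This only detects the problem once the summary has already run, so it is a workaround rather than the up-front feedback asked for in this issue.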

catalamarti commented 3 months ago

Hi @ablack3, as you can see in your own reprex, the estimates that will be calculated are printed:

#> ℹ The following estimates will be computed:
#> • age: mean, sd
#> • pharyngitis_before: mean, sd

What would you expect instead? Variable types and estimates are explained here: https://darwin-eu-dev.github.io/PatientProfiles/articles/summarise.html

ablack3 commented 3 months ago

I guess this makes sense, and I can explicitly set the functions I want to apply to each variable. I still think that if the user asks for something that is not possible (for example, the mean of a categorical variable), an error or warning should be given.
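To make the request concrete, here is a rough sketch in base R of the kind of up-front check I have in mind. Everything in it is an assumption for illustration: the `applicable_estimates` table, the two-way numeric/categorical classification, and the helper names `variable_type()` and `check_estimates()` are hypothetical and do not reflect how PatientProfiles resolves estimates internally.

```r
# Hypothetical mapping from variable type to the estimates that apply to it.
# This is an assumption for illustration, not PatientProfiles' actual table.
applicable_estimates <- list(
  numeric     = c("mean", "sd", "median", "min", "max"),
  categorical = c("count", "percentage")
)

# Classify a column as "numeric" or "categorical" from its R type.
variable_type <- function(x) {
  if (is.numeric(x)) "numeric" else "categorical"
}

# Return the estimates that would be computed for each variable, and warn
# about requested variables for which none of the requested estimates apply.
check_estimates <- function(data, variables, estimates) {
  plan <- lapply(variables, function(v) {
    intersect(estimates, applicable_estimates[[variable_type(data[[v]])]])
  })
  names(plan) <- variables

  dropped <- variables[lengths(plan) == 0]
  if (length(dropped) > 0) {
    warning(
      "No requested estimate applies to: ",
      paste(dropped, collapse = ", "),
      call. = FALSE
    )
  }
  plan
}

# Toy data with the same three variables as the reprex.
toy <- data.frame(
  age = c(34, 51, 47),
  sex = c("Female", "Male", "Female"),
  pharyngitis_before = c(0, 1, 1)
)

# Warns that no requested estimate applies to `sex`.
plan <- check_estimates(
  toy,
  variables = c("age", "sex", "pharyngitis_before"),
  estimates = c("mean", "sd")
)
```

Swapping `warning()` for `stop()` would turn this into a hard error; a warning seems the gentler default, since the remaining variables can still be summarised.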