darwin-eu-dev / omopgenerics

https://darwin-eu-dev.github.io/omopgenerics/
Apache License 2.0

Problems when reading an exportedSummarisedResult csv #354

Closed cebarboza closed 1 month ago

cebarboza commented 3 months ago

Hey,

I am having some problems when reading an exported summarised result. When the file is exported, all the settings go to the bottom, and when I read it with read_csv I get the following parsing warning from vroom:

>     resultsData <- read_csv(filesLocation[i], show_col_types = FALSE)
Warning message:                                                                                                                                                      
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 

After that, the information from the settings disappears, because the settings values are not the same type as the rest of the column.

datapasta::dpasta(resultsData %>% filter(variable_name == "settings"))
#> Error in resultsData %>% filter(variable_name == "settings"): could not find function "%>%"

results <- tibble::tribble(
  ~result_id, ~cdm_name, ~group_name, ~group_level, ~strata_name, ~strata_level, ~variable_name, ~variable_level,      ~estimate_name, ~estimate_type, ~estimate_value, ~additional_name, ~additional_level,
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,     "result_type.x",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,    "package_name.x",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA, "package_version.x",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,   "analysis_type.x",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,     "result_type.y",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,    "package_name.y",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA, "package_version.y",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,   "analysis_type.y",    "character",              NA,        "overall",         "overall",
           1,        NA,   "overall",    "overall",    "overall",     "overall",     "settings",              NA,    "min_cell_count",      "integer",               5,        "overall",         "overall"
  )

Created on 2024-06-05 with reprex v2.1.0

catalamarti commented 3 months ago

Hi @cebarboza, it works for me. How did you obtain that result object?

results <- tibble::tribble(
  ~result_id, ~cdm_name, ~group_name, ~group_level, ~strata_name, ~strata_level, ~variable_name, ~variable_level,      ~estimate_name, ~estimate_type, ~estimate_value, ~additional_name, ~additional_level,
  1,        "my cdm",   "cohort_name",    "acetaminophen",    "overall",     "overall",     "number subjects",              NA,     "count",    "integer",              "1500",        "overall",         "overall",
) |>
  omopgenerics::newSummarisedResult(settings = dplyr::tibble(
    result_id = 1, my_setting = 1, mock_data = FALSE
  ))
#> ! The following column type were changed:
#> • result_id: from double to integer
#> • variable_level: from logical to character

x <- tempdir()
omopgenerics::exportSummarisedResult(results, path = x, fileName = "my_result.csv", minCellCount = 0)

res <- readr::read_csv(file = file.path(x, "my_result.csv")) |> omopgenerics::newSummarisedResult()
#> Rows: 4 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (11): cdm_name, group_name, group_level, strata_name, strata_level, vari...
#> dbl  (1): result_id
#> lgl  (1): variable_level
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> ! The following column type were changed:
#> • result_id: from double to integer
#> • variable_level: from logical to character

res
#> # A tibble: 1 × 13
#>   result_id cdm_name group_name  group_level   strata_name strata_level
#>       <int> <chr>    <chr>       <chr>         <chr>       <chr>       
#> 1         1 my cdm   cohort_name acetaminophen overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
summary(res)
#> A summarised_result object with 1 rows, 1 different result_id, 1 different cdm
#> names, and 3 settings.
#> CDM names: my cdm.
#> Settings: my_setting, mock_data, and min_cell_count.
omopgenerics::settings(res)
#> # A tibble: 1 × 4
#>   result_id my_setting mock_data min_cell_count
#>       <int> <chr>      <lgl>              <int>
#> 1         1 1          FALSE                  0

results
#> # A tibble: 1 × 13
#>   result_id cdm_name group_name  group_level   strata_name strata_level
#>       <int> <chr>    <chr>       <chr>         <chr>       <chr>       
#> 1         1 my cdm   cohort_name acetaminophen overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
summary(results)
#> A summarised_result object with 1 rows, 1 different result_id, 1 different cdm
#> names, and 2 settings.
#> CDM names: my cdm.
#> Settings: my_setting and mock_data.
omopgenerics::settings(results)
#> # A tibble: 1 × 3
#>   result_id my_setting mock_data
#>       <int>      <dbl> <lgl>    
#> 1         1          1 FALSE

Created on 2024-06-07 with reprex v2.1.0

cebarboza commented 3 months ago

Hi, this example is from the documentation of CohortSurvival (0.5.1):

cdmSurvival <- CohortSurvival::mockMGUS2cdm()
singleEvent <- CohortSurvival::estimateSingleEventSurvival(cdmSurvival,
                                                           targetCohortTable = "mgus_diagnosis",
                                                           targetCohortId = 1,
                                                           outcomeCohortTable = "death_cohort",
                                                           outcomeCohortId = 1,
                                                           strata = list(c("age_group"),
                                                                         c("sex"),
                                                                         c("age_group", "sex")))
#> - Getting survival for target cohort 'mgus_diagnosis' and outcome cohort
#> 'death_cohort'
#> Getting overall estimates
#>                           

The estimates are numeric, but the estimate_value column in the data frame is character.


dplyr::glimpse(singleEvent)
#> Rows: 10,492
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name       <chr> "cohort", "cohort", "cohort", "cohort", "cohort", "co…
#> $ group_level      <chr> "mgus_diagnosis", "mgus_diagnosis", "mgus_diagnosis",…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "survival_probability", "survival_probability", "surv…
#> $ variable_level   <chr> "death_cohort", "death_cohort", "death_cohort", "deat…
#> $ estimate_name    <chr> "estimate", "estimate_95CI_lower", "estimate_95CI_upp…
#> $ estimate_type    <chr> "numeric", "numeric", "numeric", "numeric", "numeric"…
#> $ estimate_value   <chr> "1", "1", "1", "0.9697", "0.9607", "0.9787", "0.9494"…
#> $ additional_name  <chr> "time &&& outcome", "time &&& outcome", "time &&& out…
#> $ additional_level <chr> "0 &&& death_cohort", "0 &&& death_cohort", "0 &&& de…

omopgenerics::settings(singleEvent)
#> # A tibble: 1 × 5
#>   result_id result_type package_name   package_version analysis_type
#>       <int> <chr>       <chr>          <chr>           <chr>        
#> 1         1 survival    CohortSurvival 0.5.1           single_event

Also, isn't it kind of weird that the settings are sent to the bottom of the data, where they are inconsistent with the rest of the columns?


omopgenerics::exportSummarisedResult(singleEvent)

resultsData <- readr::read_csv(here::here("results_mock_2024_06_07.csv"), show_col_types = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

Then, when the data is parsed, it seems that vroom just drops the character values in estimate_value (they become NA), because the column is parsed as numeric.


tail(resultsData)
#> # A tibble: 6 × 13
#>   result_id cdm_name group_name group_level    strata_name       strata_level
#>       <dbl> <chr>    <chr>      <chr>          <chr>             <chr>       
#> 1         1 mock     cohort     mgus_diagnosis age_group &&& sex >=70 &&& M  
#> 2         1 <NA>     overall    overall        overall           overall     
#> 3         1 <NA>     overall    overall        overall           overall     
#> 4         1 <NA>     overall    overall        overall           overall     
#> 5         1 <NA>     overall    overall        overall           overall     
#> 6         1 <NA>     overall    overall        overall           overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <dbl>,
#> #   additional_name <chr>, additional_level <chr>

Created on 2024-06-07 with reprex v2.1.0
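
The loss described above can be reproduced in isolation. The following is a minimal sketch, not the code path that omopgenerics or readr actually took in the report: the three-line CSV is made up, and estimate_value is forced to double through col_types to imitate the type that was guessed for the real file.

```r
library(readr)

# Made-up file: one numeric estimate and one settings-style row whose
# estimate_value is text.
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "variable_name,estimate_type,estimate_value",
    "survival_probability,numeric,0.9697",
    "settings,character,survival"
  ),
  path
)

# Force estimate_value to double, imitating the guess made for the real file.
# The text value cannot be parsed as a number, so it becomes NA.
as_double <- suppressWarnings(
  read_csv(
    path,
    col_types = cols(.default = col_character(), estimate_value = col_double())
  )
)
as_double$estimate_value

# problems() lists the cells that failed to parse.
parse_issues <- problems(as_double)
parse_issues
```

Calling problems() on the imported data frame, as the warning suggests, is a quick way to check whether any settings rows were affected.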

catalamarti commented 3 months ago

Hi @cebarboza, this is a problem with readr and its column type guessing. You can use the col_types argument to read every column as character:

cdmSurvival <- CohortSurvival::mockMGUS2cdm()
#> ■■■■■■■■■■■■■■■■■■■ 60% | ETA: 2s
singleEvent <- CohortSurvival::estimateSingleEventSurvival(cdmSurvival,
                                                           targetCohortTable = "mgus_diagnosis",
                                                           targetCohortId = 1,
                                                           outcomeCohortTable = "death_cohort",
                                                           outcomeCohortId = 1,
                                                           strata = list(c("age_group"),
                                                                         c("sex"),
                                                                         c("age_group", "sex")))
#> - Getting survival for target cohort 'mgus_diagnosis' and outcome cohort
#> 'death_cohort'
#> Getting overall estimates
#>                           

omopgenerics::exportSummarisedResult(singleEvent, fileName = "my_data.csv")

resultsData <- readr::read_csv(here::here("my_data.csv"), col_types = c(.default = "c")) |>
  omopgenerics::newSummarisedResult()
#> ! The following column type were changed:
#> • result_id: from character to integer

Created on 2024-06-07 with reprex v2.0.2
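
Once every column has been read as character, numeric estimates can still be recovered row by row from estimate_type. This is a rough sketch on a made-up file; `numeric_rows` and `numeric_estimates` are illustrative names and not part of omopgenerics.

```r
library(readr)

# Made-up file mixing a numeric estimate with settings-style rows.
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "variable_name,estimate_name,estimate_type,estimate_value",
    "survival_probability,estimate,numeric,0.9697",
    "settings,result_type,character,survival",
    "settings,min_cell_count,integer,5"
  ),
  path
)

# Read every column as text so that no value is lost to type guessing.
raw <- read_csv(path, col_types = cols(.default = col_character()))

# Convert only the rows that declare a numeric or integer estimate.
numeric_rows <- raw$estimate_type %in% c("numeric", "integer")
numeric_estimates <- as.numeric(raw$estimate_value[numeric_rows])
```

Keeping estimate_value as character in the stored table and converting on demand avoids the type conflict between estimates and settings rows.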

cebarboza commented 3 months ago

Well, I don't see a problem per se in the assumptions readr is making, since it is doing what it is supposed to do: warning about those problems in the data frame.

And the problem remains: data might not be parsed correctly because of the design of the exported summarisedResult.

It is kind of risky, because the user might not know that they have to force readr to do the import correctly.

Maybe it would be good to discuss this further, or at least to be aware of it, since it affects how the results are generated, imported back from the data partners, used to build Shiny apps, etc.

@rossdwilliams @ablack3 @ginberg

ablack3 commented 2 months ago

I think Cesar is thinking about the workflow to serialize, deserialize, and combine summarizedResult objects for ReportGenerator.

My understanding is that summarizedResult objects should be able to hold results from any Darwin package and we should be able to row bind them together.

He found one issue when the "estimate_value" column is read in as numeric instead of character.

# omopgenerics::emptySummarisedResult() |> 
  # dplyr::glimpse()

packageVersion("omopgenerics")
#> [1] '0.2.1'

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

cdm1 <- CohortSurvival::mockMGUS2cdm()

attr(cdm1, "cdm_name") <- "cdm1"

singleEvent1 <- CohortSurvival::estimateSingleEventSurvival(cdm1,
                                                           targetCohortTable = "mgus_diagnosis",
                                                           targetCohortId = 1,
                                                           outcomeCohortTable = "death_cohort",
                                                           outcomeCohortId = 1,
                                                           strata = list(c("age_group")))
#> - Getting survival for target cohort 'mgus_diagnosis' and outcome cohort
#> 'death_cohort'
#> Getting overall estimates

omopgenerics::settings(singleEvent)
#> Error in eval(expr, envir, enclos): object 'singleEvent' not found

omopgenerics::exportSummarisedResult(singleEvent1, fileName = "results_cdm1.csv")

# another cdm
cdm2 <- CohortSurvival::mockMGUS2cdm()
attr(cdm2, "cdm_name") <- "cdm2"
singleEvent2 <- CohortSurvival::estimateSingleEventSurvival(cdm2,
                                                           targetCohortTable = "mgus_diagnosis",
                                                           targetCohortId = 1,
                                                           outcomeCohortTable = "death_cohort",
                                                           outcomeCohortId = 1,
                                                           strata = list(c("age_group")))
#> - Getting survival for target cohort 'mgus_diagnosis' and outcome cohort
#> 'death_cohort'
#> Getting overall estimates

omopgenerics::settings(singleEvent2)
#> # A tibble: 1 × 5
#>   result_id result_type package_name   package_version analysis_type
#>       <int> <chr>       <chr>          <chr>           <chr>        
#> 1         1 survival    CohortSurvival 0.5.1           single_event

omopgenerics::exportSummarisedResult(singleEvent2, fileName = "results_cdm2.csv")

# note that we need to increase guess_max so that settings are properly restored

resultsData1 <- readr::read_csv("results_cdm1.csv", show_col_types = FALSE, guess_max = 1e6) |> 
  omopgenerics::newSummarisedResult()
#> ! The following column type were changed:
#> • result_id: from double to integer

resultsData2 <- readr::read_csv("results_cdm2.csv", show_col_types = FALSE, guess_max = 1e6) |> 
  omopgenerics::newSummarisedResult()
#> ! The following column type were changed:
#> • result_id: from double to integer

resultsDataCombined <- omopgenerics::bind(resultsData1, resultsData2)

# both settings are kept and result id is renumbered
omopgenerics::settings(resultsDataCombined)
#> # A tibble: 2 × 6
#>   result_id result_type package_name   package_version analysis_type
#>       <int> <chr>       <chr>          <chr>           <chr>        
#> 1         1 survival    CohortSurvival 0.5.1           single_event 
#> 2         2 survival    CohortSurvival 0.5.1           single_event 
#> # ℹ 1 more variable: min_cell_count <int>

dplyr::count(resultsDataCombined, result_id, cdm_name)
#> # A tibble: 2 × 3
#>   result_id cdm_name     n
#>       <int> <chr>    <int>
#> 1         1 cdm1      3637
#> 2         2 cdm2      3637

# suppose Cesar would like to get settings as columns.
resultsDataCombinedWithSettingsAsColumns <- resultsDataCombined |> 
  dplyr::left_join(omopgenerics::settings(resultsDataCombined), by = "result_id")

# is this still a valid SummarizedResults object and will it work with other methods?

class(resultsDataCombinedWithSettingsAsColumns)
#> [1] "summarised_result" "omop_result"       "tbl_df"           
#> [4] "tbl"               "data.frame"

CohortSurvival::tableSurvival(resultsDataCombinedWithSettingsAsColumns)
CDM name Cohort Age group Outcome name Number records Number events Median survival (95% CI) Restricted mean survival (SE)
cdm1 Mgus diagnosis Overall Death cohort 1,384 963 98.00 (92.00, 103.00) 133.00 (4.00)
<70 Death cohort 574 293 180.00 (158.00, 206.00) 197.00 (8.00)
>=70 Death cohort 810 670 71.00 (66.00, 77.00) 86.00 (3.00)
cdm2 Mgus diagnosis Overall Death cohort 1,384 963 98.00 (92.00, 103.00) 133.00 (4.00)
<70 Death cohort 574 293 180.00 (158.00, 206.00) 197.00 (8.00)
>=70 Death cohort 810 670 71.00 (66.00, 77.00) 86.00 (3.00)

Created on 2024-06-10 with reprex v2.1.0

It seems like this approach does work. A couple of things could be improved: newSummarisedResult could throw an error if the settings are missing, and the package could possibly provide an importSummarizeResult function and an isSummarizedResult function that would run validation checks.
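
The validation idea could look roughly like the following. This is only a sketch: `checkSummarisedResultColumns()` is a hypothetical helper, not an omopgenerics function, and the column list is copied from the tibbles printed earlier in this thread rather than from the package source.

```r
# Column names as they appear in the results printed earlier in this thread.
expected_columns <- c(
  "result_id", "cdm_name", "group_name", "group_level",
  "strata_name", "strata_level", "variable_name", "variable_level",
  "estimate_name", "estimate_type", "estimate_value",
  "additional_name", "additional_level"
)

# Hypothetical helper: stop with an informative message when columns are
# missing or when estimate_value is not character.
checkSummarisedResultColumns <- function(x) {
  missing_columns <- setdiff(expected_columns, names(x))
  if (length(missing_columns) > 0) {
    stop("Missing columns: ", paste(missing_columns, collapse = ", "))
  }
  if (!is.character(x$estimate_value)) {
    stop("estimate_value must be character; re-read the file with all columns as text")
  }
  invisible(x)
}
```

A check like this, run right after reading the CSV, would turn the silent loss of settings into an explicit error.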

cebarboza commented 2 months ago

Unfortunately this approach works with CohortSurvival::tableSurvival, but not with CohortCharacteristics::tableLargeScaleCharacteristics. Any suggestion?

packageVersion("omopgenerics")
#> [1] '0.2.1'

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.3.3
library(omopgenerics)
#> Warning: package 'omopgenerics' was built under R version 4.3.3
#> 
#> Attaching package: 'omopgenerics'
#> The following objects are masked from 'package:CDMConnector':
#> 
#>     cdmName, recordCohortAttrition, uniqueTableName
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(CohortCharacteristics)
#> Warning: package 'CohortCharacteristics' was built under R version 4.3.3

con <- DBI::dbConnect(duckdb::duckdb(),
                      dbdir = CDMConnector::eunomia_dir()
)

# Result 1

cdm1 <- CDMConnector::cdm_from_con(con,
                                  cdm_schema = "main",
                                  write_schema = "main"
)

cdm1 <- generateConceptCohortSet(
  cdm = cdm1,
  name = "ankle_sprain",
  conceptSet = list("ankle_sprain" = 81151),
  end = "event_end_date",
  limit = "first",
  overwrite = TRUE
)

lsc1 <- cdm1$ankle_sprain |>
  summariseLargeScaleCharacteristics(
    window = list(c(-Inf, -1), c(0, 0)),
    eventInWindow = c(
      "condition_occurrence",
      "procedure_occurrence"
    ),
    episodeInWindow = "drug_exposure",
    minimumFrequency = 0.1
  )
#> ℹ Summarising large scale characteristics
#> - getting characteristics from table condition_occurrence (1 of 3) - getting
#> characteristics from table procedure_occurrence (2 of 3) - getting
#> characteristics from table drug_exposure (3 of 3) 190 estimates dropped as
#> frequency less than 10%

omopgenerics::exportSummarisedResult(lsc1, fileName = "results_cdm1.csv")

# Result 2

cdm2 <- CDMConnector::cdm_from_con(con,
                                   cdm_schema = "main",
                                   write_schema = "main"
)

cdm2 <- generateConceptCohortSet(
  cdm = cdm2,
  name = "ankle_sprain",
  conceptSet = list("ankle_sprain" = 81151),
  end = "event_end_date",
  limit = "first",
  overwrite = TRUE
)

lsc2 <- cdm2$ankle_sprain |>
  summariseLargeScaleCharacteristics(
    window = list(c(-Inf, -1), c(0, 0)),
    eventInWindow = c(
      "condition_occurrence",
      "procedure_occurrence"
    ),
    episodeInWindow = "drug_exposure",
    minimumFrequency = 0.1
  )
#> ℹ Summarising large scale characteristics
#> - getting characteristics from table condition_occurrence (1 of 3) - getting
#> characteristics from table procedure_occurrence (2 of 3) - getting
#> characteristics from table drug_exposure (3 of 3) 190 estimates dropped as
#> frequency less than 10%

omopgenerics::exportSummarisedResult(lsc2, fileName = "results_cdm2.csv")

resultsData1 <- readr::read_csv("results_cdm1.csv", show_col_types = FALSE, guess_max = 1e6) |> 
  omopgenerics::newSummarisedResult()
#> ! The following column type were changed:
#> • result_id: from double to integer

resultsData2 <- readr::read_csv("results_cdm2.csv", show_col_types = FALSE, guess_max = 1e6) |> 
  omopgenerics::newSummarisedResult()
#> ! The following column type were changed:
#> • result_id: from double to integer

resultsDataCombined <- omopgenerics::bind(resultsData1, resultsData2)

omopgenerics::settings(resultsDataCombined)
#> # A tibble: 6 × 8
#>   result_id table_name   type  analysis result_type package_name package_version
#>       <int> <chr>        <chr> <chr>    <chr>       <chr>        <chr>          
#> 1         1 condition_o… event standard summarised… CohortChara… 0.2.1          
#> 2         2 procedure_o… event standard summarised… CohortChara… 0.2.1          
#> 3         3 drug_exposu… epis… standard summarised… CohortChara… 0.2.1          
#> 4         4 condition_o… event standard summarised… CohortChara… 0.2.1          
#> 5         5 procedure_o… event standard summarised… CohortChara… 0.2.1          
#> 6         6 drug_exposu… epis… standard summarised… CohortChara… 0.2.1          
#> # ℹ 1 more variable: min_cell_count <int>

dplyr::count(resultsDataCombined, result_id, cdm_name)
#> # A tibble: 6 × 3
#>   result_id cdm_name                              n
#>       <int> <chr>                             <int>
#> 1         1 Synthea synthetic health database    26
#> 2         2 Synthea synthetic health database     8
#> 3         3 Synthea synthetic health database    36
#> 4         4 Synthea synthetic health database    26
#> 5         5 Synthea synthetic health database     8
#> 6         6 Synthea synthetic health database    36

resultsDataCombinedWithSettingsAsColumns <- resultsDataCombined |> 
  dplyr::left_join(omopgenerics::settings(resultsDataCombined), by = "result_id")

class(resultsDataCombinedWithSettingsAsColumns)
#> [1] "summarised_result" "omop_result"       "tbl_df"           
#> [4] "tbl"               "data.frame"

CohortCharacteristics::tableLargeScaleCharacteristics(resultsDataCombinedWithSettingsAsColumns)
#> Error in `dplyr::mutate()`:
#> ℹ In argument: `estimate_value = dplyr::if_else(...)`.
#> Caused by error in `.data$min_cell_count`:
#> ! Column `min_cell_count` not found in `.data`.

Created on 2024-06-11 with reprex v2.1.0

> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `dplyr::mutate()`:
ℹ In argument: `estimate_value = dplyr::if_else(...)`.
Caused by error in `.data$min_cell_count`:
! Column `min_cell_count` not found in `.data`.
---
Backtrace:
     ▆
  1. ├─CohortCharacteristics::tableLargeScaleCharacteristics(resultsDataCombinedWithSettingsAsColumns)
  2. │ ├─dplyr::select(...)
  3. │ ├─dplyr::mutate(...)
  4. │ └─dplyr:::mutate.data.frame(...)
  5. │   └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
  8. │       └─mask$eval_all_mutate(quo)
  9. │         └─dplyr (local) eval()
 10. ├─dplyr::if_else(...)
 11. ├─base::paste0("<", .data$min_cell_count)
 12. ├─min_cell_count
 13. ├─rlang:::`$.rlang_data_pronoun`(.data, min_cell_count)
 14. │ └─rlang:::data_pronoun_get(...)
 15. └─rlang:::abort_data_pronoun(x, call = y)

cebarboza commented 2 months ago

Hey guys, what would be the point of moving table_name, type, analysis, and result_type out of the data frame and into settings()? I am just trying to figure out whether we have to go through all this to filter the data for a Shiny app. In summarise_characteristics, you only have this metadata:

> settings(summarised_characteristics)
# A tibble: 5 × 5
  result_id package_name          package_version result_type                min_cell_count
      <int> <chr>                 <chr>           <chr>                               <int>
1         1 CohortCharacteristics 0.2.1           summarised_characteristics              5
2         2 CohortCharacteristics 0.2.1           summarised_characteristics              5
3         3 CohortCharacteristics 0.2.1           summarised_characteristics              5
4         4 CohortCharacteristics 0.2.1           summarised_characteristics              5
5         5 CohortCharacteristics 0.2.1           summarised_characteristics              5
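
For the Shiny filtering use case, one option is to look up the matching result_id values in the settings table and filter the results by them, without joining the settings into the data. This is a minimal sketch on made-up data frames: with a real object the settings table would come from omopgenerics::settings(), and `filterBySetting()` is a hypothetical helper.

```r
# Made-up stand-ins for a summarised result and its settings table.
results <- data.frame(
  result_id = c(1L, 1L, 2L, 3L),
  estimate_value = c("10", "12", "7", "3"),
  stringsAsFactors = FALSE
)
result_settings <- data.frame(
  result_id = 1:3,
  table_name = c("condition_occurrence", "procedure_occurrence", "drug_exposure"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the result rows whose result_id has the
# requested value in the given settings column.
filterBySetting <- function(results, result_settings, column, value) {
  ids <- result_settings$result_id[result_settings[[column]] == value]
  results[results$result_id %in% ids, , drop = FALSE]
}

drug_rows <- filterBySetting(results, result_settings, "table_name", "drug_exposure")
condition_rows <- filterBySetting(results, result_settings, "table_name", "condition_occurrence")
```

Filtering through result_id leaves the result columns untouched, so the filtered rows keep the same shape as the original table.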