darwin-eu-dev / omopgenerics

https://darwin-eu-dev.github.io/omopgenerics/
Apache License 2.0

exportSummarizedResult/importSummarizedResult and bind remove duplicate rows of results even though the results are for two different cohorts #502

Closed ablack3 closed 1 week ago

ablack3 commented 2 weeks ago

I have multiple summarizedResult objects that I would like to export to a single csv file using exportSummarisedResult from omopgenerics. The issue is that these results have duplicate result_ids and, in some cases, duplicate rows of data in the result objects.

When I modify the result_id column of the summarizedResult object everything seems ok but export/importSummarizedResult functions don't seem to work properly with these modified summarized result objects.

For now I think I can just export each summarized result unmodified as separate csv files.

Here is a reprex showing some of the unexpected behavior. My expectation would be that I could modify the result_id and still use the export/import functions.


library(CDMConnector)
library(dplyr)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, "main", "main")
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid

tempFolder <- tempfile()
dir.create(tempFolder)

# cdm$drug_exposure %>% 
#   count(drug_concept_id, sort = T)
# 
# cdm$concept %>% filter(concept_id == 1112807)

# suppose I have two different summarized result objects
cdm <- generateConceptCohortSet(cdm, conceptSet = list(acetaminophen = 1125315), name = "cohort1")
#> Warning: ! 3 casted column in cohort1 (cohort_attrition) as do not match expected column
#>   type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort1 (cohort_codelist) as do not match expected column
#>   type:
#> • `concept_id` from numeric to integer
cdm <- generateConceptCohortSet(cdm, conceptSet = list(aspirin = 1112807), name = "cohort2")
#> Warning: ! 3 casted column in cohort2 (cohort_attrition) as do not match expected column
#>   type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort2 (cohort_codelist) as do not match expected column
#>   type:
#> • `concept_id` from numeric to integer

(summarizedResult1 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort1))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
(summarizedResult2 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort2))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# I should be able to combine them during export as far as I understand
omopgenerics::exportSummarisedResult(summarizedResult1, summarizedResult2, path = tempFolder)

# I can see that the output only contains two rows even though I saved two different summarized result each with two rows
(files <- list.files(tempFolder, full.names = T))
#> [1] "/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//RtmpTxPDmY/file1331a566b3de5/results_Synthea synthetic health database_2024_09_12.csv"
readr::read_csv(files)
#> Rows: 6 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (11): cdm_name, group_name, group_level, strata_name, strata_level, vari...
#> dbl  (1): result_id
#> lgl  (1): variable_level
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <dbl> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> 3         1 <NA>                 overall    overall     overall     overall     
#> 4         1 <NA>                 overall    overall     overall     overall     
#> 5         1 <NA>                 overall    overall     overall     overall     
#> 6         1 <NA>                 overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <lgl>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# When I import the results I also see two rows where there should be four
omopgenerics::importSummarisedResult(path = tempFolder)
#> Reading
#> /var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//RtmpTxPDmY/file1331a566b3de5/results_Synthea
#> synthetic health database_2024_09_12.csv
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# I think this is because the results are exactly the same. 
# really I need to set different result ids for these results so let's try that.

(summarizedResult2fix <- mutate(summarizedResult2, result_id = result_id + 100))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <dbl> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1       101 Synthea synthetic h… overall    overall     overall     overall     
#> 2       101 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# now the two results have different result_ids
# the class is the same
class(summarizedResult2)
#> [1] "summarised_result" "omop_result"       "tbl_df"           
#> [4] "tbl"               "data.frame"
class(summarizedResult2fix)
#> [1] "summarised_result" "omop_result"       "tbl_df"           
#> [4] "tbl"               "data.frame"

# clean the temp folder
lapply(list.files(tempFolder, full.names = T), file.remove)
#> [[1]]
#> [1] TRUE

# export the result
omopgenerics::exportSummarisedResult(summarizedResult2fix, path = tempFolder)

# If I look at the csv I do see the file but the cdm_name is missing
(files <- list.files(tempFolder, full.names = T))
#> [1] "/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//RtmpTxPDmY/file1331a566b3de5/results__2024_09_12.csv"
readr::read_csv(files)
#> Rows: 4 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (10): group_name, group_level, strata_name, strata_level, variable_name,...
#> dbl  (1): result_id
#> lgl  (2): cdm_name, variable_level
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 4 × 13
#>   result_id cdm_name group_name group_level strata_name strata_level
#>       <dbl> <lgl>    <chr>      <chr>       <chr>       <chr>       
#> 1         1 NA       overall    overall     overall     overall     
#> 2         1 NA       overall    overall     overall     overall     
#> 3         1 NA       overall    overall     overall     overall     
#> 4         1 NA       overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <lgl>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# when I import the summarized result no rows 
omopgenerics::importSummarisedResult(path = tempFolder)
#> Reading
#> /var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//RtmpTxPDmY/file1331a566b3de5/results__2024_09_12.csv
#> # A tibble: 0 × 13
#> # ℹ 13 variables: result_id <int>, cdm_name <chr>, group_name <chr>,
#> #   group_level <chr>, strata_name <chr>, strata_level <chr>,
#> #   variable_name <chr>, variable_level <chr>, estimate_name <chr>,
#> #   estimate_type <chr>, estimate_value <chr>, additional_name <chr>,
#> #   additional_level <chr>

# It looks to me like if I use mutate on a summarize result object that export is not working the way it should

DBI::dbDisconnect(con, shutdown = T)

Created on 2024-09-12 with reprex v2.1.1

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2024-09-12
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package               * version   date (UTC) lib source
#>  backports               1.5.0     2024-05-23 [1] CRAN (R 4.3.3)
#>  bit                     4.0.5     2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64                   4.0.5     2020-08-30 [1] CRAN (R 4.3.0)
#>  blob                    1.2.4     2023-03-17 [1] CRAN (R 4.3.0)
#>  callr                   3.7.6     2024-03-25 [1] CRAN (R 4.3.1)
#>  CDMConnector          * 1.4.0     2024-05-03 [1] CRAN (R 4.3.1)
#>  checkmate               2.3.2     2024-07-29 [1] CRAN (R 4.3.3)
#>  cli                     3.6.3     2024-06-21 [1] CRAN (R 4.3.3)
#>  CohortCharacteristics   0.2.2.900 2024-09-12 [1] Github (darwin-eu-dev/CohortCharacteristics@1a74c6a)
#>  crayon                  1.5.3     2024-06-20 [1] CRAN (R 4.3.3)
#>  curl                    5.2.2     2024-08-26 [1] CRAN (R 4.3.3)
#>  DBI                     1.2.3     2024-06-02 [1] CRAN (R 4.3.3)
#>  dbplyr                  2.5.0     2024-03-19 [1] CRAN (R 4.3.1)
#>  desc                    1.4.3     2023-12-10 [1] CRAN (R 4.3.1)
#>  digest                  0.6.37    2024-08-19 [1] CRAN (R 4.3.3)
#>  dplyr                 * 1.1.4     2023-11-17 [1] CRAN (R 4.3.1)
#>  duckdb                  1.0.0-2   2024-07-19 [1] CRAN (R 4.3.3)
#>  evaluate                0.24.0    2024-06-10 [1] CRAN (R 4.3.3)
#>  fansi                   1.0.6     2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap                 1.2.0     2024-05-15 [1] CRAN (R 4.3.3)
#>  fs                      1.6.4     2024-04-25 [1] CRAN (R 4.3.1)
#>  generics                0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>  glue                    1.7.0     2024-01-09 [1] CRAN (R 4.3.1)
#>  hms                     1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools               0.5.8.1   2024-04-04 [1] CRAN (R 4.3.1)
#>  jsonlite                1.8.8     2023-12-04 [1] CRAN (R 4.3.1)
#>  knitr                   1.48      2024-07-07 [1] CRAN (R 4.3.3)
#>  lifecycle               1.0.4     2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr                2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>  omopgenerics            0.3.0     2024-09-12 [1] Github (darwin-eu-dev/omopgenerics@6d22b15)
#>  PatientProfiles         1.1.0     2024-06-11 [1] CRAN (R 4.3.3)
#>  pillar                  1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgbuild                1.4.4     2024-03-17 [1] CRAN (R 4.3.1)
#>  pkgconfig               2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>  processx                3.8.4     2024-03-16 [1] CRAN (R 4.3.1)
#>  ps                      1.7.7     2024-07-02 [1] CRAN (R 4.3.3)
#>  purrr                   1.0.2     2023-08-10 [1] CRAN (R 4.3.0)
#>  R6                      2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>  readr                   2.1.5     2024-01-10 [1] CRAN (R 4.3.1)
#>  remotes                 2.5.0     2024-03-17 [1] CRAN (R 4.3.1)
#>  reprex                  2.1.1     2024-07-06 [1] CRAN (R 4.3.3)
#>  rlang                   1.1.4     2024-06-04 [1] CRAN (R 4.3.3)
#>  rmarkdown               2.28      2024-08-17 [1] CRAN (R 4.3.3)
#>  rstudioapi              0.16.0    2024-03-24 [1] CRAN (R 4.3.1)
#>  sessioninfo             1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>  snakecase               0.11.1    2023-08-27 [1] CRAN (R 4.3.0)
#>  stringi                 1.8.4     2024-05-06 [1] CRAN (R 4.3.1)
#>  stringr                 1.5.1     2023-11-14 [1] CRAN (R 4.3.1)
#>  tibble                  3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr                   1.3.1     2024-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect              1.2.1     2024-03-11 [1] CRAN (R 4.3.1)
#>  tzdb                    0.4.0     2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8                    1.2.4     2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs                   0.6.5     2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom                   1.6.5     2023-12-05 [1] CRAN (R 4.3.1)
#>  withr                   3.0.1     2024-07-31 [1] CRAN (R 4.3.3)
#>  xfun                    0.47      2024-08-17 [1] CRAN (R 4.3.3)
#>  yaml                    2.3.10    2024-07-26 [1] CRAN (R 4.3.3)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
```

catalamarti commented 1 week ago

Hi @ablack3, I don't understand why you need to modify result_id manually. If you do, there is no longer a correspondence between the settings and the results, and the settings would need to be defined again. We could add a warning if result_id is modified and drop the class.
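
The link between result_id and the settings can be illustrated with plain data frames. This is only a sketch, not the omopgenerics implementation: `checkResultIds` is a hypothetical helper, and the two data frames are illustrative stand-ins for a result table and its settings table. It warns when a result row carries a result_id that has no matching settings row, which is the situation created by `result_id + 100` in the reprex above.

```r
# Illustrative stand-ins for a result table and its settings table.
settings <- data.frame(result_id = 1L, result_type = "summarise_characteristics")
result <- data.frame(
  result_id = c(1L, 1L),
  variable_name = c("Number records", "Number subjects"),
  estimate_value = c("0", "0")
)

# Hypothetical helper: warn when result rows reference ids missing from settings.
checkResultIds <- function(result, settings) {
  orphan <- setdiff(unique(result$result_id), settings$result_id)
  if (length(orphan) > 0) {
    warning("result_id not found in settings: ", paste(orphan, collapse = ", "))
  }
  invisible(orphan)
}

# Unmodified result: every result_id has a settings row, so nothing is reported.
okOrphans <- checkResultIds(result, settings)

# Shifting result_id by hand breaks the link to the settings table.
modified <- result
modified$result_id <- modified$result_id + 100
msg <- tryCatch(
  checkResultIds(modified, settings),
  warning = function(w) conditionMessage(w)
)
msg
```

A check of this kind only detects the mismatch; whether the class should also be dropped at that point is a separate design decision.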

ablack3 commented 1 week ago

Here is a reprex that I would expect to work. It does not give an error but the combinedSummarizedResult object contains only 2 rows when I would expect it to contain 4 rows (two for each cohort). Do you see the problem in the reprex below?

I was trying to work around this issue by manually modifying the result_id but yes, I shouldn't do this because it will break the object structure. A warning when the object is modified and the class is dropped seems like a good idea to me. But I also think the export/import functions aren't working correctly, because in some cases rows of the summarized results are being dropped.

remotes::install_github("darwin-eu-dev/omopgenerics")
#> Skipping install of 'omopgenerics' from a github remote, the SHA1 (b9ac7f48) has not changed since last install.
#>   Use `force = TRUE` to force installation

library(CDMConnector)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdmFromCon(con, "main", "main")
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid
tempFolder <- tempfile()
dir.create(tempFolder)
cdm <- generateConceptCohortSet(cdm, conceptSet = list(acetaminophen = 1125315), name = "cohort1")
#> Warning: ! 3 casted column in cohort1 (cohort_attrition) as do not match expected column
#>   type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort1 (cohort_codelist) as do not match expected column
#>   type:
#> • `concept_id` from numeric to integer
cdm <- generateConceptCohortSet(cdm, conceptSet = list(aspirin = 1112807), name = "cohort2")
#> Warning: ! 3 casted column in cohort2 (cohort_attrition) as do not match expected column
#>   type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort2 (cohort_codelist) as do not match expected column
#>   type:
#> • `concept_id` from numeric to integer
(summarizedResult1 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort1))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
(summarizedResult2 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort2))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
omopgenerics::exportSummarisedResult(summarizedResult1, summarizedResult2, path = tempFolder)
#> ! 2 duplicated rows eliminated.

summarizedResultCombined <- omopgenerics::importSummarisedResult(tempFolder)
#> Reading
#> /var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//Rtmpx4xRwF/filee00d4f9928cf/results_Synthea
#> synthetic health database_2024_09_19.csv

summarizedResult1
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

summarizedResult2
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# shouldn't this combined results object have 4 rows?
summarizedResultCombined
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

cdmDisconnect(cdm)

Created on 2024-09-19 with reprex v2.1.1

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2024-09-19
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package               * version   date (UTC) lib source
#>  backports               1.5.0     2024-05-23 [1] CRAN (R 4.3.3)
#>  bit                     4.0.5     2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64                   4.0.5     2020-08-30 [1] CRAN (R 4.3.0)
#>  blob                    1.2.4     2023-03-17 [1] CRAN (R 4.3.0)
#>  CDMConnector          * 1.4.0     2024-05-03 [1] CRAN (R 4.3.1)
#>  checkmate               2.3.2     2024-07-29 [1] CRAN (R 4.3.3)
#>  cli                     3.6.3     2024-06-21 [1] CRAN (R 4.3.3)
#>  CohortCharacteristics   0.2.2.900 2024-09-12 [1] Github (darwin-eu-dev/CohortCharacteristics@1a74c6a)
#>  crayon                  1.5.3     2024-06-20 [1] CRAN (R 4.3.3)
#>  curl                    5.2.2     2024-08-26 [1] CRAN (R 4.3.3)
#>  DBI                     1.2.3     2024-06-02 [1] CRAN (R 4.3.3)
#>  dbplyr                  2.5.0     2024-03-19 [1] CRAN (R 4.3.1)
#>  digest                  0.6.37    2024-08-19 [1] CRAN (R 4.3.3)
#>  dplyr                 * 1.1.4     2023-11-17 [1] CRAN (R 4.3.1)
#>  duckdb                  1.0.0-2   2024-07-19 [1] CRAN (R 4.3.3)
#>  evaluate                0.24.0    2024-06-10 [1] CRAN (R 4.3.3)
#>  fansi                   1.0.6     2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap                 1.2.0     2024-05-15 [1] CRAN (R 4.3.3)
#>  fs                      1.6.4     2024-04-25 [1] CRAN (R 4.3.1)
#>  generics                0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>  glue                    1.7.0     2024-01-09 [1] CRAN (R 4.3.1)
#>  hms                     1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools               0.5.8.1   2024-04-04 [1] CRAN (R 4.3.1)
#>  jsonlite                1.8.8     2023-12-04 [1] CRAN (R 4.3.1)
#>  knitr                   1.48      2024-07-07 [1] CRAN (R 4.3.3)
#>  lifecycle               1.0.4     2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr                2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>  omopgenerics            0.3.0.900 2024-09-19 [1] Github (darwin-eu-dev/omopgenerics@b9ac7f4)
#>  PatientProfiles         1.1.0     2024-06-11 [1] CRAN (R 4.3.3)
#>  pillar                  1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig               2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr                   1.0.2     2023-08-10 [1] CRAN (R 4.3.0)
#>  R6                      2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>  readr                   2.1.5     2024-01-10 [1] CRAN (R 4.3.1)
#>  remotes                 2.5.0     2024-03-17 [1] CRAN (R 4.3.1)
#>  reprex                  2.1.1     2024-07-06 [1] CRAN (R 4.3.3)
#>  rlang                   1.1.4     2024-06-04 [1] CRAN (R 4.3.3)
#>  rmarkdown               2.28      2024-08-17 [1] CRAN (R 4.3.3)
#>  rstudioapi              0.16.0    2024-03-24 [1] CRAN (R 4.3.1)
#>  sessioninfo             1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>  snakecase               0.11.1    2023-08-27 [1] CRAN (R 4.3.0)
#>  stringi                 1.8.4     2024-05-06 [1] CRAN (R 4.3.1)
#>  stringr                 1.5.1     2023-11-14 [1] CRAN (R 4.3.1)
#>  tibble                  3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr                   1.3.1     2024-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect              1.2.1     2024-03-11 [1] CRAN (R 4.3.1)
#>  tzdb                    0.4.0     2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8                    1.2.4     2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs                   0.6.5     2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom                   1.6.5     2023-12-05 [1] CRAN (R 4.3.1)
#>  withr                   3.0.1     2024-07-31 [1] CRAN (R 4.3.3)
#>  xfun                    0.47      2024-08-17 [1] CRAN (R 4.3.3)
#>  yaml                    2.3.10    2024-07-26 [1] CRAN (R 4.3.3)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
```


As far as I understand, I should be able to pass multiple summarizedResult objects into the export function and then import them without losing any rows.

I think the reason is that the results are identical so they are considered to be duplicates. But in reality these are results for two different cohorts so we want to keep all 4 rows. The result_id should differentiate the two sets of results.

We should have a general approach for renumbering ids when results are combined. We have the same situation when we combine two cohort tables, since there may be duplicate cohort definition ids. However, if we change the ids then we have to be careful, because there may be other code or files that depend on the ids.
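
One way such renumbering could look is sketched below with plain data frames. This is not how omopgenerics combines results: `bindWithNewIds` and `makeResult` are hypothetical helpers, and each input is assumed to be a list holding a `result` table and a `settings` table that share a result_id column. Each input gets its own block of new ids, so rows that were identical before combining stay distinct afterwards.

```r
# Hypothetical helper: bind result tables, giving each input its own result_id block.
bindWithNewIds <- function(results) {
  offset <- 0L
  outResult <- list()
  outSettings <- list()
  for (i in seq_along(results)) {
    res <- results[[i]]$result
    set <- results[[i]]$settings
    oldIds <- set$result_id
    newIds <- offset + seq_along(oldIds)
    # Map every old id to its new id in both tables so they stay linked.
    res$result_id <- newIds[match(res$result_id, oldIds)]
    set$result_id <- newIds
    outResult[[i]] <- res
    outSettings[[i]] <- set
    offset <- offset + length(oldIds)
  }
  list(
    result = do.call(rbind, outResult),
    settings = do.call(rbind, outSettings)
  )
}

# Illustrative input: two results with identical rows but different source cohorts.
makeResult <- function(cohort) {
  list(
    result = data.frame(
      result_id = c(1L, 1L),
      variable_name = c("Number records", "Number subjects"),
      estimate_value = c("0", "0")
    ),
    settings = data.frame(result_id = 1L, cohort_table = cohort)
  )
}

combined <- bindWithNewIds(list(makeResult("cohort1"), makeResult("cohort2")))
nrow(unique(combined$result))  # all 4 rows survive de-duplication
combined$settings
```

Because the ids change, anything that refers to the old ids would need the old-to-new mapping as well; returning that mapping alongside the combined tables would be one way to handle the dependency concern raised above.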

ablack3 commented 1 week ago

We have the same issue with bind

library(CDMConnector)
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdmFromCon(con, "main", "main")
tempFolder <- tempfile()
dir.create(tempFolder)
cdm <- generateConceptCohortSet(cdm, conceptSet = list(acetaminophen = 1125315), name = "cohort1")
cdm <- generateConceptCohortSet(cdm, conceptSet = list(aspirin = 1112807), name = "cohort2")
(summarizedResult1 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort1))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
(summarizedResult2 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort2))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
omopgenerics::bind(summarizedResult1, summarizedResult2)
#> ! 2 duplicated rows eliminated.
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>       
#> 1         1 Synthea synthetic h… overall    overall     overall     overall     
#> 2         1 Synthea synthetic h… overall    overall     overall     overall     
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
cdmDisconnect(cdm)

Created on 2024-09-19 with reprex v2.1.1

2 duplicate rows were eliminated but these are actually results from two different cohorts so we should keep them.

catalamarti commented 1 week ago

I see that as an edge case, the problem is that they are empty cohorts... I would suggest that CohortCharacteristics should include:

I don't see a wrong behaviour of the bind function here as:

identical(summarizedResult1, summarizedResult2)
#> [1] TRUE

For me the problem is that they shouldn't be identical, so I would suggest to transfer this issue to CohortCharacteristics
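
The effect can be reproduced without a database. The sketch below uses plain data frames with illustrative values, not real `summarised_result` objects, and base R `unique()` as a stand-in for whatever de-duplication bind performs internally. Labelling each table with its cohort through group_name and group_level is an assumption about what a fix could look like.

```r
# Two result tables that are identical row for row (values are illustrative).
result1 <- data.frame(
  result_id = c(1L, 1L),
  group_name = "overall",
  group_level = "overall",
  variable_name = c("Number records", "Number subjects"),
  estimate_value = c("0", "0")
)
result2 <- result1
wasIdentical <- identical(result1, result2)

# Stacking and de-duplicating keeps only one copy of each row.
nBefore <- nrow(unique(rbind(result1, result2)))  # 2

# If each table names its cohort, the rows differ and all of them are kept.
result1$group_name <- "cohort_name"
result1$group_level <- "cohort1"
result2$group_name <- "cohort_name"
result2$group_level <- "cohort2"
nAfter <- nrow(unique(rbind(result1, result2)))   # 4
```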

ablack3 commented 1 week ago

> For me the problem is that they shouldn't be identical, so I would suggest to transfer this issue to CohortCharacteristics

Yes I agree. identical(summarizedResult1, summarizedResult2) should return FALSE because they are results from different cohorts.

I opened an issue here https://github.com/darwin-eu-dev/CohortCharacteristics/issues/170