Closed ablack3 closed 1 week ago
Hi @ablack3, I don't understand why you need to modify result_id manually. If you do, the settings and the results objects no longer correspond.
Here is a reprex that I would expect to work. It does not give an error, but the summarizedResultCombined object contains only 2 rows when I would expect it to contain 4 rows (two for each cohort). Do you see the problem in the reprex below?
I was trying to work around this issue by manually modifying the result_id, but yes, I shouldn't do this because it will break the object structure. A warning when the object is modified and the class is dropped seems like a good idea to me. But I also think the export/import functions aren't working correctly, because in some cases rows of the summarized results are being dropped.
``` r
remotes::install_github("darwin-eu-dev/omopgenerics")
#> Skipping install of 'omopgenerics' from a github remote, the SHA1 (b9ac7f48) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(CDMConnector)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdmFromCon(con, "main", "main")
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid

tempFolder <- tempfile()
dir.create(tempFolder)

cdm <- generateConceptCohortSet(cdm, conceptSet = list(acetaminophen = 1125315), name = "cohort1")
#> Warning: ! 3 casted column in cohort1 (cohort_attrition) as do not match expected column
#> type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort1 (cohort_codelist) as do not match expected column
#> type:
#> • `concept_id` from numeric to integer
cdm <- generateConceptCohortSet(cdm, conceptSet = list(aspirin = 1112807), name = "cohort2")
#> Warning: ! 3 casted column in cohort2 (cohort_attrition) as do not match expected column
#> type:
#> • `reason_id` from numeric to integer
#> • `excluded_records` from numeric to integer
#> • `excluded_subjects` from numeric to integer
#> Warning: ! 1 casted column in cohort2 (cohort_codelist) as do not match expected column
#> type:
#> • `concept_id` from numeric to integer

(summarizedResult1 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort1))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
(summarizedResult2 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort2))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

omopgenerics::exportSummarisedResult(summarizedResult1, summarizedResult2, path = tempFolder)
#> ! 2 duplicated rows eliminated.
summarizedResultCombined <- omopgenerics::importSummarisedResult(tempFolder)
#> Reading
#> /var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T//Rtmpx4xRwF/filee00d4f9928cf/results_Synthea
#> synthetic health database_2024_09_19.csv

summarizedResult1
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
summarizedResult2
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

# shouldn't this combined results object have 4 rows?
summarizedResultCombined
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

cdmDisconnect(cdm)
```
Created on 2024-09-19 with reprex v2.1.1
As far as I understand, I should be able to pass multiple summarised_result objects into the export function and then import them without losing any rows.
I think the reason is that the results are identical so they are considered to be duplicates. But in reality these are results for two different cohorts so we want to keep all 4 rows. The result_id should differentiate the two sets of results.
We should have a general approach for renumbering ids when results are combined. We have the same situation when we combine two cohort tables, since there may be duplicate cohort definition ids. However, if we change the ids we have to be careful, because other code or files may depend on them.
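To make the renumbering idea concrete, here is a minimal base R sketch. It uses plain data frames as stand-ins for summarised_result objects, and `renumberResultIds()` is a hypothetical helper, not an omopgenerics function. It offsets result_id per table so ids are unique after combining, and returns a mapping so that code depending on the old ids can be updated. The variable_name values are made up for illustration.

```r
# Hypothetical helper (not part of omopgenerics): make result_id unique
# across a list of result tables and record the old -> new mapping.
renumberResultIds <- function(results) {
  offset <- 0L
  mapping <- vector("list", length(results))
  for (i in seq_along(results)) {
    oldIds <- sort(unique(results[[i]]$result_id))
    newIds <- offset + seq_along(oldIds)
    mapping[[i]] <- data.frame(
      source = i,
      old_result_id = oldIds,
      new_result_id = newIds
    )
    # Replace each old id with its new, globally unique id
    results[[i]]$result_id <- newIds[match(results[[i]]$result_id, oldIds)]
    offset <- offset + length(oldIds)
  }
  list(results = results, mapping = do.call(rbind, mapping))
}

# Two identical stand-in tables, as in the empty-cohort case
r1 <- data.frame(
  result_id = c(1L, 1L),
  variable_name = c("Number records", "Number subjects"),
  estimate_value = c("0", "0"),
  stringsAsFactors = FALSE
)
r2 <- r1

out <- renumberResultIds(list(r1, r2))
# Row-wise deduplication no longer collapses the two sets of results
combined <- unique(do.call(rbind, out$results))
nrow(combined)
out$mapping
```

The mapping table is the part that addresses the concern about downstream dependencies: anything that referenced the old ids can be translated through it rather than silently breaking.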
We have the same issue with bind
``` r
library(CDMConnector)

con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdmFromCon(con, "main", "main")

tempFolder <- tempfile()
dir.create(tempFolder)

cdm <- generateConceptCohortSet(cdm, conceptSet = list(acetaminophen = 1125315), name = "cohort1")
cdm <- generateConceptCohortSet(cdm, conceptSet = list(aspirin = 1112807), name = "cohort2")

(summarizedResult1 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort1))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
(summarizedResult2 <- CohortCharacteristics::summariseCharacteristics(cdm$cohort2))
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

omopgenerics::bind(summarizedResult1, summarizedResult2)
#> ! 2 duplicated rows eliminated.
#> # A tibble: 2 × 13
#>   result_id cdm_name             group_name group_level strata_name strata_level
#>       <int> <chr>                <chr>      <chr>       <chr>       <chr>
#> 1         1 Synthea synthetic h… overall    overall     overall     overall
#> 2         1 Synthea synthetic h… overall    overall     overall     overall
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>

cdmDisconnect(cdm)
```
Created on 2024-09-19 with reprex v2.1.1
2 duplicate rows were eliminated but these are actually results from two different cohorts so we should keep them.
I see that as an edge case, the problem is that they are empty cohorts... I would suggest that CohortCharacteristics should include:
I don't see a wrong behaviour of the bind function here as:
identical(summarizedResult1, summarizedResult2)
#> [1] TRUE
For me the problem is that they shouldn't be identical, so I would suggest to transfer this issue to CohortCharacteristics
Yes, I agree. `identical(summarizedResult1, summarizedResult2)` should return FALSE because they are results from different cohorts.
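As a rough illustration of this point, the sketch below uses plain data frames as stand-ins for summarised_result objects (an assumption, the real class carries settings and validation). It contrasts the situation in the reprex, where both tables say "overall", with a hypothetical version where group_level carries the cohort name. `makeResult()` and the variable_name values are made up for the example.

```r
# Stand-in constructor (hypothetical): two rows per cohort, same estimates
makeResult <- function(groupName, groupLevel) {
  data.frame(
    result_id = c(1L, 1L),
    group_name = groupName,
    group_level = groupLevel,
    variable_name = c("Number records", "Number subjects"),
    estimate_value = c("0", "0"),
    stringsAsFactors = FALSE
  )
}

# As in the reprex: nothing in the rows distinguishes the two cohorts
overall1 <- makeResult("overall", "overall")
overall2 <- makeResult("overall", "overall")
identicalOverall <- identical(overall1, overall2)
rowsOverall <- nrow(unique(rbind(overall1, overall2)))

# If the cohort name were recorded, the tables would differ
named1 <- makeResult("cohort_name", "acetaminophen")
named2 <- makeResult("cohort_name", "aspirin")
identicalNamed <- identical(named1, named2)
rowsNamed <- nrow(unique(rbind(named1, named2)))

c(identicalOverall = identicalOverall, identicalNamed = identicalNamed)
c(rowsOverall = rowsOverall, rowsNamed = rowsNamed)
```

With the cohort name present in the rows, row-wise deduplication keeps all four rows without any change to the binding logic, which is why fixing this on the CohortCharacteristics side looks sufficient.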
I opened an issue here https://github.com/darwin-eu-dev/CohortCharacteristics/issues/170
I have multiple summarizedResult objects that I would like to export to a single csv file using exportSummarisedResult from omopgenerics. The issue is that these results have duplicate result_ids and in some cases duplicate rows of data in the result objects.
When I modify the result_id column of the summarizedResult object everything seems ok, but the exportSummarisedResult/importSummarisedResult functions don't seem to work properly with these modified summarized result objects.
For now I think I can just export each summarized result unmodified as separate csv files.
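The separate-files workaround could look roughly like the sketch below. It uses base R `write.csv()`/`read.csv()` on plain data frames as stand-ins, so it only illustrates the file layout and does not reproduce what the omopgenerics export/import functions do. The cohort names, file names and variable_name values are assumptions.

```r
# Stand-in result tables (hypothetical contents), one per cohort
makeStandIn <- function() {
  data.frame(
    result_id = c(1L, 1L),
    variable_name = c("Number records", "Number subjects"),
    estimate_value = c("0", "0"),
    stringsAsFactors = FALSE
  )
}
results <- list(cohort1 = makeStandIn(), cohort2 = makeStandIn())

outDir <- tempfile("results_")
dir.create(outDir)

# One csv per result, unmodified, so no rows can be merged away
for (nm in names(results)) {
  write.csv(results[[nm]], file.path(outDir, paste0(nm, ".csv")), row.names = FALSE)
}

# Read the files back individually; the file name identifies the cohort
files <- list.files(outDir, pattern = "\\.csv$", full.names = TRUE)
readBack <- lapply(files, read.csv, colClasses = "character")
names(readBack) <- sub("\\.csv$", "", basename(files))
totalRows <- sum(vapply(readBack, nrow, integer(1)))
totalRows
```

Keeping the results in a named list rather than binding them avoids the deduplication step entirely, at the cost of tracking the cohort through the file name.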
Here is a reprex showing some of the unexpected behavior. My expectation would be that I could modify the result_id and still use the export/import functions.
Created on 2024-09-12 with reprex v2.1.1
Session info
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Amsterdam
#>  date     2024-09-12
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package               * version   date (UTC) lib source
#>  backports               1.5.0     2024-05-23 [1] CRAN (R 4.3.3)
#>  bit                     4.0.5     2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64                   4.0.5     2020-08-30 [1] CRAN (R 4.3.0)
#>  blob                    1.2.4     2023-03-17 [1] CRAN (R 4.3.0)
#>  callr                   3.7.6     2024-03-25 [1] CRAN (R 4.3.1)
#>  CDMConnector          * 1.4.0     2024-05-03 [1] CRAN (R 4.3.1)
#>  checkmate               2.3.2     2024-07-29 [1] CRAN (R 4.3.3)
#>  cli                     3.6.3     2024-06-21 [1] CRAN (R 4.3.3)
#>  CohortCharacteristics   0.2.2.900 2024-09-12 [1] Github (darwin-eu-dev/CohortCharacteristics@1a74c6a)
#>  crayon                  1.5.3     2024-06-20 [1] CRAN (R 4.3.3)
#>  curl                    5.2.2     2024-08-26 [1] CRAN (R 4.3.3)
#>  DBI                     1.2.3     2024-06-02 [1] CRAN (R 4.3.3)
#>  dbplyr                  2.5.0     2024-03-19 [1] CRAN (R 4.3.1)
#>  desc                    1.4.3     2023-12-10 [1] CRAN (R 4.3.1)
#>  digest                  0.6.37    2024-08-19 [1] CRAN (R 4.3.3)
#>  dplyr                 * 1.1.4     2023-11-17 [1] CRAN (R 4.3.1)
#>  duckdb                  1.0.0-2   2024-07-19 [1] CRAN (R 4.3.3)
#>  evaluate                0.24.0    2024-06-10 [1] CRAN (R 4.3.3)
#>  fansi                   1.0.6     2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap                 1.2.0     2024-05-15 [1] CRAN (R 4.3.3)
#>  fs                      1.6.4     2024-04-25 [1] CRAN (R 4.3.1)
#>  generics                0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>  glue                    1.7.0     2024-01-09 [1] CRAN (R 4.3.1)
#>  hms                     1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools               0.5.8.1   2024-04-04 [1] CRAN (R 4.3.1)
#>  jsonlite                1.8.8     2023-12-04 [1] CRAN (R 4.3.1)
#>  knitr                   1.48      2024-07-07 [1] CRAN (R 4.3.3)
#>  lifecycle               1.0.4     2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr                2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>  omopgenerics            0.3.0     2024-09-12 [1] Github (darwin-eu-dev/omopgenerics@6d22b15)
#>  PatientProfiles         1.1.0     2024-06-11 [1] CRAN (R 4.3.3)
#>  pillar                  1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgbuild                1.4.4     2024-03-17 [1] CRAN (R 4.3.1)
#>  pkgconfig               2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>  processx                3.8.4     2024-03-16 [1] CRAN (R 4.3.1)
#>  ps                      1.7.7     2024-07-02 [1] CRAN (R 4.3.3)
#>  purrr                   1.0.2     2023-08-10 [1] CRAN (R 4.3.0)
#>  R6                      2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>  readr                   2.1.5     2024-01-10 [1] CRAN (R 4.3.1)
#>  remotes                 2.5.0     2024-03-17 [1] CRAN (R 4.3.1)
#>  reprex                  2.1.1     2024-07-06 [1] CRAN (R 4.3.3)
#>  rlang                   1.1.4     2024-06-04 [1] CRAN (R 4.3.3)
#>  rmarkdown               2.28      2024-08-17 [1] CRAN (R 4.3.3)
#>  rstudioapi              0.16.0    2024-03-24 [1] CRAN (R 4.3.1)
#>  sessioninfo             1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>  snakecase               0.11.1    2023-08-27 [1] CRAN (R 4.3.0)
#>  stringi                 1.8.4     2024-05-06 [1] CRAN (R 4.3.1)
#>  stringr                 1.5.1     2023-11-14 [1] CRAN (R 4.3.1)
#>  tibble                  3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr                   1.3.1     2024-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect              1.2.1     2024-03-11 [1] CRAN (R 4.3.1)
#>  tzdb                    0.4.0     2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8                    1.2.4     2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs                   0.6.5     2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom                   1.6.5     2023-12-05 [1] CRAN (R 4.3.1)
#>  withr                   3.0.1     2024-07-31 [1] CRAN (R 4.3.3)
#>  xfun                    0.47      2024-08-17 [1] CRAN (R 4.3.3)
#>  yaml                    2.3.10    2024-07-26 [1] CRAN (R 4.3.3)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```