OHDSI / DataQualityDashboard

A tool to help improve data quality standards in observational data science.
https://ohdsi.github.io/DataQualityDashboard
Apache License 2.0

executeDqChecks fails if cdm_source table has more than one row and outputFile is left as default #475

Closed ablack3 closed 6 months ago

ablack3 commented 1 year ago

Here is a reproducible code example showing that:

  1. executeDqChecks runs successfully on Eunomia as is
  2. When the Eunomia cdm_source table has 2 rows (which can happen in real CDMs), executeDqChecks fails.
# remotes::install_github("ohdsi/Eunomia", "v1.0.2")
# remotes::install_github("ohdsi/DataQualityDashboard", force = TRUE)

connectionDetails <- Eunomia::getEunomiaConnectionDetails()

# demonstrate that this works as is
result <- DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  resultsDatabaseSchema = "main",
  cdmSourceName = "GiBleed_Eunomia",
  checkLevels = "TABLE",
  checkNames = c("cdmTable", "cdmField"),
  outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> 
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161032.json
#> Execution Complete
#> Connecting using SQLite driver
#> Writing results to table main.dqdashboard_results
#>   |======================================================================| 100%
#> Executing SQL took 0.00458 secs
#> Warning: Column 'warning' is of type 'logical', but this is not supported by
#> many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'error' is of type 'logical', but this is not supported by many
#> DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'threshold_value' is of type 'logical', but this is not
#> supported by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'notes_value' is of type 'logical', but this is not supported
#> by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Inserting data took 0.0185 secs
#> Finished writing table

conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
library(DatabaseConnector)
getTableNames(conn)
#>  [1] "care_site"                "cdm_source"              
#>  [3] "cohort"                   "cohort_attribute"        
#>  [5] "concept"                  "concept_ancestor"        
#>  [7] "concept_class"            "concept_relationship"    
#>  [9] "concept_synonym"          "condition_era"           
#> [11] "condition_occurrence"     "cost"                    
#> [13] "death"                    "device_exposure"         
#> [15] "domain"                   "dose_era"                
#> [17] "drug_era"                 "drug_exposure"           
#> [19] "drug_strength"            "fact_relationship"       
#> [21] "location"                 "measurement"             
#> [23] "metadata"                 "note"                    
#> [25] "note_nlp"                 "observation"             
#> [27] "observation_period"       "payer_plan_period"       
#> [29] "person"                   "procedure_occurrence"    
#> [31] "provider"                 "relationship"            
#> [33] "source_to_concept_map"    "specimen"                
#> [35] "visit_detail"             "visit_occurrence"        
#> [37] "vocabulary"               "dqdashboard_results"     
#> [39] "main.dqdashboard_results"
cdm_source <- DatabaseConnector::querySql(conn, "select * from cdm_source")

# duplicate the rows
cdm_source <- rbind(cdm_source, cdm_source)

# now cdm_source has two rows 
tibble::tibble(cdm_source)
#> # A tibble: 2 × 10
#>   CDM_SOURCE_NAME           CDM_SOURCE_ABBREVIAT…¹ CDM_HOLDER SOURCE_DESCRIPTION
#>   <chr>                     <chr>                  <chr>      <chr>             
#> 1 Synthea synthetic health… Synthea                OHDSI Com… SyntheaTM is a Sy…
#> 2 Synthea synthetic health… Synthea                OHDSI Com… SyntheaTM is a Sy…
#> # ℹ abbreviated name: ¹​CDM_SOURCE_ABBREVIATION
#> # ℹ 6 more variables: SOURCE_DOCUMENTATION_REFERENCE <chr>,
#> #   CDM_ETL_REFERENCE <chr>, SOURCE_RELEASE_DATE <date>,
#> #   CDM_RELEASE_DATE <date>, CDM_VERSION <chr>, VOCABULARY_VERSION <chr>

# overwrite the cdm_source table with a new one that has two rows
insertTable(conn, 
            databaseSchema = "main",
            tableName = "cdm_source",
            data = cdm_source)
#> Inserting data took 0.0274 secs

tibble::tibble(querySql(conn, "select * from cdm_source"))
#> # A tibble: 2 × 10
#>   CDM_SOURCE_NAME           CDM_SOURCE_ABBREVIAT…¹ CDM_HOLDER SOURCE_DESCRIPTION
#>   <chr>                     <chr>                  <chr>      <chr>             
#> 1 Synthea synthetic health… Synthea                OHDSI Com… SyntheaTM is a Sy…
#> 2 Synthea synthetic health… Synthea                OHDSI Com… SyntheaTM is a Sy…
#> # ℹ abbreviated name: ¹​CDM_SOURCE_ABBREVIATION
#> # ℹ 6 more variables: SOURCE_DOCUMENTATION_REFERENCE <chr>,
#> #   CDM_ETL_REFERENCE <chr>, SOURCE_RELEASE_DATE <date>,
#> #   CDM_RELEASE_DATE <date>, CDM_VERSION <chr>, VOCABULARY_VERSION <chr>
disconnect(conn)

# now try running DQD
result <- DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  resultsDatabaseSchema = "main",
  cdmSourceName = "GiBleed_Eunomia",
  checkLevels = "TABLE",
  checkNames = c("cdmTable", "cdmField"),
  outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161033.jsonWriting results to file: ~/Desktop/example_dqd_output/synthea-20230728161033.json
#> Error in if (file == "") file <- stdout() else if (startsWith(file, "|")) {: the condition has length > 1

Created on 2023-07-28 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-07-28
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version date (UTC) lib source
#>  backports              1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  bit                    4.0.5   2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64                  4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  blob                   1.2.4   2023-03-17 [1] CRAN (R 4.2.0)
#>  cachem                 1.0.8   2023-05-01 [1] CRAN (R 4.2.0)
#>  checkmate              2.2.0   2023-04-27 [1] CRAN (R 4.2.0)
#>  cli                    3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  crayon                 1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  DatabaseConnector    * 6.2.3   2023-06-29 [1] CRAN (R 4.2.0)
#>  DataQualityDashboard   2.4.0   2023-07-28 [1] Github (ohdsi/DataQualityDashboard@e18c308)
#>  DBI                    1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  digest                 0.6.33  2023-07-07 [1] CRAN (R 4.2.0)
#>  dplyr                  1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  Eunomia                1.0.2   2023-07-28 [1] Github (ohdsi/Eunomia@e330860)
#>  evaluate               0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi                  1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap                1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  fs                     1.6.3   2023-07-20 [1] CRAN (R 4.2.0)
#>  generics               0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  glue                   1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  hms                    1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools              0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  jsonlite               1.8.7   2023-06-29 [1] CRAN (R 4.2.0)
#>  knitr                  1.43    2023-05-25 [1] CRAN (R 4.2.0)
#>  lifecycle              1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr               2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  memoise                2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
#>  ParallelLogger         3.2.0   2023-07-01 [1] Github (ohdsi/ParallelLogger@a239b8f)
#>  pillar                 1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig              2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr                  1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache                0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3            1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo                   1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils                2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6                     2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr                  2.1.4   2023-02-10 [1] CRAN (R 4.2.0)
#>  reprex                 2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rJava                  1.0-6   2021-12-10 [1] CRAN (R 4.2.0)
#>  rlang                  1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown              2.23    2023-07-01 [1] CRAN (R 4.2.0)
#>  RSQLite                2.3.1   2023-04-03 [1] CRAN (R 4.2.0)
#>  rstudioapi             0.15.0  2023-07-07 [1] CRAN (R 4.2.0)
#>  sessioninfo            1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  SqlRender              1.15.2  2023-07-27 [1] Github (ohdsi/SqlRender@d841ea7)
#>  stringi                1.7.12  2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr                1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler                 1.10.1  2023-06-05 [1] CRAN (R 4.2.0)
#>  tibble                 3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyselect             1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tzdb                   0.4.0   2023-05-12 [1] CRAN (R 4.2.0)
#>  utf8                   1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs                  0.6.3   2023-06-14 [1] CRAN (R 4.2.0)
#>  vroom                  1.6.3   2023-04-28 [1] CRAN (R 4.2.0)
#>  withr                  2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun                   0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml                   2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```
ablack3 commented 1 year ago

This problem occurs because the cdm abbreviation is taken from the cdm_source table. When there are two abbreviations, resultFilename becomes a length-2 character vector (here), which causes write to fail here.
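For illustration, the failure mode can be reproduced in plain base R. The variable names below are made up for this sketch and are not DQD's actual internals:

```r
# Sketch only: names are illustrative, not taken from DQD's source.
# With two rows in cdm_source, the abbreviation query returns a length-2 vector.
cdmSourceAbbreviation <- c("Synthea", "Synthea")

# paste0() silently vectorizes, so the result file name is also length 2.
resultFilename <- paste0(tolower(cdmSourceAbbreviation), "-20230728161033.json")
length(resultFilename)
#> [1] 2

# write() then receives a length-2 `file` argument; its internal
# `if (file == "") ...` check errors in R >= 4.2 because the condition
# has length > 1. This also explains the doubled "Writing results to file"
# log line in the reprex above.
```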

My suggestion would be simply to check that the cdm abbreviation and the output file name are each length-1 character vectors.
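A minimal sketch of what such a guard could look like (helper names are hypothetical; checkmate, which is already loaded in the session above, offers `assertString()` for the same purpose):

```r
# Hypothetical guard: error early with a clear message instead of failing
# deep inside write().
assertLength1String <- function(x, name) {
  if (!is.character(x) || length(x) != 1L) {
    stop(sprintf("'%s' must be a length-1 character vector (got length %d)",
                 name, length(x)))
  }
  invisible(x)
}

# A softer alternative: keep the first value and warn.
firstOrWarn <- function(x, name) {
  if (length(x) > 1L) {
    warning(sprintf("cdm_source returned %d values for '%s'; using the first",
                    length(x), name))
    x <- x[1L]
  }
  x
}
```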

ablack3 commented 1 year ago

Also note that this function creates a table in the main schema literally named "main.dqdashboard_results", which is probably not what was intended.

connectionDetails <- Eunomia::getEunomiaConnectionDetails()

# demonstrate that this works as is
result <- DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  resultsDatabaseSchema = "main",
  cdmSourceName = "GiBleed_Eunomia",
  checkLevels = "TABLE",
  checkNames = c("cdmTable", "cdmField"),
  outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> 
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161501.json
#> Execution Complete
#> Connecting using SQLite driver
#> Writing results to table main.dqdashboard_results
#>   |======================================================================| 100%
#> Executing SQL took 0.00429 secs
#> Warning: Column 'warning' is of type 'logical', but this is not supported by
#> many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'error' is of type 'logical', but this is not supported by many
#> DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'threshold_value' is of type 'logical', but this is not
#> supported by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'notes_value' is of type 'logical', but this is not supported
#> by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Inserting data took 0.0174 secs
#> Finished writing table

conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
library(DatabaseConnector)
getTableNames(conn)
#>  [1] "care_site"                "cdm_source"              
#>  [3] "cohort"                   "cohort_attribute"        
#>  [5] "concept"                  "concept_ancestor"        
#>  [7] "concept_class"            "concept_relationship"    
#>  [9] "concept_synonym"          "condition_era"           
#> [11] "condition_occurrence"     "cost"                    
#> [13] "death"                    "device_exposure"         
#> [15] "domain"                   "dose_era"                
#> [17] "drug_era"                 "drug_exposure"           
#> [19] "drug_strength"            "fact_relationship"       
#> [21] "location"                 "measurement"             
#> [23] "metadata"                 "note"                    
#> [25] "note_nlp"                 "observation"             
#> [27] "observation_period"       "payer_plan_period"       
#> [29] "person"                   "procedure_occurrence"    
#> [31] "provider"                 "relationship"            
#> [33] "source_to_concept_map"    "specimen"                
#> [35] "visit_detail"             "visit_occurrence"        
#> [37] "vocabulary"               "dqdashboard_results"     
#> [39] "main.dqdashboard_results"

tibble::tibble(querySql(conn, "select * from main.'main.dqdashboard_results'"))
#> # A tibble: 23 × 26
#>    NUM_VIOLATED_ROWS PCT_VIOLATED_ROWS NUM_DENOMINATOR_ROWS EXECUTION_TIME
#>                <dbl>             <dbl>                <dbl> <chr>         
#>  1                 0                 0                    1 0.004840 secs 
#>  2                 0                 0                    1 0.004684 secs 
#>  3                 0                 0                    1 0.005634 secs 
#>  4                 0                 0                    1 0.005763 secs 
#>  5                 0                 0                    1 0.005449 secs 
#>  6                 0                 0                    1 0.003824 secs 
#>  7                 0                 0                    1 0.002964 secs 
#>  8                 0                 0                    1 0.004513 secs 
#>  9                 0                 0                    1 0.003135 secs 
#> 10                 0                 0                    1 0.003447 secs 
#> # ℹ 13 more rows
#> # ℹ 22 more variables: QUERY_TEXT <chr>, CHECK_NAME <chr>, CHECK_LEVEL <chr>,
#> #   CHECK_DESCRIPTION <chr>, CDM_TABLE_NAME <chr>, CDM_FIELD_NAME <chr>,
#> #   CONCEPT_ID <chr>, UNIT_CONCEPT_ID <chr>, SQL_FILE <chr>, CATEGORY <chr>,
#> #   SUBCATEGORY <chr>, CONTEXT <chr>, WARNING <dbl>, ERROR <dbl>,
#> #   CHECKID <chr>, FAILED <dbl>, PASSED <dbl>, IS_ERROR <dbl>,
#> #   NOT_APPLICABLE <dbl>, NOT_APPLICABLE_REASON <chr>, THRESHOLD_VALUE <dbl>, …
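One plausible mechanism for the stray table name (a guess, not verified against DQD's source): the results table name gets schema-qualified once when it is built and again when it is written, and since SQLite has no real schemas the extra prefix ends up inside the literal table name:

```r
# Hypothetical reconstruction; not DQD's actual code.
resultsDatabaseSchema <- "main"
tableName <- "dqdashboard_results"

# Qualified once when the name is constructed...
writeTableName <- paste(resultsDatabaseSchema, tableName, sep = ".")

# ...and again at write time:
qualifiedName <- paste(resultsDatabaseSchema, writeTableName, sep = ".")
qualifiedName
#> [1] "main.main.dqdashboard_results"
```

On SQLite the schema prefix is not meaningful, so the table is created with the literal name "main.dqdashboard_results", matching the getTableNames() output above.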

Created on 2023-07-28 with reprex v2.0.2

ablack3 commented 1 year ago

Also just pointing out that, by default, this function returns its results in three different ways (two side effects and one return value):

  1. Results are returned from the function as an R dataframe
  2. Results are also written to a file
  3. Results are also loaded from R into a database table

I might suggest making the return value invisible as recommended here.
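For reference, the invisible-return pattern looks like this (toy function, not DQD code):

```r
# Toy example of returning results invisibly: the value does not auto-print
# at the console, but assignment and downstream use still work.
runChecksSketch <- function() {
  result <- data.frame(checkName = "cdmTable", passed = 1L)
  # ... side effects (write file, write table) would happen here ...
  invisible(result)
}

runChecksSketch()       # prints nothing at the console
x <- runChecksSketch()  # assignment captures the result as usual
nrow(x)
#> [1] 1
```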

katy-sadowski commented 1 year ago

Thanks for the detailed report, @ablack3 !

The CDM_SOURCE error is a known issue (#410) and is actually next in my queue to fix. In my response to the previous issue, I took the stance that DQD should enforce CDM_SOURCE containing only one row. However, the more I think about it, given that there's no hard-and-fast convention on this, I'd be open to allowing it, with a warning. We have 2 options in that case:

I prefer the 1st option, because if someone has more than a few rows in CDM_SOURCE their metadata is going to look very messy. I also do think that, in its current form, CDM_SOURCE is really only designed to hold a single row, and anyone storing multiple rows in there is sort of "hacking" the CDM. So it'll be difficult to anticipate all of the ways those folks are populating the table, at least until there's an official convention for this.

There is also a 3rd option, which is allowing users to input the metadata parameters as an alternative to pulling from CDM_SOURCE. I believe this was previously discussed and decided against, though, so I'd probably want to run it by the DQ working group before pursuing that.

What do you think?

PS - FYI, there's a second issue beyond the output file naming, which is the metadata passed into the results file and displayed in the Shiny app UI (https://github.com/OHDSI/DataQualityDashboard/blob/e18c308a141d381e65343472486ad02ad9904dad/R/executeDqChecks.R#L308-L315)

katy-sadowski commented 1 year ago

> Also just pointing out that, by default, this function returns its results in three different ways (two side effects and one return value):
>
>   1. Results are returned from the function as an R dataframe
>   2. Results are also written to a file
>   3. Results are also loaded from R into a database table
>
> I might suggest making the return value invisible as recommended here.

Concur with making return value invisible. I'll file a separate issue for that.

I also agree that we shouldn't be writing to the database by default; the default value for writeToTable should be FALSE and/or we should force users to specify the database table name (the same way we require a value for outputFolder). However, I think we will need to wait until the next major release to make this change as it could be breaking for some users expecting this default behavior. In the meantime I can update the docs to make it clear the default mode will write to the database.
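One way to stage that default change without surprising anyone is the usual deprecation pattern, sketched below (a hypothetical shim, not a committed design; the real function's other parameters are omitted):

```r
# Hypothetical transition shim: keep the old default for now, but warn when
# callers rely on it, so flipping writeToTable to FALSE in the next major
# release is announced ahead of time.
executeDqChecksSketch <- function(writeToTable = TRUE) {
  if (missing(writeToTable)) {
    warning(paste0(
      "writeToTable currently defaults to TRUE; this default will change to ",
      "FALSE in the next major release. Set it explicitly to keep the ",
      "current behavior."
    ))
  }
  writeToTable
}
```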

ablack3 commented 1 year ago

Thanks @katy-sadowski! I think either option would work for me. In general, the issue is about being clear and strict about our interfaces and the expected structure of our data types. Is there a CDM convention that the cdm_source table should have only one row? I wasn't aware of one, but I could easily have missed it. I wouldn't throw a warning unless there is something explicitly wrong with the CDM; maybe it should be a DQ conformance check instead. Let me know if you'd like any help.

katy-sadowski commented 1 year ago

> Is there a CDM convention that the cdm_source table should only have one row?

Not as far as I'm aware - I think this is still an open question (that's why I've changed my mind here not to block DQD execution if a database does have multiple rows in this table). The CDM documentation doesn't explicitly state this one way or the other, but the language describing each column seems to imply it's designed to hold a single row, to document metadata at the whole CDM instance level.

> I wouldn't throw a warning unless there is something explicitly wrong with the CDM.

The warning would just be to alert users that the metadata stored/displayed in DQD will only represent one of their CDM_SOURCE rows (assuming we go for the first solution). I also don't think we want to add a DQ check until there's an official CDM convention on this.

@clairblacketer do you have any input on this? Is there a convention for this (in which case CDM docs should probably be updated), or should we file a CDM GitHub issue to decide on one? Thanks 😄