OHDSI / DataQualityDashboard

A tool to help improve data quality standards in observational data science.
https://ohdsi.github.io/DataQualityDashboard
Apache License 2.0
145 stars 96 forks source link

enhancement: profile all cohorts concurrently #491

Open TomWhite-MedStar opened 1 year ago

TomWhite-MedStar commented 1 year ago

Problem Statement

On some databases, profiling DQD for a single cohort may be slower than profiling the entire dataset. For example, on our Databricks instance, it takes longer to profile a single cohort than it does to profile the entire dataset.

Enhancement Request

Add option to profile all cohorts within cohortDatabaseSchema.cohortTableName simulataneously by.

  1. Add a Boolean parameter -- perhaps groupByCohort
  2. When '@groupByCohort' == 'Yes', add a GROUP BY c.cohort_definition_id clause to the generated query
  3. When '@groupByCohort' == 'Yes', add c.cohort_definition_id to the list of fields returned by each query's SELECT statement
  4. When '@groupByCohort' == 'Yes', add cohort_definition_id INTEGER to the table DDL in result_table_ddl.sql
  5. Update conditional SQL generation to make the clauses AND c.cohort_definition_id = @cohortDefinitionId optional; and change them to AND c.cohort_definition_id IN (@cohortDefinitionId). That way, if uses do not pass a value for cohortDefinitionId, DQD will process all cohorts in cohortDatabaseSchema.cohortTableName. Or, users can pass either a single numerical value for cohortDefinitionId or a list of numerical values.

Benchmarking Data

Our OMOP instance has 5M patients and 8 years of EHR data. This generates an OMOP instance with 8.5B records across the non-vocabulary CDM tables (excluding the notes tables).

We use the following DQD configuration to pre-generates SQL code in Spark format:

sqlOnly <- TRUE
sqlOnlyUnionCount <- 100
sqlOnlyIncrementalInsert <- TRUE

When we run the generated SQL code against a Databricks Serverless XL warehouse, it takes:

I suspect that it would take only a small percentage more time to process all cohorts concurrently.

katy-sadowski commented 1 year ago

Thanks for sharing this idea @TomWhite-MedStar ! This is on theme with the higher level ideas we've been discussing around minimizing unnecessary repetition within & across DQD runs, and performance in general.

We'll probably prioritize the broader overhaul of DQD workflow we were discussing in the Hackathon yesterday before considering any major changes to the check SQL & results/reporting like this one. It's possible we'll address these same requirements differently once we've changed the check execution workflow. And rethinking cohort-level execution will fit in well with subsequent work we hope to do on study/cohort-level data quality. I'll certainly keep this in the backlog to revisit later on!