TomWhite-MedStar opened 1 year ago
Thanks for sharing this idea @TomWhite-MedStar! This is on theme with the higher-level ideas we've been discussing around minimizing unnecessary repetition within & across DQD runs, and around performance in general.
We'll probably prioritize the broader overhaul of the DQD workflow we discussed at the Hackathon yesterday before considering any major changes to the check SQL & results/reporting like this one. It's possible we'll address these same requirements differently once we've changed the check execution workflow. And rethinking cohort-level execution will fit in well with subsequent work we hope to do on study- and cohort-level data quality. I'll certainly keep this in the backlog to revisit later on!
Problem Statement
On some databases, running DQD for a single cohort can be slower than running it against the entire dataset. For example, on our Databricks instance, profiling a single cohort takes longer than profiling the full CDM.
Enhancement Request
Add an option to profile all cohorts within cohortDatabaseSchema.cohortTableName simultaneously (a sketch of the resulting query shape is shown after this list):

- If `'@groupByCohort' == 'Yes'`, add a `GROUP BY c.cohort_definition_id` clause to the generated query
- If `'@groupByCohort' == 'Yes'`, add `c.cohort_definition_id` to the list of fields returned by each query's SELECT statement
- If `'@groupByCohort' == 'Yes'`, add `cohort_definition_id INTEGER` to the table DDL in result_table_ddl.sql
- Make the `AND c.cohort_definition_id = @cohortDefinitionId` clauses optional, and change them to `AND c.cohort_definition_id IN (@cohortDefinitionId)`. That way, if users do not pass a value for cohortDefinitionId, DQD will process all cohorts in cohortDatabaseSchema.cohortTableName. Otherwise, users can pass either a single numerical value for cohortDefinitionId or a list of numerical values.
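To illustrate, here is a rough sketch of what a cohort-scoped check query might look like with all of the proposed changes applied. This is not the actual DQD check template; the measurement table, join columns, and output column names are illustrative assumptions:

```sql
-- Hypothetical sketch only; not the actual DQD check template.
SELECT
  c.cohort_definition_id,                 -- proposed: expose the cohort id in the SELECT
  COUNT(*) AS num_rows,
  SUM(CASE WHEN m.value_as_number IS NULL
           THEN 1 ELSE 0 END) AS num_violated_rows
FROM @cdmDatabaseSchema.measurement m
JOIN @cohortDatabaseSchema.@cohortTableName c
  ON m.person_id = c.subject_id
WHERE c.cohort_definition_id IN (@cohortDefinitionId)  -- proposed: IN instead of =
GROUP BY c.cohort_definition_id;                       -- proposed: one result row per cohort
```

If I recall correctly, SqlRender renders a vector parameter value as a comma-separated list, which would let `IN (@cohortDefinitionId)` accept either a single id or a list of ids without further template changes.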
Benchmarking Data
Our OMOP instance has 5M patients and 8 years of EHR data, totaling 8.5B records across the non-vocabulary CDM tables (excluding the notes tables).
We use the following DQD configuration to pre-generate SQL code in Spark format:
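(The exact configuration is not reproduced here; a minimal sketch of this kind of setup, assuming the standard `executeDqChecks()` entry point with `sqlOnly = TRUE`, might look like the following. The connection string and schema names are placeholders, not our actual settings:)

```r
# Hypothetical sketch; placeholder values throughout.
library(DataQualityDashboard)

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "spark",
  connectionString = "<jdbc connection string>"  # placeholder
)

DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "omop",          # placeholder schema names
  resultsDatabaseSchema = "dqd_results",
  cdmSourceName = "EHR 5M",
  sqlOnly = TRUE,                      # write out Spark-dialect SQL instead of executing
  outputFolder = "output"
)
```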
When we run the generated SQL code against a Databricks Serverless XL warehouse, it takes:
I suspect that it would take only a small percentage more time to process all cohorts concurrently.