Empiric threshold methodology for DQ Heel rules

vojtechhuser commented 8 years ago

Assuming thresholds can be changed dynamically (see the other related issues on GitHub about this), the ongoing DQ study is using the following methodology to arrive at empiric (notification-grade) thresholds.

From the draft DQ study manuscript:

Results

A total of x (currently 7) datasets were compared in the study. Results are divided into multiple sections.

Data Density

Achilles Heel version 1.3 contained no DQA rules that would indicate low data density using a threshold approach. The motivation for this study was to identify an empiric threshold. We used the 10th percentile (or the 90th percentile for some measures) as the empiric threshold value that generates a DQA error.
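To make the percentile step concrete, here is a minimal sketch of deriving such a threshold from one measure's value in each participating dataset. The per-dataset values are made up for illustration, and the linear-interpolation convention (the default in R's `quantile`) is an assumption; the manuscript excerpt does not say which percentile definition the study used.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("need at least one value")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0   # fractional index into the sorted list
    lo = int(pos)                     # index of the lower neighbour
    hi = min(lo + 1, len(xs) - 1)     # index of the upper neighbour
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Hypothetical values of one density measure across 7 datasets.
dataset_values = [1200, 1500, 900, 1800, 600, 1700, 1600]

low_threshold = percentile(dataset_values, 10)   # flag databases below this
high_threshold = percentile(dataset_values, 90)  # for measures where high is suspicious
print(low_threshold, high_threshold)
```

With only a handful of datasets the 10th percentile sits very close to the minimum, so a threshold derived this way is probably best treated as provisional until more datasets contribute.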

Table X lists the data density measures.

Data density can differ with age, and in a future version of Achilles we hope to implement age-decile-specific values for some of the measures. There are two types of density to consider (below, we use examples for a laboratory measurement):

(1) “concepts per person”: the number of distinct measurements per person (e.g., a count of 2 measurements per person, such as cholesterol and hematocrit; “data breadth”)
(2) “records per person”: the total number of all measurement records per person (e.g., a count of 8 tests, such as 3 LDL cholesterol and 5 hematocrit measurements; “data depth”).
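The difference between the two density types can be sketched with the example from the text. The record layout (a list of `(person_id, concept)` pairs) and the function name are hypothetical and only for illustration; Achilles itself computes these measures from CDM tables.

```python
from collections import defaultdict


def density_per_person(records):
    """Map person_id -> (distinct concepts, total records).

    The first number is "data breadth", the second is "data depth".
    """
    concepts = defaultdict(set)
    counts = defaultdict(int)
    for person_id, concept in records:
        concepts[person_id].add(concept)
        counts[person_id] += 1
    return {p: (len(concepts[p]), counts[p]) for p in counts}


# One person with 3 LDL cholesterol and 5 hematocrit measurements.
records = [(1, "LDL cholesterol")] * 3 + [(1, "hematocrit")] * 5
print(density_per_person(records))  # {1: (2, 8)}
```

Persons without any measurement record never appear in the result, so whether they count in the denominator of a database-level average is a separate choice that this sketch does not make.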

Another excerpt from the manuscript:

Terminology

Achilles uses the term analysis (identifier: analysis_id) to denote a precomputed value from patient-level CDM data, and derived measure to denote precomputed values that are further derived from Achilles analyses. We use the term data measure to refer to both Achilles analyses and Achilles derived measures.

Achilles Heel allows different levels of output: error for serious DQA errors, warning for less important errors, and notification for the least important or widely generalizable errors. We use the term error to refer to all three flavors of output. (Edit: we now use the term message, since the output is not just errors.)

What would be a better approach? (Again, this is meant for notifications and for the initial Heel report.) The best scenario is one where DQA thresholds can be tweaked by the data customer (even per database or database type).

vojtechhuser commented 8 years ago

Example values for such an approach:

               measure_id median percentile10   min   max
       DrugEra:ConceptCnt   1669        593.4     0  1894
  DrugExposure:ConceptCnt  11813       1636.2  1461 41084
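
As a rough illustration of how these reference values could drive a notification-grade rule, the sketch below compares a database's own measures against the `percentile10` column above. The observed values, the function name and the message wording are all hypothetical; this is not how Heel emits its output.

```python
# percentile10 values copied from the table above.
REFERENCE_P10 = {
    "DrugEra:ConceptCnt": 593.4,
    "DrugExposure:ConceptCnt": 1636.2,
}


def check_density(observed, reference=REFERENCE_P10):
    """Return one notification message per measure that falls below its threshold."""
    messages = []
    for measure_id, threshold in reference.items():
        value = observed.get(measure_id)
        if value is None:
            continue  # measure not computed for this database; nothing to compare
        if value < threshold:
            messages.append(
                f"NOTIFICATION: {measure_id} = {value} is below "
                f"the empiric 10th percentile ({threshold})"
            )
    return messages


# Hypothetical values for a database being checked.
observed = {"DrugEra:ConceptCnt": 410, "DrugExposure:ConceptCnt": 5200}
for message in check_density(observed):
    print(message)
```

Passing the thresholds in as a dictionary rather than hard-coding them in the rule is one way to let a data customer override them per database or database type.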

vojtechhuser commented 8 years ago

The DataQuality package provides more reference values (from 10 datasets): https://github.com/OHDSI/StudyProtocolSandbox/blob/master/DataQuality/inst/csv/empiric_reference.csv

A possible approach to rules may look like this: https://github.com/OHDSI/StudyProtocolSandbox/blob/master/DataQuality/inst/csv/empiric_rule.csv

fdefalco commented 3 years ago

Heel has been superseded by DQD and is no longer under development.

OHDSI / Achilles

Empiric threshold methodology for DQ Heel rules #148