OHDSI / CohortIncidence

Contains the Java and R assets to perform Incidence calculations on a CDM
https://ohdsi.github.io/CohortIncidence/
5 stars 2 forks source link

Optimize #incidence_subgroups creation #46

Closed chrisknoll closed 8 months ago

chrisknoll commented 11 months ago

When calculating incidence on large cohort sizes (>10M people), the brute-force approach where the individual persons in the dataset are SUM()....GROUP BY leads to an aggregation of 10M Records per TAR, per Outcome (ie: rows number of TARs number of Outcomes). Then, depending on the strata settings, this Very Large Dataset is scanned multiple times to do the different age, sex, year stratification, leading to additional sweeps on this intermediate table.

To optimize this, the 'baseline' Time-at-Risk can be aggregated up front, and then only those persons with cases can be aggregated to calculate exclusion time that then feed back into the TAR aggregate to create the completed #incidence_subgroup table.

chrisknoll commented 8 months ago

This has been implemented in v3.3.0.