dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Changing how we create subsets #944

Open KasunAmare opened 1 month ago

KasunAmare commented 1 month ago

In the current version of triage, subsets are treated as subsets of entities, independent of the cohorts we create (unless the cohort query is added to the subset query). As a result, subset tables tend to contain the same entity_id repeated across many as-of dates, even when the entity is not part of the cohort for those dates, and the tables can become very large.

The problem is more acute when the universe of entities is large. To counter this, we could either include the cohort query in every subset query written in the experiment config (e.g., as a CTE), or modify triage under the hood so that a subset only includes entities from the respective cohort (treating a subset as a subset of a cohort rather than a subset of all entities). This PR attempts the latter.
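To make the difference concrete, here is a minimal sketch of the two behaviours using an in-memory SQLite database. It is not triage's actual implementation: the `events` and `cohort` tables, the `build_subset_rows` helper, and the `restrict_to_cohort` flag are all hypothetical names chosen for illustration, and the `{as_of_date}` placeholder is assumed to be substituted into the subset query once per as-of date.

```python
import sqlite3

# Hypothetical subset query with an {as_of_date} placeholder (an assumption
# made for this sketch, not copied from a real experiment config).
SUBSET_QUERY = (
    "SELECT DISTINCT entity_id FROM events "
    "WHERE event_date < '{as_of_date}'"
)


def make_demo_connection():
    """Create toy 'events' and 'cohort' tables (illustrative names only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (entity_id INTEGER, event_date TEXT)")
    conn.execute("CREATE TABLE cohort (entity_id INTEGER, as_of_date TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "2020-01-01"), (2, "2020-01-01"), (3, "2020-06-01")],
    )
    # Entity 2 is never in the cohort; entity 3 joins only for the later date.
    conn.executemany(
        "INSERT INTO cohort VALUES (?, ?)",
        [(1, "2021-01-01"), (1, "2022-01-01"), (3, "2022-01-01")],
    )
    return conn


def build_subset_rows(conn, subset_query, as_of_dates, restrict_to_cohort):
    """Return sorted (entity_id, as_of_date) rows for a subset.

    With restrict_to_cohort=False, every entity matched by the subset query
    is kept for every as-of date. With restrict_to_cohort=True, rows are
    inner-joined to the cohort for the same as-of date.
    """
    rows = []
    for as_of_date in as_of_dates:
        query = subset_query.format(as_of_date=as_of_date)
        if restrict_to_cohort:
            sql = (
                "SELECT s.entity_id, ? AS as_of_date "
                f"FROM ({query}) AS s "
                "JOIN cohort AS c "
                "ON c.entity_id = s.entity_id AND c.as_of_date = ?"
            )
            params = (as_of_date, as_of_date)
        else:
            sql = f"SELECT s.entity_id, ? AS as_of_date FROM ({query}) AS s"
            params = (as_of_date,)
        rows.extend(conn.execute(sql, params).fetchall())
    return sorted(rows)


conn = make_demo_connection()
dates = ["2021-01-01", "2022-01-01"]
independent = build_subset_rows(conn, SUBSET_QUERY, dates, restrict_to_cohort=False)
restricted = build_subset_rows(conn, SUBSET_QUERY, dates, restrict_to_cohort=True)
print(len(independent), len(restricted))
```

In this toy data the cohort-independent subset keeps all three entities for both as-of dates, while the cohort-restricted version keeps only the entity and date pairs that also appear in the cohort. The same effect could be had by writing the join into each subset query by hand, which is the first option described above.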

Merging this PR will: