Update `disc_mit()` for multiple replicates

jhseeman commented 4 months ago

Right now, `disc_mit()` only uses one replicate of the synthetic data. This makes it difficult to establish how the randomness injected into the synthetic data generation process protects against disclosure, since there is always some (small) probability of releasing a dataset nearly identical to the confidential data. This PR extends the aggregated and disaggregated metrics (from #79) to account for the empirical variability across multiple replicates of the synthetic data.

- Aggregated metrics become distributions over replicates, reported either through summary statistics (mean and variance) or through other distribution summaries.
- Disaggregated metrics are computed per (replicate × confidential data record) pair, which enables multiple-replicate-aware statistics (e.g., what is the average true-positive rate at a low false-positive rate for records with specific attributes?).
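
As a rough illustration of the two kinds of summaries, here is a minimal base R sketch. Everything in it is hypothetical: the `flagged` matrix, the `group` attribute, and the toy values are made up and are not `disc_mit()` output. It assumes every column is a true training record, so the share of `TRUE` values is a true-positive rate, and it skips the fixed false-positive-rate thresholding a real attack would need.

```r
# Hypothetical attack results (not syntheval output): rows are synthetic
# replicates, columns are confidential records, TRUE means the attack
# flagged the record as a member.
flagged <- matrix(
  c(TRUE, FALSE, TRUE,  FALSE,
    TRUE, TRUE,  FALSE, FALSE,
    TRUE, FALSE, FALSE, FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("rep", 1:3), paste0("rec", 1:4))
)

# Hypothetical attribute of each confidential record
group <- c("a", "a", "b", "b")

# Aggregated: one true-positive rate per replicate, then summarize the
# distribution over replicates with its mean and variance.
tpr_by_replicate <- rowMeans(flagged)
agg <- c(mean = mean(tpr_by_replicate), var = var(tpr_by_replicate))

# Disaggregated: how often each record is flagged across replicates,
# then averaged within groups of records sharing an attribute.
rate_by_record <- colMeans(flagged)
rate_by_group <- tapply(rate_by_record, group, mean)

print(agg)
print(rate_by_group)
```

Here `rec1` is flagged in every replicate while `rec4` never is, which is exactly the record-level variation a single replicate cannot show.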

awunderground commented 4 months ago

This is related to #70.

Should we have a `postsynth` object (replicates == 1) and a `multipostsynth` object (replicates > 1)? Then we could make all utility and disclosure metrics generate different outputs for these different input objects.

jhseeman commented 4 months ago

Also related to #30 for multiple imputation. I'm more in favor of tidysynthesis-agnostic behavior (so either a list of `postsynth`s or a list of data frames), which would also be easier to parallelize and scale in the future.
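
To make the list-based option concrete, here is a minimal sketch, assuming plain data frames only. The helper names `as_replicate_list()` and `map_replicates()` are hypothetical and do not exist in syntheval, and `postsynth` handling is left out.

```r
# Hypothetical helper: wrap a single data frame in a list so downstream
# code always sees a list of replicates. The data frame check comes first
# because a data frame is itself a list.
as_replicate_list <- function(x) {
  if (is.data.frame(x)) list(x) else x
}

# Hypothetical wrapper: apply any per-replicate metric to each replicate.
map_replicates <- function(x, metric_fn, ...) {
  lapply(as_replicate_list(x), metric_fn, ...)
}

# Toy replicates (made-up values)
one <- data.frame(age = c(30, 40))
many <- list(one, data.frame(age = c(50, 70)))

mean_age <- function(d) mean(d$age)

single_result <- unlist(map_replicates(one, mean_age))
multi_result <- unlist(map_replicates(many, mean_age))

print(single_result)
print(multi_result)
```

Because each replicate is processed independently, the `lapply()` call could later be swapped for a parallel equivalent without changing the metric functions.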

UrbanInstitute / syntheval #80