UrbanInstitute / syntheval


Update `disc_mit()` for multiple replicates #80

Open jhseeman opened 4 months ago

jhseeman commented 4 months ago

Right now, `disc_mit()` only uses one replicate of the synthetic data. This makes it difficult to establish how the randomness injected into the synthetic data generation process protects against disclosure, since there is always some (small) probability of releasing a dataset nearly identical to the confidential data. This PR extends the aggregated and disaggregated metrics (from #79) to account for the empirical variability across multiple replicates of the synthetic data.
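
As a rough illustration of measuring variability across replicates, here is a minimal sketch in Python. It is for illustration only: syntheval is an R package, and nothing here reflects how `disc_mit()` is actually implemented. The function names `exact_match_rate()` and `summarize_over_replicates()` are hypothetical, the exact-match rate is only a stand-in for a real disclosure metric, and the toy records are made up.

```python
from statistics import mean, stdev

def exact_match_rate(confidential, synthetic):
    """Hypothetical stand-in metric: share of synthetic records that
    exactly match some confidential record."""
    conf_set = set(confidential)
    if not synthetic:
        return 0.0
    return sum(rec in conf_set for rec in synthetic) / len(synthetic)

def summarize_over_replicates(confidential, replicates, metric=exact_match_rate):
    """Compute the metric once per replicate, then summarize the spread."""
    values = [metric(confidential, rep) for rep in replicates]
    return {
        "per_replicate": values,                         # disaggregated view
        "mean": mean(values),                            # aggregated view
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "max": max(values),                              # worst replicate
    }

# Toy data: each record is a tuple, each replicate is a list of records.
confidential = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
replicates = [
    [(1, "a"), (2, "x"), (3, "y"), (4, "z")],  # 1 of 4 records match
    [(1, "a"), (2, "b"), (9, "c"), (4, "q")],  # 2 of 4 records match
    [(5, "e"), (6, "f"), (7, "g"), (8, "h")],  # no records match
]
summary = summarize_over_replicates(confidential, replicates)
```

Reporting the maximum alongside the mean is one way to surface the rare replicate that lands close to the confidential data, which a single-replicate metric cannot show.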

awunderground commented 4 months ago

This is related to #70.

Should we have a postsynth object (replicates == 1) and a multipostsynth object (replicates > 1)? Then we could make all utility and disclosure metrics generate different outputs for these different input objects.

jhseeman commented 4 months ago

Also related to #30 for multiple imputation. I'm more in favor of tidysynthesis-agnostic behavior (so either a list of postsynths or a list of data frames). This would also be easier to parallelize and scale in the future.
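
The list-based interface discussed above could look roughly like the following sketch. It is written in Python purely for illustration and is not the syntheval or tidysynthesis API. The helpers `as_replicate_list()`, `count_unique_records()`, and `map_over_replicates()` are hypothetical, and the rule for telling one dataset from a list of datasets (a dataset is a list of tuples, so a list of lists means several replicates) is an assumption made for this toy example.

```python
from concurrent.futures import ThreadPoolExecutor

def as_replicate_list(synth):
    """Accept one dataset (list of tuples) or a list of datasets.

    Assumption for this sketch: if the first element is itself a list,
    the input is already a list of replicates.
    """
    if synth and isinstance(synth[0], list):
        return synth
    return [synth]

def count_unique_records(dataset):
    """Hypothetical placeholder metric computed on a single replicate."""
    return len(set(dataset))

def map_over_replicates(synth, metric, max_workers=2):
    """Apply the metric to every replicate; results keep input order."""
    replicates = as_replicate_list(synth)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(metric, replicates))

single = [(1, "a"), (1, "a"), (2, "b")]
many = [single, [(3, "c")]]

result_single = map_over_replicates(single, count_unique_records)
result_many = map_over_replicates(many, count_unique_records)
```

Because each replicate is processed independently, the per-replicate step is a plain map, which is what makes this design straightforward to parallelize.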