jhseeman commented 2 months ago

Background

syntheval currently uses one replicate for each evaluation, which obscures the critical effect of randomness in assessing synthetic data disclosure risk and utility. This issue would update syntheval to work with multiple replicates, enabling empirical assessment of this randomness regardless of what form it takes. Here, we focus on updating existing metrics for collections of pointwise statistics, although working with multiple replicates introduces new possibilities for other metrics.

Design changes

Currently, functions in syntheval accept either postsynth or tibble / data.frame. There are two approaches we could take here:

1. Create new functions using the _multirep suffix (ex: util_ci_overlap_multirep()) that explicitly handle multiple-replicate logic.
2. Modify the existing functions to additionally accept list[postsynth] or list[tibble] / list[data.frame].

I'm personally in favor of option 1 for the following reasons:

Option 1 allows for easier parallelization and avoids potential memory issues from recursion in Option 2.
Option 2 could produce long functions that aren't modular, especially if the logic for multiple replicates differs significantly from single replicates.

Open to suggestions / feedback here!

Pointwise statistic distributions

The following methods admit straightforward analogues using multiple replicates by producing distributions of a collection of pointwise statistics:

[ ] util_ci_overlap.R
[ ] util_co_occurrence.R
[ ] util_ks_distance.R
[ ] util_moments.R
[ ] util_percentiles.R
[ ] util_proportions.R
[ ] util_tails.R
[ ] util_totals.R

For each pointwise statistic (eventually a row) in the one-replicate case, we replace it with distributional summary statistics in the multiple-replicate case. Here's an example of util_moments() output:

# A tibble: ? × 8
  variable statistic original synth_min synth_q1 synth_med synth_q3 synth_max
  <fct>    <fct>        <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
1 x1       mean           0.1      -0.5     -0.3       0.1      0.4       1.2
2 x1       mean_diff      0.0      -0.6     -0.2       0.0      0.3       1.1
# etc ...

We can also include an optional argument (akin to simplify=FALSE) that simply returns the evaluation metric applied to each replicate.
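A minimal base R sketch of that reduction, assuming each replicate yields a data frame with the same (variable, statistic) rows and one `synthetic` value. The helper name `summarize_replicates()`, the `simplify` argument, and the input values are all hypothetical, and the `original` column from the example above is left out for brevity:

```r
# Hypothetical per-replicate output: one data frame per replicate with
# identical (variable, statistic) rows and a single `synthetic` value column.
rep_results <- lapply(1:5, function(i) {
  data.frame(
    variable  = c("x1", "x1"),
    statistic = c("mean", "sd"),
    synthetic = c(i / 10, 1 + i / 10)
  )
})

# Hypothetical helper (not part of syntheval).
summarize_replicates <- function(results, simplify = TRUE) {
  # simplify = FALSE: return the metric for each replicate, unreduced
  if (!simplify) {
    return(results)
  }

  stacked <- do.call(rbind, results)

  # Five-number summary of `synthetic` within each (variable, statistic) cell;
  # aggregate() stores the five values as a matrix column
  agg <- aggregate(
    synthetic ~ variable + statistic,
    data = stacked,
    FUN = function(x) quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  )

  summary_cols <- as.data.frame(agg$synthetic)
  names(summary_cols) <- c("synth_min", "synth_q1", "synth_med", "synth_q3", "synth_max")

  cbind(agg[c("variable", "statistic")], summary_cols)
}

replicate_summary <- summarize_replicates(rep_results)
per_replicate <- summarize_replicates(rep_results, simplify = FALSE)
```

Keeping the unreduced list available means other summaries (say, a different set of quantiles) could be computed later without re-running the metric.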

Metric-specific considerations:

Wide vs. long format: some outputs are currently in wider format (ex: statistic names are listed as columns instead of rows, unlike the mean differences above) and would need to be pivoted to longer format.
Non-tabular format: some outputs are currently in non-tabular format (ex: correlation matrices) and would need to be converted to/from the format above.
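Both conversions can be sketched in base R. The column names and values below are made up for illustration; inside the package, `tidyr::pivot_longer()` would likely be the more idiomatic way to do the first step:

```r
# Wide output: one column per statistic (values are made up for illustration)
wide <- data.frame(
  variable  = c("x1", "x2"),
  mean_diff = c(0.1, -0.2),
  sd_diff   = c(0.0, 0.3)
)

stat_cols <- setdiff(names(wide), "variable")

# Long output: one row per (variable, statistic) pair
long <- data.frame(
  variable  = rep(wide$variable, times = length(stat_cols)),
  statistic = rep(stat_cols, each = nrow(wide)),
  value     = unlist(wide[stat_cols], use.names = FALSE)
)

# Non-tabular output: a correlation matrix flattened to long format ...
corr_mat <- cor(data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(2, 1, 4, 3),
  x3 = c(4, 3, 2, 1)
))
corr_long <- as.data.frame(as.table(corr_mat), responseName = "value")

# ... and rebuilt from it (as.table() flattens column by column)
corr_back <- matrix(
  corr_long$value,
  nrow = nrow(corr_mat),
  dimnames = list(levels(corr_long$Var1), levels(corr_long$Var2))
)
```

Once every metric is in the long shape, the same per-cell summary across replicates applies to all of them.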

awunderground commented 2 months ago

How does option 1 make parallelization easier and avoid memory issues?

I worry about doubling the number of functions. What if we:

Create _backend functions like util_moments_backend() that we don't export.
Simplify the existing functions like util_moments() to include methods for individual syntheses and multiple replicates that call the _backend functions.
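A rough sketch of what that could look like with S3 dispatch. Everything here is hypothetical: the `_sketch` names, the `replicate_list` class used to mark multiple replicates, and the backend, which only computes means rather than the real util_moments() logic:

```r
# Unexported backend: the metric for ONE synthetic data frame.
# Hypothetical stand-in for util_moments_backend(); only means are computed.
moments_backend_sketch <- function(synth_data, conf_data) {
  num_vars <- names(conf_data)[vapply(conf_data, is.numeric, logical(1))]
  data.frame(
    variable  = num_vars,
    statistic = "mean",
    original  = vapply(conf_data[num_vars], mean, numeric(1)),
    synthetic = vapply(synth_data[num_vars], mean, numeric(1)),
    row.names = NULL
  )
}

# Public generic: one exported name, one method per input type.
util_moments_sketch <- function(x, conf_data, ...) {
  UseMethod("util_moments_sketch")
}

# Single synthesis
util_moments_sketch.data.frame <- function(x, conf_data, ...) {
  moments_backend_sketch(x, conf_data)
}

# Multiple replicates, marked by a hypothetical "replicate_list" class
util_moments_sketch.replicate_list <- function(x, conf_data, ...) {
  lapply(x, moments_backend_sketch, conf_data = conf_data)
}

conf <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
synth_1 <- data.frame(a = c(2, 3, 4), b = c(10, 20, 60))
synth_2 <- data.frame(a = c(0, 1, 2), b = c(30, 30, 30))

single <- util_moments_sketch(synth_1, conf)
multi <- util_moments_sketch(
  structure(list(synth_1, synth_2), class = "replicate_list"),
  conf
)
```

The public function name stays the same, and the multiple-replicate method is a thin loop over the backend, so it could be swapped for a parallel map without touching the metric code.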

Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (iterate_tabular_metrics(), reduce_tabular_metrics()) for these actions and then add them into the existing metrics.

Elyse worked on grouping for util_corr_fit(), which may help with syntax for non-tabular format.

jhseeman commented 2 months ago

> I worry about doubling the number of functions. What if we:
>
> 1. Create `_backend` functions like `util_moments_backend()` that we don't export.
> 2. Simplify the existing functions like `util_moments()` to include methods for individual syntheses and multiple replicates that call the `_backend` functions.

I like this idea - I guess we're still doubling the number of total functions, but the public functions stay the same!

> Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (iterate_tabular_metrics(), reduce_tabular_metrics()) for these actions and then add them into the existing metrics.

Yep, definitely some shared utilities here.

> Elyse worked on grouping for util_corr_fit(), which may help with syntax for non-tabular format.

Will take a look. Are there plans to incorporate similar group-by logic in other places? Also, I'm not sure about the current state of, or plans for, that PR.

awunderground commented 2 months ago

I think that PR is close but it would need to be resurrected, which is something I haven't considered.

awunderground commented 2 months ago

Be careful: tibbles are lists and could lead to some confusion in code you write.
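To make the pitfall concrete, here is a small base R sketch. It uses a plain data.frame, but tibbles inherit from data.frame, so the same checks apply to them. The `is_replicate_list()` helper is hypothetical:

```r
df <- data.frame(x = 1:3, y = 4:6)

is.list(df)   # TRUE: a data frame is a list of columns
length(df)    # 2: the number of columns, not rows

# Hypothetical check: a non-empty list of data frames that is not itself
# a data frame. Testing is.list() alone would treat `df` as two "replicates".
is_replicate_list <- function(x) {
  is.list(x) &&
    !is.data.frame(x) &&
    length(x) > 0 &&
    all(vapply(x, is.data.frame, logical(1)))
}

is_replicate_list(df)            # FALSE
is_replicate_list(list(df, df))  # TRUE
```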

I think we should create a multipostsynth class in tidysynthesis.
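One way such a class could be sketched, purely as an illustration: the constructor name `new_multipostsynth()`, the `replicates` field, and the mock postsynth objects below are assumptions, not the tidysynthesis API.

```r
# Hypothetical constructor: wraps a list of postsynth objects in a dedicated
# class so that methods can dispatch on it instead of inspecting bare lists.
new_multipostsynth <- function(replicates) {
  stopifnot(
    is.list(replicates),
    !is.data.frame(replicates),
    length(replicates) > 0,
    all(vapply(replicates, inherits, logical(1), what = "postsynth"))
  )
  structure(list(replicates = replicates), class = "multipostsynth")
}

# Mock postsynth objects for illustration only; real ones come from
# tidysynthesis and carry more than a synthetic_data element.
mock_postsynth <- function(data) {
  structure(list(synthetic_data = data), class = "postsynth")
}

ps_1 <- mock_postsynth(data.frame(x = 1:3))
ps_2 <- mock_postsynth(data.frame(x = 4:6))

mps <- new_multipostsynth(list(ps_1, ps_2))

# A bare data frame is rejected rather than silently treated as replicates
bad <- try(new_multipostsynth(data.frame(x = 1:3)), silent = TRUE)
```

With an explicit class, a function like util_moments() would only need a `multipostsynth` method and would never have to guess whether a list is a tibble or a set of replicates.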

UrbanInstitute / syntheval

General multiple replicate support for pointwise statistic distributions #86
