jhseeman opened this issue 2 months ago
How does option 1 make parallelization easier and avoid memory issues?
I worry about doubling the number of functions. What if we:

1. Create `_backend` functions like `util_moments_backend()` that we don't export.
2. Simplify the existing functions like `util_moments()` to include methods for individual syntheses and multiple replicates that call the `_backend` functions.

Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (`iterate_tabular_metrics()`, `reduce_tabular_metrics()`) for these actions and then add them into the existing metrics.

Elyse worked on grouping for `util_corr_fit()`, which may help with syntax for non-tabular formats.
> I worry about doubling the number of functions. What if we:
>
> 1. Create `_backend` functions like `util_moments_backend()` that we don't export.
> 2. Simplify the existing functions like `util_moments()` to include methods for individual syntheses and multiple replicates that call the `_backend` functions.
I like this idea - I guess we're still doubling the number of total functions, but the public functions stay the same!
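A rough sketch of how that split might look. Everything here is hypothetical: `util_moments_backend()` is the proposed (not existing) backend, and the wrapper's dispatch logic and arguments are assumptions, not the real syntheval API.

```r
# Hypothetical sketch: an un-exported backend holding the single-replicate
# logic, plus an exported wrapper that dispatches on input type.

util_moments_backend <- function(postsynth, data) {
  # ... the existing single-replicate util_moments() logic would move here ...
}

util_moments <- function(postsynth, data) {
  # tibbles/data.frames and postsynth objects are lists too, so check for
  # them explicitly before treating the input as a list of replicates
  if (is.list(postsynth) &&
      !is.data.frame(postsynth) &&
      !inherits(postsynth, "postsynth")) {
    # multiple replicates: run the backend per replicate and stack results
    results <- lapply(postsynth, util_moments_backend, data = data)
    dplyr::bind_rows(results, .id = "replicate")
  } else {
    # single synthesis: fall through to the backend directly
    util_moments_backend(postsynth, data)
  }
}
```

The explicit `is.data.frame()` / `inherits()` guards address the warning elsewhere in this thread that tibbles are lists and can be mistaken for a list of replicates.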
> Alternatively, many of the functions can probably use the same code for iterating the metrics and then reducing the results. We could probably create common functions (`iterate_tabular_metrics()`, `reduce_tabular_metrics()`) for these actions and then add them into the existing metrics.
Yep, definitely some shared utilities here.
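Those shared utilities might be as small as the following sketch (`iterate_tabular_metrics()` and `reduce_tabular_metrics()` are proposed names, not existing functions, and the signatures are assumptions):

```r
# Hypothetical shared helpers: apply a single-replicate metric function over
# a list of syntheses, then stack the per-replicate results into one tibble.

iterate_tabular_metrics <- function(metric_fn, replicates, data, ...) {
  lapply(replicates, function(synth) metric_fn(synth, data, ...))
}

reduce_tabular_metrics <- function(results) {
  dplyr::bind_rows(results, .id = "replicate")
}
```

Keeping iteration and reduction separate would let metrics that need a non-default reduction swap out only the second step.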
> Elyse worked on grouping for `util_corr_fit()`, which may help with syntax for non-tabular formats.
Will take a look - are there plans to incorporate similar group-by logic in other places? Also not sure about the current state / plans for that PR
I think that PR is close but it would need to be resurrected, which is something I haven't considered.
Be careful: tibbles are lists and could lead to some confusion in code you write.
I think we should create a `multipostsynth` class in tidysynthesis.
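A minimal S3 constructor for such a class could look like the sketch below (entirely hypothetical; tidysynthesis may prefer a different representation or validation):

```r
# Hypothetical S3 constructor: a thin wrapper around a list of postsynth
# objects, so evaluation functions can dispatch on class rather than
# guessing the input type.
multipostsynth <- function(replicates) {
  stopifnot(
    is.list(replicates),
    all(vapply(replicates, inherits, logical(1), "postsynth"))
  )
  structure(list(replicates = replicates), class = "multipostsynth")
}
```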
## Background
`syntheval` currently uses one replicate for each evaluation, which obscures the critical effect of randomness in assessing synthetic data disclosure risk and utility. This issue would update `syntheval` to work with multiple replicates, enabling empirical assessment of this randomness, independent of what form it might take. Here, we focus on updating existing metrics for collections of pointwise statistics, although working with multiple replicates introduces new possibilities for other metrics.

## Design changes
Currently, functions in `syntheval` accept either `postsynth` or `tibble`/`data.frame` inputs. There are two approaches we could take here:

1. Create new functions with a `_multirep` suffix (ex: `util_ci_overlap_multirep()`) that explicitly handle multiple replicate logic.
2. Modify the existing functions to accept `list[postsynth]` or `list[tibble]`/`list[data.frame]` inputs.
I'm personally in favor of option 1 for the following reasons:
Open to suggestions / feedback here!
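To make the trade-off concrete, the two call patterns might look like this. These are illustrative only: neither API exists yet, and the argument names are assumptions rather than `util_ci_overlap()`'s real signature.

```r
# Option 1: a separate, explicitly multi-replicate function
util_ci_overlap_multirep(list_of_synths, data = conf_data, formula = inc ~ .)

# Option 2: the existing function overloaded to accept a list of syntheses
util_ci_overlap(list_of_synths, data = conf_data, formula = inc ~ .)
```

Option 1 keeps each function's input contract simple (one input shape per function), while option 2 keeps the public API surface unchanged at the cost of type-dispatch logic inside every metric.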
## Pointwise statistic distributions
The following methods admit straightforward analogues using multiple replicates by producing distributions of a collection of pointwise statistics:
- `util_ci_overlap.R`
- `util_co_occurrence.R`
- `util_ks_distance.R`
- `util_moments.R`
- `util_percentiles.R`
- `util_proportions.R`
- `util_tails.R`
- `util_totals.R`
For each pointwise statistic (eventually a row) in the one-replicate case, we replace it with distributional summary statistics in the multiple replicate case. Here's an example for `util_moments()` output:

We can also include an optional argument (akin to `simplify = FALSE`) that simply returns the evaluation metric applied to each replicate.

Metric-specific considerations:
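One step common to all of these metrics is reducing the stacked per-replicate outputs into distributional summary rows. A rough sketch of that reduction, assuming `per_replicate` is the single-replicate `util_moments()` output bound together with a `replicate` id; the `variable`, `statistic`, and `difference` column names are assumptions about that output:

```r
# Hypothetical reduction: one row per variable/statistic, summarizing the
# distribution of the per-replicate `difference` values.
per_replicate |>
  dplyr::group_by(variable, statistic) |>
  dplyr::summarise(
    diff_mean = mean(difference),
    diff_sd   = sd(difference),
    diff_q05  = quantile(difference, 0.05),
    diff_q95  = quantile(difference, 0.95),
    .groups   = "drop"
  )
```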