Open awunderground opened 11 months ago
Thanks for putting this together. Some thoughts before implementing:
For bivariate and multivariate metrics, I think the easiest thing we can do is add impute_func and n_imput arguments that call an imputation method of choice (e.g., off-the-shelf mice boilerplate) and take the element-wise mean over the n_imput imputed datasets.
Note that this moves us closer to leveraging multiple replicates in syntheval, which is where I'd argue we should be going anyway (but that is beyond the scope of this PR).
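A rough sketch of how the impute_func and n_imput idea could work, in base R. The helper names (mean_over_imputations, impute_random_draw) and the metric_func argument are assumptions for illustration, not existing syntheval code, and the random-draw imputation is only a placeholder.

```r
# Hypothetical helper: average a matrix-valued metric over several imputed
# copies of the data. `impute_func` and `n_imput` mirror the proposed
# arguments; `metric_func` is an assumed stand-in for the metric itself.
mean_over_imputations <- function(data, impute_func, n_imput,
                                  metric_func = stats::cor) {
  # Impute n_imput times and compute the metric on each completed copy
  results <- lapply(seq_len(n_imput), function(i) metric_func(impute_func(data)))
  # Element-wise mean across the list of equally sized matrices
  Reduce(`+`, results) / n_imput
}

# Placeholder imputation: fill each NA with a random draw from the observed
# values of the same column (not a recommended method)
impute_random_draw <- function(data) {
  data[] <- lapply(data, function(col) {
    missing <- is.na(col)
    col[missing] <- sample(col[!missing], sum(missing), replace = TRUE)
    col
  })
  data
}

set.seed(1)
toy <- data.frame(x = c(1, 2, NA, 4, 5), y = c(2, NA, 6, 8, 11))
avg_corr <- mean_over_imputations(toy, impute_random_draw, n_imput = 5)
avg_corr
```

With mice, impute_func could presumably be something like function(d) mice::complete(mice::mice(d, m = 1, printFlag = FALSE)), though I have not tested that here.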
For discriminant metrics, relying on users to use library(recipes) is fine with me.
We could add some functionality to add NA as a category for categorical variables. Analysts can use step_indicate_na() to add predictors that capture the structure of the missingness.
Does this assume that users would input confidential and synthetic data already typed as factors? If so, they would have already had to specify whether to treat NA as a factor level, yes? Unless you're suggesting someone might want to use confidential data before this type conversion happens?
Univariate
This is easy for univariate metrics. Add na.rm as an argument and use complete cases.
util_moments()
util_totals()
util_proportions()
util_percentiles()
util_ks_distance()
util_tails()
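A minimal sketch of the na.rm pattern for a single variable. moments_sketch() is a hypothetical stand-in and does not reflect the actual signature of util_moments(); reporting n_missing alongside the statistics is an added assumption so the amount of dropped data stays visible.

```r
# Hypothetical univariate summary with an na.rm argument
moments_sketch <- function(x, na.rm = FALSE) {
  n_missing <- sum(is.na(x))
  # Complete cases only when the analyst asks for it
  if (na.rm) x <- x[!is.na(x)]
  c(n = length(x), n_missing = n_missing, mean = mean(x), sd = stats::sd(x))
}

conf_var <- c(1, 2, 3, NA, 10)
moments_sketch(conf_var)                # mean and sd are NA
moments_sketch(conf_var, na.rm = TRUE)  # computed on the 4 observed values
```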
Bivariate
This is a little trickier for bivariate metrics. In these cases, na.rm will need to be applied to two variables at the same time. It also means that each cell in a correlation/co-occurrence matrix could represent a different number of nonmissing observations, which will be tricky to communicate. Imagine a degenerate case where the correlation difference for a pair of variables is -0.9 but only two or three observations have nonmissing values for both variables. These types of issues will spill into the numeric summaries of the matrices too.
In these cases, we can store the missing data pattern and throw an error for low-frequency cells.
util_corr_fit()
util_co_occurrence()
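One way to store the per-cell counts and error on sparse cells, sketched in base R. pairwise_corr_with_n() and its min_n threshold are hypothetical names, not part of util_corr_fit() or util_co_occurrence().

```r
# Hypothetical helper: pairwise-complete correlations plus per-cell counts
pairwise_corr_with_n <- function(data, min_n = 10) {
  # 1 where a value is observed, 0 where it is missing
  observed <- 1 - is.na(as.matrix(data))
  # Cell (i, j) counts rows where variables i and j are both nonmissing
  n_pairs <- crossprod(observed)
  n_low <- sum(n_pairs[upper.tri(n_pairs)] < min_n)
  if (n_low > 0) {
    stop("Fewer than ", min_n, " jointly nonmissing observations for ",
         n_low, " variable pair(s)")
  }
  list(corr = stats::cor(data, use = "pairwise.complete.obs"), n = n_pairs)
}

toy <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, 1, NA, 5, 6, 8),
  c = c(NA, NA, NA, NA, 1, 2)
)

# a and b share four complete rows, so this passes with min_n = 3
ok <- pairwise_corr_with_n(toy[, c("a", "b")], min_n = 3)
ok$n

# Adding c creates pairs with only one or two complete rows, so this errors
msg <- tryCatch(pairwise_corr_with_n(toy, min_n = 3),
                error = function(e) conditionMessage(e))
msg
```

Returning the count matrix next to the correlations would let the numeric summaries be reported together with the number of observations behind each cell.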
Multivariate
This is trickiest for multivariate metrics.
Confidence Interval Overlap
Currently, this uses lm(), which has an na.action argument that takes a function for handling missing values. We should emulate this for now. This will really be up to the analyst since it is a specific utility metric.
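For reference, a small sketch of passing the missing-data choice through to lm(). ci_sketch() is a hypothetical wrapper, not the existing confidence interval overlap function; na.omit and na.fail are standard functions in the stats package.

```r
# Hypothetical wrapper that forwards the analyst's na.action choice to lm()
ci_sketch <- function(formula, data, na.action = stats::na.omit) {
  fit <- stats::lm(formula, data = data, na.action = na.action)
  stats::confint(fit)
}

set.seed(2)
conf <- data.frame(x = rnorm(50))
conf$y <- 1 + 2 * conf$x + rnorm(50)
conf$x[c(3, 17)] <- NA

# na.omit drops the two incomplete rows before fitting
ci_omit <- ci_sketch(y ~ x, conf, na.action = stats::na.omit)

# na.fail raises an error instead of silently dropping rows
ci_fail <- tryCatch(ci_sketch(y ~ x, conf, na.action = stats::na.fail),
                    error = function(e) e)

ci_omit
```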
Discriminant Metrics
We built a flexible system for these metrics and one view is that analysts should use library(recipes) code to account for and get around the NA values. We could add some functionality to add NA as a category for categorical variables. Analysts can use step_indicate_na() to add predictors that capture the structure of the missingness.
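If we do add the NA-as-a-category functionality ourselves, a base R version might look like the sketch below. na_as_level() and the "(Missing)" label are hypothetical; this is not recipes code.

```r
# Hypothetical helper: make NA an explicit level for every factor column
na_as_level <- function(data, na_label = "(Missing)") {
  data[] <- lapply(data, function(col) {
    if (is.factor(col) && anyNA(col)) {
      col <- addNA(col)                            # NA becomes the last level
      levels(col)[is.na(levels(col))] <- na_label  # give it a printable name
    }
    col
  })
  data
}

toy <- data.frame(
  race = factor(c("a", "b", NA, "a")),
  age = c(30, NA, 41, 52)
)
out <- na_as_level(toy)
levels(out$race)
```

Numeric columns are left untouched here, so missingness in those would still need to be handled in the recipe, for example with step_indicate_na().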