Add consistent `na.rm` options to functions

Univariate

This is easy for univariate metrics. Add `na.rm` as an argument and, when it is `TRUE`, compute the metric on complete cases only.

[ ] util_moments()
[ ] util_totals()
[ ] util_proportions()
[ ] util_percentiles()
[ ] util_ks_distance()
[ ] util_tails()
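
A minimal sketch of the univariate pattern, assuming a numeric vector as input. `moments_sketch()` is a hypothetical stand-in written for illustration; it is not the package's `util_moments()` and makes no claim about that function's actual signature or output.

```r
# Hypothetical helper (not the real util_moments()): a few moments of a
# numeric vector with an na.rm switch.
moments_sketch <- function(x, na.rm = FALSE) {
  # Count missing values before any are dropped so the caller can see them.
  n_missing <- sum(is.na(x))
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  data.frame(
    n = length(x),
    n_missing = n_missing,
    mean = mean(x),
    sd = stats::sd(x)
  )
}

x <- c(1, 2, NA, 4)
no_rm <- moments_sketch(x)                  # mean and sd propagate NA
with_rm <- moments_sketch(x, na.rm = TRUE)  # uses the 3 nonmissing values
```

Reporting `n_missing` alongside the metric keeps the amount of dropped data visible even when `na.rm = TRUE`.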

Bivariate

This is a little trickier for bivariate metrics. In these cases, `na.rm` will need to be applied to two variables at the same time. It also means that each cell in a correlation/co-occurrence matrix could be based on a different number of nonmissing observations, which will be tricky to communicate.

Imagine a degenerate case where the correlation difference for a pair of variables is -0.9 but only two or three observations have nonmissing values for both variables. These types of issues will spill into the numeric summaries of the matrices too.

In these cases, we can store the missing data pattern and throw an error for low-frequency cells.

[ ] util_corr_fit()
[ ] util_co_occurrence()
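
One way the idea of storing the missing data pattern and erroring on low-frequency cells could look, as a sketch only. `pairwise_cor_sketch()` and its `min_n` threshold are hypothetical names chosen for illustration, and the default of 10 is an arbitrary assumption rather than a recommendation.

```r
# Hypothetical helper: pairwise-complete correlations plus the number of
# observations behind each cell of the matrix.
pairwise_cor_sketch <- function(data, min_n = 10) {
  x <- as.matrix(data)
  observed <- !is.na(x)
  # n_pairs[i, j] = number of rows where columns i and j are both observed
  n_pairs <- crossprod(observed * 1)
  low <- n_pairs < min_n
  if (any(low)) {
    stop(
      sum(low[upper.tri(low, diag = TRUE)]),
      " cell(s) have fewer than ", min_n, " complete pairs",
      call. = FALSE
    )
  }
  list(
    cor = stats::cor(x, use = "pairwise.complete.obs"),
    n_pairs = n_pairs
  )
}

set.seed(1)
d <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
d$b[1:5] <- NA
d$c[4:15] <- NA

res <- pairwise_cor_sketch(d, min_n = 5)
res$n_pairs  # per-cell counts of jointly observed rows

# With the stricter default threshold, the sparse cells trigger an error.
strict <- try(pairwise_cor_sketch(d), silent = TRUE)
```

Returning `n_pairs` next to the correlation matrix gives downstream summaries a way to weight or flag cells that rest on few observations.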

Multivariate

This is trickiest for multivariate metrics.

[ ] `util_ci_overlap()`
[ ] Discriminator process

Confidence Interval Overlap

Currently, this uses lm(), which has an `na.action` argument. From the lm() documentation:

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

We should emulate this for now. The choice will really be up to the analyst, since this is a specific utility metric.
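
For reference, a small illustration of how the `na.action` values quoted above behave in base R's `lm()`. The data frame is made up for the example; how `util_ci_overlap()` would expose this option is not shown here.

```r
# Toy data with one missing response and one missing predictor.
d <- data.frame(
  y = c(1.2, 2.1, NA, 4.3, 5.1),
  x = c(1, 2, 3, NA, 5)
)

# na.omit drops incomplete rows; residuals cover only the rows used.
fit_omit <- lm(y ~ x, data = d, na.action = na.omit)

# na.exclude fits the same model but pads residuals and fitted values
# with NA so they line up with the original rows.
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))
length(residuals(fit_excl))

# na.fail refuses to fit when any NA is present.
fit_fail <- try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE)
```

The coefficients from `na.omit` and `na.exclude` are identical; only the shape of the residuals and predictions differs.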

Discriminant Metrics

We built a flexible system for these metrics, and one view is that analysts should use library(recipes) code to account for and handle the NA values themselves.

We could add some functionality to add NA as a category for categorical variables. Analysts can use step_indicate_na() to add predictors that capture the structure of the missingness.
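
A sketch of what that analyst-side recipe might look like, assuming the recipes package is installed. The toy data and the `is_synthetic` outcome name are made up for illustration; `step_unknown()` is used here as one existing way to turn missing factor values into their own level, which is a swapped-in choice and not something the package does today.

```r
library(recipes)

# Toy stacked data: an outcome flagging synthetic rows, plus predictors
# with missing values.
d <- data.frame(
  is_synthetic = factor(c("yes", "no", "yes", "no")),
  income = c(10, NA, 30, 40),
  region = factor(c("a", NA, "b", "a"))
)

rec <- recipe(is_synthetic ~ ., data = d) |>
  # Add indicator columns recording where each predictor was missing.
  step_indicate_na(all_predictors()) |>
  # Recode missing factor values as an explicit level.
  step_unknown(all_nominal_predictors())

baked <- bake(prep(rec), new_data = NULL)
names(baked)
levels(baked$region)
```

The indicator step comes first so that it records the missingness before `step_unknown()` replaces the missing factor values.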

Thanks for putting this together. Some thoughts before implementing:

For bivariate+multivariate metrics, I think the easiest things we can do are:
1. By default, use complete cases, raise a warning that names the variables with missing data, and report the proportion of rows dropped.
2. Accept optional impute_func and n_imput arguments that call an imputation method of choice (e.g., off-the-shelf mice boilerplate) and take the element-wise mean of the metric over the n_imput imputed datasets.
Note that this moves us closer to leveraging multiple replicates in syntheval, which is where I'd argue we should be going anyways (but beyond the scope of this PR).
For discriminant metrics, I think relying on users to use library(recipes) is fine with me.
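
The two options above could be sketched roughly as follows. `with_missing_sketch()`, its `metric` argument, and the toy `impute_draw()` imputer are hypothetical and exist only for illustration; `impute_func` and `n_imput` follow the names proposed in this comment, and a real implementation would more likely plug in something like mice than the naive resampling used here.

```r
# Hypothetical wrapper; `metric` is any function that takes a data frame
# and returns a numeric matrix (e.g. stats::cor).
with_missing_sketch <- function(data, metric, impute_func = NULL, n_imput = 5) {
  if (is.null(impute_func)) {
    # Option 1: complete cases, with a warning describing what was dropped.
    keep <- stats::complete.cases(data)
    if (!all(keep)) {
      has_na <- names(data)[colSums(is.na(data)) > 0]
      warning(
        "Dropped ", round(100 * mean(!keep), 1), "% of rows; missing values in: ",
        paste(has_na, collapse = ", "),
        call. = FALSE
      )
    }
    return(metric(data[keep, , drop = FALSE]))
  }
  # Option 2: impute n_imput times and average the metric element-wise.
  results <- lapply(seq_len(n_imput), function(i) metric(impute_func(data)))
  Reduce(`+`, results) / n_imput
}

# Toy stochastic imputer (assumption: stands in for a real method).
# Fills each missing value with a random draw from the observed values.
impute_draw <- function(data) {
  for (col in names(data)) {
    miss <- is.na(data[[col]])
    if (any(miss)) {
      observed <- data[[col]][!miss]
      idx <- sample.int(length(observed), sum(miss), replace = TRUE)
      data[[col]][miss] <- observed[idx]
    }
  }
  data
}

set.seed(2)
d <- data.frame(a = rnorm(30), b = rnorm(30))
d$a[1:3] <- NA

cc <- suppressWarnings(with_missing_sketch(d, stats::cor))
mi <- with_missing_sketch(d, stats::cor, impute_func = impute_draw, n_imput = 10)
```

Averaging the metric across imputations gives a point estimate only; it does not carry the between-imputation variance.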

We could add some functionality to add NA as a category for categorical variables. Analysts can use step_indicate_na() to add predictors that capture the structure of the missingness.

Does this assume that users would input confidential and synthetic data already typed as factors? If so, they would have already had to specify whether to treat NA as a factor, yes? Unless you're suggesting someone might want to use confidential data before this type conversion happens?
