UrbanInstitute / syntheval

GNU Affero General Public License v3.0

Add consistent `na.rm` options to functions #30

Open awunderground opened 11 months ago

awunderground commented 11 months ago

Univariate

This is easy for univariate metrics. Add na.rm as an argument and use complete cases.
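As a minimal sketch of the univariate pattern, assuming a hypothetical metric `mean_diff()` (not an existing syntheval function) that compares one variable across the confidential and synthetic data:

```r
# Hypothetical helper for illustration; not part of the syntheval API.
# Difference in means between a synthetic and a confidential vector.
mean_diff <- function(conf, synth, na.rm = FALSE) {
  if (na.rm) {
    # Keep only the nonmissing values of each vector
    conf <- conf[!is.na(conf)]
    synth <- synth[!is.na(synth)]
  }
  mean(synth) - mean(conf)
}

conf <- c(1, 2, 3, NA)
synth <- c(2, 3, 4, 5)

mean_diff(conf, synth)                # NA, because conf contains a missing value
mean_diff(conf, synth, na.rm = TRUE)  # 1.5
```

Defaulting to `na.rm = FALSE` mirrors base R functions such as `mean()`, so missing values surface as `NA` unless the analyst opts in.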

Bivariate

This is a little trickier for bivariate metrics. In these cases, na.rm will need to be applied to two variables at the same time, so only observations where both variables are nonmissing are kept. It also means that each cell in a correlation/co-occurrence matrix could represent a different number of nonmissing observations, which will be tricky to communicate.

Imagine a degenerate case where the correlation difference for a pair of variables is -0.9 but only two or three observations have nonmissing values for both variables. These types of issues will spill into the numeric summaries of the matrices too.

In these cases, we can store the missing data pattern and throw an error for low-frequency cells.
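One way this could look, as a rough sketch: count the jointly observed rows for every pair of variables, keep that matrix alongside the correlations, and stop when any cell falls below a threshold. The helper names `pairwise_n()` and `pairwise_cor()` and the `min_n` argument are hypothetical and not part of syntheval.

```r
# Hypothetical helpers for illustration; not part of the syntheval API.

# Count, for each pair of columns, the rows where both values are observed.
pairwise_n <- function(data) {
  observed <- !is.na(as.matrix(data))
  storage.mode(observed) <- "double"
  # crossprod() of the 0/1 indicator matrix gives the joint counts
  crossprod(observed)
}

# Pairwise-complete correlations that refuse to run on low-frequency cells.
pairwise_cor <- function(data, min_n = 10) {
  n <- pairwise_n(data)
  if (any(n < min_n)) {
    stop(
      "Some variable pairs have fewer than ", min_n,
      " complete observations.",
      call. = FALSE
    )
  }
  list(cor = stats::cor(data, use = "pairwise.complete.obs"), n = n)
}

df <- data.frame(
  x = c(1, 2, 3, 4, NA, NA),
  y = c(2, 1, 4, 3, 5, NA),
  z = c(NA, NA, NA, 1, 2, 3)
)

n <- pairwise_n(df)
n["x", "z"]  # 1: only one row has both x and z observed

# The x-z and y-z cells are below the threshold, so this call errors
msg <- tryCatch(
  pairwise_cor(df, min_n = 3),
  error = function(e) conditionMessage(e)
)

# Restricting to x and y passes the check (4 complete pairs)
ok <- pairwise_cor(df[, c("x", "y")], min_n = 3)
```

Returning the count matrix with the correlations lets downstream summaries report how many observations sit behind each cell.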

Multivariate

This is trickiest for multivariate metrics.

Confidence Interval Overlap

Currently, this uses lm(), which has an na.action argument:

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.

We should emulate this for now. How to handle missing values will really be up to the analyst, since this is a specific utility metric.
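A small sketch of what emulating this might look like: a wrapper that accepts `na.action` and passes it straight through to `lm()`. The wrapper name `fit_model()` is hypothetical; `na.omit`, `na.exclude`, and `na.fail` are the standard functions from the stats package.

```r
# Hypothetical wrapper for illustration; not part of the syntheval API.
# The na.action argument is forwarded to lm() unchanged.
fit_model <- function(formula, data, na.action = stats::na.omit) {
  stats::lm(formula, data = data, na.action = na.action)
}

d <- data.frame(
  y = c(1, 2, NA, 4, 5),
  x = c(1, 2, 3, NA, 5)
)

# na.omit drops incomplete rows; residuals cover only the complete cases
fit_omit <- fit_model(y ~ x, d, na.action = stats::na.omit)
length(residuals(fit_omit))     # 3

# na.exclude also drops them for fitting, but pads residuals with NA
fit_exclude <- fit_model(y ~ x, d, na.action = stats::na.exclude)
length(residuals(fit_exclude))  # 5

# na.fail stops with an error when any NA is present
failed <- tryCatch(
  fit_model(y ~ x, d, na.action = stats::na.fail),
  error = function(e) TRUE
)
```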

Discriminant Metrics

We built a flexible system for these metrics, and one view is that analysts should use library(recipes) code to account for and work around the NAs.

We could add some functionality to add NA as a category for categorical variables. Analysts can use step_indicate_na() to add predictors that capture the structure of the missingness.
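For illustration, both ideas can already be expressed with existing recipes steps. This is a sketch on toy data, assuming the recipes package is installed. The data frame, the `source_label` outcome name, and the `"missing"` level name are made up for the example.

```r
library(recipes)

# Toy stacked data; column names are hypothetical
d <- data.frame(
  source_label = factor(c("original", "original", "synthetic", "synthetic")),
  income = c(10, NA, 12, 11),
  sector = factor(c("a", NA, "b", "a"))
)

rec <- recipe(source_label ~ ., data = d)

# Add 0/1 indicator columns (prefix "na_ind_") for missingness in each predictor
rec <- step_indicate_na(rec, all_predictors())

# Recode NA in categorical predictors as its own level
rec <- step_unknown(rec, all_nominal_predictors(), new_level = "missing")

baked <- bake(prep(rec), new_data = NULL)

baked$na_ind_income   # 0 1 0 0
levels(baked$sector)  # includes "missing"
```

The indicator step comes first so the missingness pattern is recorded before `step_unknown()` recodes it. The numeric `income` column still contains an NA here, so an imputation step would be needed before fitting most models.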

jhseeman commented 2 months ago

Thanks for putting this together. Some thoughts here before implementing.