UrbanInstitute / syntheval

GNU Affero General Public License v3.0

syntheval

syntheval makes it simple to evaluate the utility and disclosure risks of synthetic data. The package is designed to work with postsynth objects from library(tidysynthesis) but also works well with any data frame in R.

Note: library(tidysynthesis) is currently under private development but will be made public in the future.


Installation

install.packages("remotes")
remotes::install_github("UrbanInstitute/syntheval")

Utility Metrics

library(tidyverse)
library(syntheval)

Setup

The following examples demonstrate utility and disclosure risk metrics using synthetic data based on the Palmer Penguins dataset. library(syntheval) contains three built-in data sets:

penguins_conf: a confidential version of the penguins data, used as the original data in the examples below
penguins_postsynth: a postsynth object containing a synthesis of the penguins data
penguins_syn_df: a data frame containing the synthetic penguins data
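The examples below use all three. To get a feel for the data after loading the package, you can peek at the confidential data and the synthetic data frame:

# inspect the built-in data sets
head(penguins_conf)
head(penguins_syn_df)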

Functions like util_proportions() and util_moments() have different behaviors for postsynth objects and data frames. By default, they only show synthesized variables for postsynth objects and show all common variables for data frames. The common_vars and synth_vars arguments can change this behavior.
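For example, a minimal sketch, assuming common_vars and synth_vars are logical flags as their names suggest (their exact semantics may differ), that asks a postsynth object to report all common variables rather than only the synthesized ones:

# assumption: common_vars/synth_vars toggle which variables are reported
util_moments(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  common_vars = TRUE,
  synth_vars = FALSE
)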

Proportions

util_proportions() compares the proportions of classes from categorical variables in the original data and synthetic data.

util_proportions(
  postsynth = penguins_postsynth, 
  data = penguins_conf
)
# A tibble: 2 × 5
  variable class  synthetic original difference
  <chr>    <fct>      <dbl>    <dbl>      <dbl>
1 sex      female     0.529    0.495     0.0330
2 sex      male       0.471    0.505    -0.0330

All common variables are shown when using a data frame.

util_proportions(
  postsynth = penguins_syn_df, 
  data = penguins_conf
)
# A tibble: 8 × 5
  variable class     synthetic original difference
  <chr>    <fct>         <dbl>    <dbl>      <dbl>
1 island   Biscoe        0.465    0.489    -0.0240
2 island   Dream         0.414    0.369     0.0450
3 island   Torgersen     0.120    0.141    -0.0210
4 sex      female        0.529    0.495     0.0330
5 sex      male          0.471    0.505    -0.0330
6 species  Adelie        0.459    0.438     0.0210
7 species  Chinstrap     0.234    0.204     0.0300
8 species  Gentoo        0.306    0.357    -0.0511

Means and Totals

util_moments() compares the counts, means, standard deviations, skewnesses, and kurtoses of the original data and synthetic data.

util_moments(
  postsynth = penguins_postsynth, 
  data = penguins_conf
)
# A tibble: 20 × 6
   variable        statistic original synthetic difference proportion_difference
   <fct>           <fct>        <dbl>     <dbl>      <dbl>                 <dbl>
 1 bill_length_mm  count      3.33e+2  333          0                    0      
 2 bill_length_mm  mean       4.40e+1   43.5       -0.502               -0.0114 
 3 bill_length_mm  sd         5.47e+0    5.54       0.0723               0.0132 
 4 bill_length_mm  skewness   4.51e-2    0.0646     0.0195               0.432  
 5 bill_length_mm  kurtosis  -8.88e-1   -0.948     -0.0598               0.0674 
 6 bill_depth_mm   count      3.33e+2  333          0                    0      
 7 bill_depth_mm   mean       1.72e+1   17.3        0.122                0.00712
 8 bill_depth_mm   sd         1.97e+0    1.89      -0.0762              -0.0387 
 9 bill_depth_mm   skewness  -1.49e-1   -0.278     -0.129                0.867  
10 bill_depth_mm   kurtosis  -8.97e-1   -0.742      0.155               -0.172  
11 flipper_length… count      3.33e+2  333          0                    0      
12 flipper_length… mean       2.01e+2  199.        -1.70                -0.00847
13 flipper_length… sd         1.40e+1   13.9       -0.135               -0.00961
14 flipper_length… skewness   3.59e-1    0.611      0.253                0.705  
15 flipper_length… kurtosis  -9.65e-1   -0.704      0.261               -0.270  
16 body_mass_g     count      3.33e+2  333          0                    0      
17 body_mass_g     mean       4.21e+3 4162.       -45.0                 -0.0107 
18 body_mass_g     sd         8.05e+2  783.       -22.3                 -0.0277 
19 body_mass_g     skewness   4.70e-1    0.655      0.185                0.394  
20 body_mass_g     kurtosis  -7.40e-1   -0.388      0.353               -0.477  

util_totals() is similar to util_moments() but looks at counts and totals.

util_totals(
  postsynth = penguins_postsynth, 
  data = penguins_conf
)
# A tibble: 8 × 6
  variable         statistic original synthetic difference proportion_difference
  <fct>            <fct>        <dbl>     <dbl>      <dbl>                 <dbl>
1 bill_length_mm   count         333       333         0                 0      
2 bill_length_mm   total       14650.    14483.     -167.               -0.0114 
3 bill_depth_mm    count         333       333         0                 0      
4 bill_depth_mm    total        5716.     5757.       40.7               0.00712
5 flipper_length_… count         333       333         0                 0      
6 flipper_length_… total       66922     66355      -567                -0.00847
7 body_mass_g      count         333       333         0                 0      
8 body_mass_g      total     1400950   1385950    -15000                -0.0107 

Percentiles

util_percentiles() compares percentiles from the original data and synthetic data. The default percentiles are c(0.1, 0.5, 0.9) and can easily be overridden.

util_percentiles(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  probs = c(0.5, 0.8)
)
# A tibble: 8 × 6
      p variable          original synthetic difference proportion_difference
  <dbl> <fct>                <dbl>     <dbl>      <dbl>                 <dbl>
1   0.5 bill_length_mm        44.5      43.5    -1                   -0.0225 
2   0.8 bill_length_mm        49.5      49.1    -0.440               -0.00889
3   0.5 bill_depth_mm         17.3      17.5     0.200                0.0116 
4   0.8 bill_depth_mm         18.9      19.0     0.0600               0.00317
5   0.5 flipper_length_mm    197       195      -2                   -0.0102 
6   0.8 flipper_length_mm    215       214      -1                   -0.00465
7   0.5 body_mass_g         4050      3950    -100                   -0.0247 
8   0.8 body_mass_g         4990      4850    -140                   -0.0281 

The functions are designed to work well with library(ggplot2).

util_percentiles(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  probs = seq(0.01, 0.99, 0.01)
) |>
  pivot_longer(
    cols = c(original, synthetic),
    names_to = "source",
    values_to = "value"
  ) |>
  ggplot(aes(x = p, y = value, color = source)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free")

KS Distance

util_ks_distance() shows the Kolmogorov-Smirnov distance between the original distribution and synthetic distribution for numeric variables. The function also returns the point(s) of the maximum distance.

util_ks_distance(
  postsynth = penguins_syn_df, 
  data = penguins_conf
)
# A tibble: 14 × 3
   variable           value      D
   <chr>              <dbl>  <dbl>
 1 bill_length_mm      38.7 0.0601
 2 bill_depth_mm       16.7 0.0511
 3 bill_depth_mm       16.7 0.0511
 4 bill_depth_mm       16.8 0.0511
 5 bill_depth_mm       16.8 0.0511
 6 flipper_length_mm  196.  0.0781
 7 flipper_length_mm  196.  0.0781
 8 flipper_length_mm  197.  0.0781
 9 flipper_length_mm  197.  0.0781
10 flipper_length_mm  197.  0.0781
11 body_mass_g       4359.  0.0480
12 body_mass_g       4370.  0.0480
13 body_mass_g       4381.  0.0480
14 body_mass_g       4392.  0.0480

Co-Occurrence

util_co_occurrence() differences the lower triangles of co-occurrence matrices calculated on numeric variables in the original data and synthetic data.

co_occurrence <- util_co_occurrence(
  postsynth = penguins_postsynth, 
  data = penguins_conf
)

co_occurrence$co_occurrence_difference
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm                NA            NA                NA          NA
bill_depth_mm                  0            NA                NA          NA
flipper_length_mm              0             0                NA          NA
body_mass_g                    0             0                 0          NA

The function returns the mean absolute error (MAE) of the co-occurrence differences, which summarizes the typical size of the error between the original data and synthetic data. The function also returns the root mean square error (RMSE), which likewise averages the errors but gives more weight to large ones.

co_occurrence$co_occurrence_difference_mae
[1] 0
co_occurrence$co_occurrence_difference_rmse
[1] 0
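These summary statistics can be reproduced by hand from the difference matrix returned above; a minimal sketch, relying on the upper triangle being NA:

# recompute the MAE and RMSE directly from the difference matrix
diff_mat <- co_occurrence$co_occurrence_difference
mean(abs(diff_mat), na.rm = TRUE)    # MAE
sqrt(mean(diff_mat^2, na.rm = TRUE)) # RMSE

Both are 0 here, matching the values returned by the function.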

All observations have non-zero bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. util_co_occurrence() is most useful for economic variables like income and wealth where 0 is a common value.

Correlations

util_corr_fit() differences the lower triangles of correlation matrices calculated on numeric variables in the original data and synthetic data.

corr_fit <- util_corr_fit(
  postsynth = penguins_postsynth, 
  data = penguins_conf
)

round(corr_fit$correlation_difference, digits = 3)
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm                NA            NA                NA          NA
bill_depth_mm             -0.069            NA                NA          NA
flipper_length_mm         -0.003        -0.048                NA          NA
body_mass_g                0.031        -0.011             -0.08          NA

The function returns the MAE for the correlation coefficients, which summarizes the typical size of the error between the original data and synthetic data. The function also returns the RMSE for the correlation coefficients, which gives more weight to large errors.

corr_fit$correlation_difference_mae
[1] 0.04034565
corr_fit$correlation_difference_rmse
[1] 0.04922324

Coefficient Overlap

util_ci_overlap() compares linear regression models estimated on the original data and synthetic data. The formula argument specifies the functional form of the regression model.

ci_overlap <- util_ci_overlap(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  formula = body_mass_g ~ bill_length_mm  + sex
)

$ci_overlap summarizes each coefficient, including how much the confidence intervals overlap, whether the signs match, and whether the statistical significance matches.

ci_overlap$ci_overlap 
# A tibble: 3 × 8
  term    overlap coef_diff std_coef_diff sign_match significance_match ss_match
  <chr>     <dbl>     <dbl>         <dbl> <lgl>      <lgl>              <lgl>   
1 (Inter…   0.963   -27.8         -0.0978 TRUE       TRUE               TRUE    
2 bill_l…   0.965     0.917        0.138  TRUE       TRUE               TRUE    
3 sexmale   0.951   -14.0         -0.192  TRUE       TRUE               TRUE    
# ℹ 1 more variable: sso_match <lgl>

$coefficient provides detail for each coefficient and is useful for data visualization.

ci_overlap$coefficient
# A tibble: 6 × 8
  source    term        estimate std.error statistic  p.value conf.low conf.high
  <chr>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 original  (Intercept)    746.     285.        2.62 9.22e- 3    186.     1306. 
2 original  bill_lengt…     74.0      6.67     11.1  1.51e-24     60.9      87.1
3 original  sexmale        405.      72.8       5.57 5.43e- 8    262.      548. 
4 synthetic (Intercept)    718.     264.        2.72 6.77e- 3    200.     1237. 
5 synthetic bill_lengt…     74.9      6.25     12.0  8.98e-28     62.7      87.2
6 synthetic sexmale        391.      69.2       5.65 3.42e- 8    255.      527. 

ci_overlap$coefficient |>
  ggplot(aes(x = estimate, xmin = conf.low, xmax = conf.high, y = term, color = source)) +
  geom_pointrange(alpha = 0.5, position = position_dodge(width = 0.5)) +
  labs(
    title = "The Synthesizer Recreates the Point Estimates and Confidence Intervals",
    subtitle = "Regression Confidence Interval Overlap"
  )

Discriminant-Based Metrics

Discriminant-based metrics build models to predict if an observation is original or synthetic and then evaluate those model predictions. Ideally, it should be difficult for a model to distinguish, or discriminate, between original observations and synthetic observations.

Example Using Decision Trees

Discriminant-based metrics are built on a discrimination object created by discrimination().

disc1 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)

Next, we use library(tidymodels) to specify a model. We recommend the tidymodels tutorial to learn more.

library(tidymodels)

rpart_rec <- recipe(
  .source_label ~ ., 
  data = disc1$combined_data
)

rpart_mod <- decision_tree(cost_complexity = 0.01) |>
  set_mode(mode = "classification") |>
  set_engine(engine = "rpart")

Next, we fit the model to the data to generate predicted probabilities.

disc1 <- disc1 |>
  add_propensities(
    recipe = rpart_rec,
    spec = rpart_mod
  ) 

At this point, we can use add_discriminator_auc(), add_specks(), add_pmse(), and add_pmse_ratio() to calculate the discriminant-based metrics.

disc1 |>
  add_discriminator_auc() |>
  add_specks() |>
  add_pmse() |>
  add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
   .source_label species island    sex    bill_length_mm bill_depth_mm
   <fct>         <fct>   <fct>     <fct>           <dbl>         <dbl>
 1 original      Adelie  Torgersen male             39.1          18.7
 2 original      Adelie  Torgersen female           39.5          17.4
 3 original      Adelie  Torgersen female           40.3          18  
 4 original      Adelie  Torgersen female           36.7          19.3
 5 original      Adelie  Torgersen male             39.3          20.6
 6 original      Adelie  Torgersen female           38.9          17.8
 7 original      Adelie  Torgersen male             39.2          19.6
 8 original      Adelie  Torgersen female           41.1          17.6
 9 original      Adelie  Torgersen male             38.6          21.2
10 original      Adelie  Torgersen male             34.6          21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>

$propensities
# A tibble: 666 × 10
   .pred_synthetic .source_label .sample  species island    sex   bill_length_mm
             <dbl> <fct>         <chr>    <fct>   <fct>     <fct>          <dbl>
 1          0.0556 original      training Adelie  Torgersen male            39.1
 2          0.355  original      training Adelie  Torgersen fema…           39.5
 3          0.636  original      training Adelie  Torgersen fema…           40.3
 4          0.375  original      training Adelie  Torgersen fema…           36.7
 5          0.76   original      testing  Adelie  Torgersen male            39.3
 6          0.355  original      training Adelie  Torgersen fema…           38.9
 7          0.375  original      training Adelie  Torgersen male            39.2
 8          0.355  original      training Adelie  Torgersen fema…           41.1
 9          0.4    original      training Adelie  Torgersen male            38.6
10          0.789  original      testing  Adelie  Torgersen male            34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
#   body_mass_g <dbl>

$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
n= 498 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

   1) root 498 249 synthetic (0.50000000 0.50000000)  
     2) bill_length_mm< 34.8 19   4 synthetic (0.78947368 0.21052632) *
     3) bill_length_mm>=34.8 479 234 original (0.48851775 0.51148225)  
       6) bill_depth_mm>=18.85 99  44 synthetic (0.55555556 0.44444444)  
        12) bill_length_mm< 50.35 85  32 synthetic (0.62352941 0.37647059)  
          24) species=Chinstrap 21   2 synthetic (0.90476190 0.09523810) *
          25) species=Adelie 64  30 synthetic (0.53125000 0.46875000)  
            50) flipper_length_mm< 192.5 40  15 synthetic (0.62500000 0.37500000)  
             100) bill_length_mm>=38.9 25   6 synthetic (0.76000000 0.24000000) *
             101) bill_length_mm< 38.9 15   6 original (0.40000000 0.60000000) *
            51) flipper_length_mm>=192.5 24   9 original (0.37500000 0.62500000) *
        13) bill_length_mm>=50.35 14   2 original (0.14285714 0.85714286) *
       7) bill_depth_mm< 18.85 380 179 original (0.47105263 0.52894737)  
        14) bill_length_mm>=50.55 45  17 synthetic (0.62222222 0.37777778) *
        15) bill_length_mm< 50.55 335 151 original (0.45074627 0.54925373)  
          30) bill_length_mm< 46.05 218 107 original (0.49082569 0.50917431)  
            60) bill_depth_mm< 18.25 177  83 synthetic (0.53107345 0.46892655)  
             120) bill_depth_mm>=13.75 169  77 synthetic (0.54437870 0.45562130)  
               240) bill_length_mm>=36.55 142  61 synthetic (0.57042254 0.42957746)  
                 480) bill_length_mm< 38.35 27   8 synthetic (0.70370370 0.29629630) *
                 481) bill_length_mm>=38.35 115  53 synthetic (0.53913043 0.46086957)  
                   962) bill_length_mm>=41.65 73  29 synthetic (0.60273973 0.39726027) *
                   963) bill_length_mm< 41.65 42  18 original (0.42857143 0.57142857)  
                    1926) flipper_length_mm>=193.5 11   4 synthetic (0.63636364 0.36363636) *
                    1927) flipper_length_mm< 193.5 31  11 original (0.35483871 0.64516129) *
               241) bill_length_mm< 36.55 27  11 original (0.40740741 0.59259259) *
             121) bill_depth_mm< 13.75 8   2 original (0.25000000 0.75000000) *
            61) bill_depth_mm>=18.25 41  13 original (0.31707317 0.68292683)  
             122) flipper_length_mm>=191.5 23  11 synthetic (0.52173913 0.47826087)  
               244) bill_depth_mm>=18.55 10   2 synthetic (0.80000000 0.20000000) *
               245) bill_depth_mm< 18.55 13   4 original (0.30769231 0.69230769) *
             123) flipper_length_mm< 191.5 18   1 original (0.05555556 0.94444444) *
          31) bill_length_mm>=46.05 117  44 original (0.37606838 0.62393162)  
            62) flipper_length_mm>=229.5 9   3 synthetic (0.66666667 0.33333333) *
            63) flipper_length_mm< 229.5 108  38 original (0.35185185 0.64814815)  
             126) flipper_length_mm< 214.5 54  24 original (0.44444444 0.55555556)  
               252) body_mass_g>=3925 26  10 synthetic (0.61538462 0.38461538) *
               253) body_mass_g< 3925 28   8 original (0.28571429 0.71428571) *
             127) flipper_length_mm>=214.5 54  14 original (0.25925926 0.74074074) *

$discriminator_auc
# A tibble: 2 × 4
  .sample  .metric .estimator .estimate
  <fct>    <chr>   <chr>          <dbl>
1 training roc_auc binary         0.743
2 testing  roc_auc binary         0.461

$pmse
# A tibble: 2 × 4
  .source   .pmse .null_pmse .pmse_ratio
  <fct>     <dbl>      <dbl>       <dbl>
1 training 0.0465     0.0327        1.42
2 testing  0.0483     0.0325        1.49

$specks
# A tibble: 2 × 2
  .source  .specks
  <fct>      <dbl>
1 training  0.386 
2 testing   0.0833

attr(,"class")
[1] "discrimination"

Finally, we can look at variable importance and the decision tree from our discriminator.

library(vip)
library(rpart.plot)

disc1$discriminator |> 
  extract_fit_parsnip() |> 
  vip()

disc1$discriminator$fit$fit$fit |>
  prp()
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
    Call prp with roundint=FALSE,
    or rebuild the rpart model with model=TRUE.
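As the warning message itself suggests, passing roundint = FALSE silences it:

# silence the rpart.plot warning per its own advice
disc1$discriminator$fit$fit$fit |>
  prp(roundint = FALSE)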

Example Using Regularized Regression

Let’s repeat the workflow from above with LASSO logistic regression and hyperparameter tuning.

# create discrimination
disc2 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)

# create a recipe that includes 2nd-degree polynomials, dummy variables, and
# standardization
lasso_rec <- recipe(
  .source_label ~ ., 
  data = disc2$combined_data
) |>
  step_poly(all_numeric_predictors(), degree = 2) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

# create the model
lasso_mod <- logistic_reg(
  penalty = tune(), 
  mixture = 1
) |>
  set_engine(engine = "glmnet") |>
  set_mode(mode = "classification")

# create a tuning grid
lasso_grid <- grid_regular(penalty(), levels = 10)

# add the propensities
disc2 <- disc2 |>
  add_propensities_tuned(
    recipe = lasso_rec,
    spec = lasso_mod,
    grid = lasso_grid
  ) 

# calculate metrics
disc2 |>
  add_discriminator_auc() |>
  add_specks() |>
  add_pmse() |>
  add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
   .source_label species island    sex    bill_length_mm bill_depth_mm
   <fct>         <fct>   <fct>     <fct>           <dbl>         <dbl>
 1 original      Adelie  Torgersen male             39.1          18.7
 2 original      Adelie  Torgersen female           39.5          17.4
 3 original      Adelie  Torgersen female           40.3          18  
 4 original      Adelie  Torgersen female           36.7          19.3
 5 original      Adelie  Torgersen male             39.3          20.6
 6 original      Adelie  Torgersen female           38.9          17.8
 7 original      Adelie  Torgersen male             39.2          19.6
 8 original      Adelie  Torgersen female           41.1          17.6
 9 original      Adelie  Torgersen male             38.6          21.2
10 original      Adelie  Torgersen male             34.6          21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>

$propensities
# A tibble: 666 × 10
   .pred_synthetic .source_label .sample  species island    sex   bill_length_mm
             <dbl> <fct>         <chr>    <fct>   <fct>     <fct>          <dbl>
 1             0.5 original      training Adelie  Torgersen male            39.1
 2             0.5 original      training Adelie  Torgersen fema…           39.5
 3             0.5 original      training Adelie  Torgersen fema…           40.3
 4             0.5 original      training Adelie  Torgersen fema…           36.7
 5             0.5 original      testing  Adelie  Torgersen male            39.3
 6             0.5 original      testing  Adelie  Torgersen fema…           38.9
 7             0.5 original      training Adelie  Torgersen male            39.2
 8             0.5 original      training Adelie  Torgersen fema…           41.1
 9             0.5 original      training Adelie  Torgersen male            38.6
10             0.5 original      testing  Adelie  Torgersen male            34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
#   body_mass_g <dbl>

$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_poly()
• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial",      alpha = ~1) 

   Df %Dev   Lambda
1   0 0.00 0.041780
2   1 0.09 0.038070
3   1 0.16 0.034680
4   1 0.22 0.031600
5   2 0.27 0.028800
6   2 0.36 0.026240
7   2 0.43 0.023910
8   4 0.51 0.021780
9   4 0.58 0.019850
10  4 0.63 0.018080
11  4 0.68 0.016480
12  6 0.72 0.015010
13  6 0.78 0.013680
14  6 0.83 0.012460
15  6 0.87 0.011360
16  7 0.90 0.010350
17  7 0.94 0.009429
18  7 0.97 0.008592
19  7 0.99 0.007828
20  7 1.02 0.007133
21  8 1.06 0.006499
22  8 1.15 0.005922
23  8 1.22 0.005396
24  9 1.29 0.004916
25  9 1.34 0.004480
26 11 1.39 0.004082
27 11 1.44 0.003719
28 11 1.49 0.003389
29 11 1.53 0.003088
30 11 1.56 0.002813
31 11 1.58 0.002563
32 11 1.61 0.002336
33 11 1.62 0.002128
34 11 1.64 0.001939
35 11 1.65 0.001767
36 12 1.66 0.001610
37 12 1.68 0.001467
38 12 1.69 0.001337
39 12 1.70 0.001218
40 11 1.70 0.001110
41 11 1.71 0.001011
42 11 1.71 0.000921
43 11 1.72 0.000839
44 11 1.72 0.000765
45 11 1.72 0.000697
46 11 1.73 0.000635

...
and 6 more lines.

$discriminator_auc
# A tibble: 2 × 4
  .sample  .metric .estimator .estimate
  <fct>    <chr>   <chr>          <dbl>
1 training roc_auc binary           0.5
2 testing  roc_auc binary           0.5

$pmse
# A tibble: 2 × 4
  .source     .pmse .null_pmse .pmse_ratio
  <fct>       <dbl>      <dbl>       <dbl>
1 training 1.23e-32   1.23e-32           1
2 testing  1.23e-32   1.23e-32           1

$specks
# A tibble: 2 × 2
  .source   .specks
  <fct>       <dbl>
1 training 4.86e-17
2 testing  6.94e-17

attr(,"class")
[1] "discrimination"

# look at variable importance
library(vip)

disc2$discriminator |> 
  extract_fit_parsnip() |> 
  vip()

Additional Functionality

Grouping

Many utility metrics include a group_by argument to calculate the metrics within groups. For example, this code calculates moments by species.

util_moments(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  group_by = species
)
# A tibble: 60 × 7
   species   variable       statistic original synthetic difference
   <fct>     <fct>          <fct>        <dbl>     <dbl>      <dbl>
 1 Adelie    bill_length_mm count      146       153          7    
 2 Chinstrap bill_length_mm count       68        78         10    
 3 Gentoo    bill_length_mm count      119       102        -17    
 4 Adelie    bill_length_mm mean        38.8      38.5       -0.294
 5 Chinstrap bill_length_mm mean        48.8      47.3       -1.56 
 6 Gentoo    bill_length_mm mean        47.6      48.0        0.475
 7 Adelie    bill_length_mm sd           2.66      2.93       0.267
 8 Chinstrap bill_length_mm sd           3.34      3.07      -0.273
 9 Gentoo    bill_length_mm sd           3.11      3.40       0.299
10 Adelie    bill_length_mm skewness     0.156     0.505      0.349
# ℹ 50 more rows
# ℹ 1 more variable: proportion_difference <dbl>

Weighting

Many utility metrics include a weight_var argument to use weighted statistics during calculation. For example, this code weights the moments by the body weight of the penguins.

util_moments(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  weight_var = body_mass_g
)
# A tibble: 20 × 6
   variable        statistic original synthetic difference proportion_difference
   <fct>           <fct>        <dbl>     <dbl>      <dbl>                 <dbl>
 1 bill_length_mm  count      1.40e+6   1.39e+6   -1.5 e+4              -0.0107 
 2 bill_length_mm  mean       4.46e+1   4.41e+1   -4.71e-1              -0.0106 
 3 bill_length_mm  sd         5.38e+0   5.54e+0    1.60e-1               0.0297 
 4 bill_length_mm  skewness  -7.39e-2  -5.74e-2    1.64e-2              -0.222  
 5 bill_length_mm  kurtosis  -7.89e-1  -9.15e-1   -1.26e-1               0.160  
 6 bill_depth_mm   count      1.40e+6   1.39e+6   -1.5 e+4              -0.0107 
 7 bill_depth_mm   mean       1.70e+1   1.71e+1    1.28e-1               0.00754
 8 bill_depth_mm   sd         2.01e+0   1.94e+0   -7.07e-2              -0.0351 
 9 bill_depth_mm   skewness   1.05e-2  -1.37e-1   -1.47e-1             -14.0    
10 bill_depth_mm   kurtosis  -1.00e+0  -9.17e-1    8.70e-2              -0.0867 
11 flipper_length… count      1.40e+6   1.39e+6   -1.5 e+4              -0.0107 
12 flipper_length… mean       2.03e+2   2.01e+2   -1.97e+0              -0.00970
13 flipper_length… sd         1.44e+1   1.46e+1    1.60e-1               0.0111 
14 flipper_length… skewness   1.48e-1   3.99e-1    2.51e-1               1.70   
15 flipper_length… kurtosis  -1.14e+0  -1.04e+0    1.08e-1              -0.0940 
16 body_mass_g     count      1.40e+6   1.39e+6   -1.5 e+4              -0.0107 
17 body_mass_g     mean       4.36e+3   4.31e+3   -5.19e+1              -0.0119 
18 body_mass_g     sd         8.26e+2   8.17e+2   -9.84e+0              -0.0119 
19 body_mass_g     skewness   2.69e-1   4.64e-1    1.95e-1               0.724  
20 body_mass_g     kurtosis  -9.70e-1  -7.26e-1    2.44e-1              -0.251  

Most commonly, weight_var is used when synthesizing data from surveys.
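The grouping and weighting arguments can also be combined; a sketch, assuming the two arguments compose (the examples above do not demonstrate this directly):

# assumption: group_by and weight_var can be used together
util_moments(
  postsynth = penguins_postsynth, 
  data = penguins_conf,
  group_by = species,
  weight_var = body_mass_g
)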

Getting Help

Contact Aaron R. Williams with feedback or questions.