syntheval makes it simple to evaluate the utility and disclosure risks
of synthetic data. The package is designed to work with postsynth
objects from library(tidysynthesis)
but also works well with any data
frame in R.
Note: library(tidysynthesis)
is currently under private
development but will be made public in the future.
install.packages("remotes")
remotes::install_github("UrbanInstitute/syntheval")
library(tidyverse)
library(syntheval)
The following examples demonstrate utility and disclosure risk metrics
using synthetic data based on the Palmer
Penguins dataset.
library(syntheval) contains three built-in data sets:

- penguins_conf: Pre-processed penguins data that were passed into the synthesizer.
- penguins_postsynth: A postsynth object synthesized from penguins using library(tidysynthesis).
- penguins_syn_df: A data frame pulled from penguins_postsynth. This is used to demonstrate how library(syntheval) works with output from a synthesizer different than library(tidysynthesis).

Functions like util_proportions() and util_moments() have different behaviors for postsynth objects and data frames. By default, they only show synthesized variables for postsynth objects and show all common variables for data frames. The common_vars and synth_vars arguments can change this behavior.
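For example, to show all common variables rather than only the synthesized variables for a postsynth object, a call along these lines should work (a minimal sketch; it assumes common_vars and synth_vars are logical flags, so check ?util_proportions for the exact semantics):

# sketch: include all common variables, not just synthesized ones
util_proportions(
  postsynth = penguins_postsynth,
  data = penguins_conf,
  common_vars = TRUE,
  synth_vars = FALSE
)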
util_proportions()
compares the proportions of classes from
categorical variables in the original data and synthetic data.
util_proportions(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 2 × 5
variable class synthetic original difference
<chr> <fct> <dbl> <dbl> <dbl>
1 sex female 0.529 0.495 0.0330
2 sex male 0.471 0.505 -0.0330
All common variables are shown when using a data frame.
util_proportions(
postsynth = penguins_syn_df,
data = penguins_conf
)
# A tibble: 8 × 5
variable class synthetic original difference
<chr> <fct> <dbl> <dbl> <dbl>
1 island Biscoe 0.465 0.489 -0.0240
2 island Dream 0.414 0.369 0.0450
3 island Torgersen 0.120 0.141 -0.0210
4 sex female 0.529 0.495 0.0330
5 sex male 0.471 0.505 -0.0330
6 species Adelie 0.459 0.438 0.0210
7 species Chinstrap 0.234 0.204 0.0300
8 species Gentoo 0.306 0.357 -0.0511
util_moments()
compares the counts, means, standard deviations,
skewnesses, and kurtoses of the original data and synthetic data.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 20 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 3.33e+2 333 0 0
2 bill_length_mm mean 4.40e+1 43.5 -0.502 -0.0114
3 bill_length_mm sd 5.47e+0 5.54 0.0723 0.0132
4 bill_length_mm skewness 4.51e-2 0.0646 0.0195 0.432
5 bill_length_mm kurtosis -8.88e-1 -0.948 -0.0598 0.0674
6 bill_depth_mm count 3.33e+2 333 0 0
7 bill_depth_mm mean 1.72e+1 17.3 0.122 0.00712
8 bill_depth_mm sd 1.97e+0 1.89 -0.0762 -0.0387
9 bill_depth_mm skewness -1.49e-1 -0.278 -0.129 0.867
10 bill_depth_mm kurtosis -8.97e-1 -0.742 0.155 -0.172
11 flipper_length… count 3.33e+2 333 0 0
12 flipper_length… mean 2.01e+2 199. -1.70 -0.00847
13 flipper_length… sd 1.40e+1 13.9 -0.135 -0.00961
14 flipper_length… skewness 3.59e-1 0.611 0.253 0.705
15 flipper_length… kurtosis -9.65e-1 -0.704 0.261 -0.270
16 body_mass_g count 3.33e+2 333 0 0
17 body_mass_g mean 4.21e+3 4162. -45.0 -0.0107
18 body_mass_g sd 8.05e+2 783. -22.3 -0.0277
19 body_mass_g skewness 4.70e-1 0.655 0.185 0.394
20 body_mass_g kurtosis -7.40e-1 -0.388 0.353 -0.477
util_totals()
is similar to util_moments()
but looks at counts and
totals.
util_totals(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 8 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 333 333 0 0
2 bill_length_mm total 14650. 14483. -167. -0.0114
3 bill_depth_mm count 333 333 0 0
4 bill_depth_mm total 5716. 5757. 40.7 0.00712
5 flipper_length_… count 333 333 0 0
6 flipper_length_… total 66922 66355 -567 -0.00847
7 body_mass_g count 333 333 0 0
8 body_mass_g total 1400950 1385950 -15000 -0.0107
util_percentiles()
compares percentiles from the original data and
synthetic data. The default percentiles are c(0.1, 0.5, 0.9)
and can easily be overridden.
util_percentiles(
postsynth = penguins_postsynth,
data = penguins_conf,
probs = c(0.5, 0.8)
)
# A tibble: 8 × 6
p variable original synthetic difference proportion_difference
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 0.5 bill_length_mm 44.5 43.5 -1 -0.0225
2 0.8 bill_length_mm 49.5 49.1 -0.440 -0.00889
3 0.5 bill_depth_mm 17.3 17.5 0.200 0.0116
4 0.8 bill_depth_mm 18.9 19.0 0.0600 0.00317
5 0.5 flipper_length_mm 197 195 -2 -0.0102
6 0.8 flipper_length_mm 215 214 -1 -0.00465
7 0.5 body_mass_g 4050 3950 -100 -0.0247
8 0.8 body_mass_g 4990 4850 -140 -0.0281
The functions are designed to work well with library(ggplot2).
util_percentiles(
postsynth = penguins_postsynth,
data = penguins_conf,
probs = seq(0.01, 0.99, 0.01)
) |>
pivot_longer(
cols = c(original, synthetic),
names_to = "source",
values_to = "value"
) |>
ggplot(aes(x = p, y = value, color = source)) +
geom_line() +
facet_wrap(~ variable, scales = "free")
util_ks_distance()
shows the Kolmogorov-Smirnov distance between the
original distribution and synthetic distribution for numeric variables.
The function also returns the point(s) of the maximum distance.
util_ks_distance(
postsynth = penguins_syn_df,
data = penguins_conf
)
# A tibble: 14 × 3
variable value D
<chr> <dbl> <dbl>
1 bill_length_mm 38.7 0.0601
2 bill_depth_mm 16.7 0.0511
3 bill_depth_mm 16.7 0.0511
4 bill_depth_mm 16.8 0.0511
5 bill_depth_mm 16.8 0.0511
6 flipper_length_mm 196. 0.0781
7 flipper_length_mm 196. 0.0781
8 flipper_length_mm 197. 0.0781
9 flipper_length_mm 197. 0.0781
10 flipper_length_mm 197. 0.0781
11 body_mass_g 4359. 0.0480
12 body_mass_g 4370. 0.0480
13 body_mass_g 4381. 0.0480
14 body_mass_g 4392. 0.0480
util_co_occurrence()
differences the lower triangles of co-occurrence
matrices calculated on numeric variables in the original data and
synthetic data.
co_occurrence <- util_co_occurrence(
postsynth = penguins_postsynth,
data = penguins_conf
)
co_occurrence$co_occurrence_difference
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm NA NA NA NA
bill_depth_mm 0 NA NA NA
flipper_length_mm 0 0 NA NA
body_mass_g 0 0 0 NA
The function returns the MAE for co-occurrences, which provides a sense of the average absolute error between the original and synthetic data. The function also returns the RMSE for co-occurrences, which likewise summarizes the error but penalizes large errors more heavily.
co_occurrence$co_occurrence_difference_mae
[1] 0
co_occurrence$co_occurrence_difference_rmse
[1] 0
All observations have non-zero bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. util_co_occurrence() is most useful for economic variables like income and wealth where 0 is a common value.
util_corr_fit()
differences the lower triangles of correlation
matrices calculated on numeric variables in the original data and
synthetic data.
corr_fit <- util_corr_fit(
postsynth = penguins_postsynth,
data = penguins_conf
)
round(corr_fit$correlation_difference, digits = 3)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm NA NA NA NA
bill_depth_mm -0.069 NA NA NA
flipper_length_mm -0.003 -0.048 NA NA
body_mass_g 0.031 -0.011 -0.08 NA
The function returns the MAE for correlation coefficients, which provides a sense of the average absolute error between the original and synthetic data. The function also returns the RMSE for the correlation coefficients, which likewise summarizes the error but penalizes large errors more heavily.
corr_fit$correlation_difference_mae
[1] 0.04034565
corr_fit$correlation_difference_rmse
[1] 0.04922324
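For intuition, these two summaries can be reproduced directly from the difference matrix with base R (a minimal sketch; na.rm = TRUE drops the NA diagonal and upper triangle):

# MAE: mean of the absolute differences in the lower triangle
mean(abs(corr_fit$correlation_difference), na.rm = TRUE)

# RMSE: root of the mean squared differences in the lower triangle
sqrt(mean(corr_fit$correlation_difference^2, na.rm = TRUE))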
util_ci_overlap()
compares linear regression models estimated on the
original data and synthetic data. formula
specifies the functional
form of the regression model.
ci_overlap <- util_ci_overlap(
postsynth = penguins_postsynth,
data = penguins_conf,
formula = body_mass_g ~ bill_length_mm + sex
)
$ci_overlap summarizes each coefficient, including how much the confidence intervals overlap, whether the signs match, and whether the statistical significance matches.
ci_overlap$ci_overlap
# A tibble: 3 × 8
term overlap coef_diff std_coef_diff sign_match significance_match ss_match
<chr> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 (Inter… 0.963 -27.8 -0.0978 TRUE TRUE TRUE
2 bill_l… 0.965 0.917 0.138 TRUE TRUE TRUE
3 sexmale 0.951 -14.0 -0.192 TRUE TRUE TRUE
# ℹ 1 more variable: sso_match <lgl>
$coefficient
provides detail for each coefficient and is useful for
data visualization.
ci_overlap$coefficient
# A tibble: 6 × 8
source term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 original (Intercept) 746. 285. 2.62 9.22e- 3 186. 1306.
2 original bill_lengt… 74.0 6.67 11.1 1.51e-24 60.9 87.1
3 original sexmale 405. 72.8 5.57 5.43e- 8 262. 548.
4 synthetic (Intercept) 718. 264. 2.72 6.77e- 3 200. 1237.
5 synthetic bill_lengt… 74.9 6.25 12.0 8.98e-28 62.7 87.2
6 synthetic sexmale 391. 69.2 5.65 3.42e- 8 255. 527.
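The overlap column is consistent with the interval-overlap utility measure of Karr et al. (2006). As a quick check, the intercept's overlap can be recomputed from the confidence limits printed above (a sketch using the rounded values):

# interval overlap (Karr et al. 2006) for the intercept
l_o <- 186; u_o <- 1306   # original 95% CI
l_s <- 200; u_s <- 1237   # synthetic 95% CI
shared <- min(u_o, u_s) - max(l_o, l_s)
0.5 * (shared / (u_o - l_o) + shared / (u_s - l_s))
# ~0.963, roughly matching the overlap reported above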
ci_overlap$coefficient |>
ggplot(aes(x = estimate, xmin = conf.low, xmax = conf.high, y = term, color = source)) +
geom_pointrange(alpha = 0.5, position = position_dodge(width = 0.5)) +
labs(
title = "The Synthesizer Recreates the Point Estimates and Confidence Intervals",
subtitle = "Regression Confidence Interval Overlap"
)
Discriminant-based metrics build models to predict if an observation is original or synthetic and then evaluate those model predictions. Ideally, it should be difficult for a model to distinguish, or discriminate, between original observations and synthetic observations.
- library(tidymodels) can be used to generate propensities (the estimated probability that an observation is synthetic).
- By default, the data are split into training and testing sets; the split can be turned off with split = FALSE.
- Discriminant-based metrics are built on a discrimination object created by discrimination().
disc1 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)
Next, we use library(tidymodels)
to specify a model. We recommend the
tidymodels tutorial to learn more.
library(tidymodels)
rpart_rec <- recipe(
.source_label ~ .,
data = disc1$combined_data
)
rpart_mod <- decision_tree(cost_complexity = 0.01) |>
set_mode(mode = "classification") |>
set_engine(engine = "rpart")
Next, we fit the model to the data to generate predicted probabilities.
disc1 <- disc1 |>
add_propensities(
recipe = rpart_rec,
spec = rpart_mod
)
At this point, we can use

- add_discriminator_auc() to add the ROC AUC for the predicted probabilities
- add_specks() to add SPECKS for the predicted probabilities
- add_pmse() to add pMSE for the predicted probabilities
- add_pmse_ratio(times = 25) to add the pMSE ratio using the pMSE model and 25 bootstrap samples

A sketch of how pMSE and SPECKS relate to the propensities follows the printed output below.

disc1 |>
add_discriminator_auc() |>
add_specks() |>
add_pmse() |>
add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
.source_label species island sex bill_length_mm bill_depth_mm
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 original Adelie Torgersen male 39.1 18.7
2 original Adelie Torgersen female 39.5 17.4
3 original Adelie Torgersen female 40.3 18
4 original Adelie Torgersen female 36.7 19.3
5 original Adelie Torgersen male 39.3 20.6
6 original Adelie Torgersen female 38.9 17.8
7 original Adelie Torgersen male 39.2 19.6
8 original Adelie Torgersen female 41.1 17.6
9 original Adelie Torgersen male 38.6 21.2
10 original Adelie Torgersen male 34.6 21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>
$propensities
# A tibble: 666 × 10
.pred_synthetic .source_label .sample species island sex bill_length_mm
<dbl> <fct> <chr> <fct> <fct> <fct> <dbl>
1 0.0556 original training Adelie Torgersen male 39.1
2 0.355 original training Adelie Torgersen fema… 39.5
3 0.636 original training Adelie Torgersen fema… 40.3
4 0.375 original training Adelie Torgersen fema… 36.7
5 0.76 original testing Adelie Torgersen male 39.3
6 0.355 original training Adelie Torgersen fema… 38.9
7 0.375 original training Adelie Torgersen male 39.2
8 0.355 original training Adelie Torgersen fema… 41.1
9 0.4 original training Adelie Torgersen male 38.6
10 0.789 original testing Adelie Torgersen male 34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
# body_mass_g <dbl>
$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
n= 498
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 498 249 synthetic (0.50000000 0.50000000)
2) bill_length_mm< 34.8 19 4 synthetic (0.78947368 0.21052632) *
3) bill_length_mm>=34.8 479 234 original (0.48851775 0.51148225)
6) bill_depth_mm>=18.85 99 44 synthetic (0.55555556 0.44444444)
12) bill_length_mm< 50.35 85 32 synthetic (0.62352941 0.37647059)
24) species=Chinstrap 21 2 synthetic (0.90476190 0.09523810) *
25) species=Adelie 64 30 synthetic (0.53125000 0.46875000)
50) flipper_length_mm< 192.5 40 15 synthetic (0.62500000 0.37500000)
100) bill_length_mm>=38.9 25 6 synthetic (0.76000000 0.24000000) *
101) bill_length_mm< 38.9 15 6 original (0.40000000 0.60000000) *
51) flipper_length_mm>=192.5 24 9 original (0.37500000 0.62500000) *
13) bill_length_mm>=50.35 14 2 original (0.14285714 0.85714286) *
7) bill_depth_mm< 18.85 380 179 original (0.47105263 0.52894737)
14) bill_length_mm>=50.55 45 17 synthetic (0.62222222 0.37777778) *
15) bill_length_mm< 50.55 335 151 original (0.45074627 0.54925373)
30) bill_length_mm< 46.05 218 107 original (0.49082569 0.50917431)
60) bill_depth_mm< 18.25 177 83 synthetic (0.53107345 0.46892655)
120) bill_depth_mm>=13.75 169 77 synthetic (0.54437870 0.45562130)
240) bill_length_mm>=36.55 142 61 synthetic (0.57042254 0.42957746)
480) bill_length_mm< 38.35 27 8 synthetic (0.70370370 0.29629630) *
481) bill_length_mm>=38.35 115 53 synthetic (0.53913043 0.46086957)
962) bill_length_mm>=41.65 73 29 synthetic (0.60273973 0.39726027) *
963) bill_length_mm< 41.65 42 18 original (0.42857143 0.57142857)
1926) flipper_length_mm>=193.5 11 4 synthetic (0.63636364 0.36363636) *
1927) flipper_length_mm< 193.5 31 11 original (0.35483871 0.64516129) *
241) bill_length_mm< 36.55 27 11 original (0.40740741 0.59259259) *
121) bill_depth_mm< 13.75 8 2 original (0.25000000 0.75000000) *
61) bill_depth_mm>=18.25 41 13 original (0.31707317 0.68292683)
122) flipper_length_mm>=191.5 23 11 synthetic (0.52173913 0.47826087)
244) bill_depth_mm>=18.55 10 2 synthetic (0.80000000 0.20000000) *
245) bill_depth_mm< 18.55 13 4 original (0.30769231 0.69230769) *
123) flipper_length_mm< 191.5 18 1 original (0.05555556 0.94444444) *
31) bill_length_mm>=46.05 117 44 original (0.37606838 0.62393162)
62) flipper_length_mm>=229.5 9 3 synthetic (0.66666667 0.33333333) *
63) flipper_length_mm< 229.5 108 38 original (0.35185185 0.64814815)
126) flipper_length_mm< 214.5 54 24 original (0.44444444 0.55555556)
252) body_mass_g>=3925 26 10 synthetic (0.61538462 0.38461538) *
253) body_mass_g< 3925 28 8 original (0.28571429 0.71428571) *
127) flipper_length_mm>=214.5 54 14 original (0.25925926 0.74074074) *
$discriminator_auc
# A tibble: 2 × 4
.sample .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 training roc_auc binary 0.743
2 testing roc_auc binary 0.461
$pmse
# A tibble: 2 × 4
.source .pmse .null_pmse .pmse_ratio
<fct> <dbl> <dbl> <dbl>
1 training 0.0465 0.0327 1.42
2 testing 0.0483 0.0325 1.49
$specks
# A tibble: 2 × 2
.source .specks
<fct> <dbl>
1 training 0.386
2 testing 0.0833
attr(,"class")
[1] "discrimination"
Finally, we can look at variable importance and the decision tree from our discriminator.
library(vip)
library(rpart.plot)
disc1$discriminator |>
extract_fit_parsnip() |>
vip()
disc1$discriminator$fit$fit$fit |>
prp()
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
Call prp with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
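As the warning suggests, passing roundint = FALSE silences it:

disc1$discriminator$fit$fit$fit |>
  prp(roundint = FALSE)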
Let’s repeat the workflow from above with LASSO logistic regression and hyperparameter tuning.
# create discrimination
disc2 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)
# create a recipe that includes 2nd-degree polynomials, dummy variables, and
# standardization
lasso_rec <- recipe(
.source_label ~ .,
data = disc2$combined_data
) |>
step_poly(all_numeric_predictors(), degree = 2) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_predictors())
# create the model
lasso_mod <- logistic_reg(
penalty = tune(),
mixture = 1
) |>
set_engine(engine = "glmnet") |>
set_mode(mode = "classification")
# create a tuning grid
lasso_grid <- grid_regular(penalty(), levels = 10)
# add the propensities
disc2 <- disc2 |>
add_propensities_tuned(
recipe = lasso_rec,
spec = lasso_mod,
grid = lasso_grid
)
# calculate metrics
disc2 |>
add_discriminator_auc() |>
add_specks() |>
add_pmse() |>
add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
.source_label species island sex bill_length_mm bill_depth_mm
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 original Adelie Torgersen male 39.1 18.7
2 original Adelie Torgersen female 39.5 17.4
3 original Adelie Torgersen female 40.3 18
4 original Adelie Torgersen female 36.7 19.3
5 original Adelie Torgersen male 39.3 20.6
6 original Adelie Torgersen female 38.9 17.8
7 original Adelie Torgersen male 39.2 19.6
8 original Adelie Torgersen female 41.1 17.6
9 original Adelie Torgersen male 38.6 21.2
10 original Adelie Torgersen male 34.6 21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>
$propensities
# A tibble: 666 × 10
.pred_synthetic .source_label .sample species island sex bill_length_mm
<dbl> <fct> <chr> <fct> <fct> <fct> <dbl>
1 0.5 original training Adelie Torgersen male 39.1
2 0.5 original training Adelie Torgersen fema… 39.5
3 0.5 original training Adelie Torgersen fema… 40.3
4 0.5 original training Adelie Torgersen fema… 36.7
5 0.5 original testing Adelie Torgersen male 39.3
6 0.5 original testing Adelie Torgersen fema… 38.9
7 0.5 original training Adelie Torgersen male 39.2
8 0.5 original training Adelie Torgersen fema… 41.1
9 0.5 original training Adelie Torgersen male 38.6
10 0.5 original testing Adelie Torgersen male 34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
# body_mass_g <dbl>
$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_poly()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", alpha = ~1)
Df %Dev Lambda
1 0 0.00 0.041780
2 1 0.09 0.038070
3 1 0.16 0.034680
4 1 0.22 0.031600
5 2 0.27 0.028800
6 2 0.36 0.026240
7 2 0.43 0.023910
8 4 0.51 0.021780
9 4 0.58 0.019850
10 4 0.63 0.018080
11 4 0.68 0.016480
12 6 0.72 0.015010
13 6 0.78 0.013680
14 6 0.83 0.012460
15 6 0.87 0.011360
16 7 0.90 0.010350
17 7 0.94 0.009429
18 7 0.97 0.008592
19 7 0.99 0.007828
20 7 1.02 0.007133
21 8 1.06 0.006499
22 8 1.15 0.005922
23 8 1.22 0.005396
24 9 1.29 0.004916
25 9 1.34 0.004480
26 11 1.39 0.004082
27 11 1.44 0.003719
28 11 1.49 0.003389
29 11 1.53 0.003088
30 11 1.56 0.002813
31 11 1.58 0.002563
32 11 1.61 0.002336
33 11 1.62 0.002128
34 11 1.64 0.001939
35 11 1.65 0.001767
36 12 1.66 0.001610
37 12 1.68 0.001467
38 12 1.69 0.001337
39 12 1.70 0.001218
40 11 1.70 0.001110
41 11 1.71 0.001011
42 11 1.71 0.000921
43 11 1.72 0.000839
44 11 1.72 0.000765
45 11 1.72 0.000697
46 11 1.73 0.000635
...
and 6 more lines.
$discriminator_auc
# A tibble: 2 × 4
.sample .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 training roc_auc binary 0.5
2 testing roc_auc binary 0.5
$pmse
# A tibble: 2 × 4
.source .pmse .null_pmse .pmse_ratio
<fct> <dbl> <dbl> <dbl>
1 training 1.23e-32 1.23e-32 1
2 testing 1.23e-32 1.23e-32 1
$specks
# A tibble: 2 × 2
.source .specks
<fct> <dbl>
1 training 4.86e-17
2 testing 6.94e-17
attr(,"class")
[1] "discrimination"
# look at variable importance
library(vip)
disc2$discriminator |>
extract_fit_parsnip() |>
vip()
Many utility metrics include a group_by argument to calculate the metrics within groups. For example, this code calculates moments by species.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf,
group_by = species
)
# A tibble: 60 × 7
species variable statistic original synthetic difference
<fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Adelie bill_length_mm count 146 153 7
2 Chinstrap bill_length_mm count 68 78 10
3 Gentoo bill_length_mm count 119 102 -17
4 Adelie bill_length_mm mean 38.8 38.5 -0.294
5 Chinstrap bill_length_mm mean 48.8 47.3 -1.56
6 Gentoo bill_length_mm mean 47.6 48.0 0.475
7 Adelie bill_length_mm sd 2.66 2.93 0.267
8 Chinstrap bill_length_mm sd 3.34 3.07 -0.273
9 Gentoo bill_length_mm sd 3.11 3.40 0.299
10 Adelie bill_length_mm skewness 0.156 0.505 0.349
# ℹ 50 more rows
# ℹ 1 more variable: proportion_difference <dbl>
Many utility metrics include a weight_var
argument to use weighted
statistics during calculation. For example, this code weights the
moments by the body weight of the penguins.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf,
weight_var = body_mass_g
)
# A tibble: 20 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
2 bill_length_mm mean 4.46e+1 4.41e+1 -4.71e-1 -0.0106
3 bill_length_mm sd 5.38e+0 5.54e+0 1.60e-1 0.0297
4 bill_length_mm skewness -7.39e-2 -5.74e-2 1.64e-2 -0.222
5 bill_length_mm kurtosis -7.89e-1 -9.15e-1 -1.26e-1 0.160
6 bill_depth_mm count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
7 bill_depth_mm mean 1.70e+1 1.71e+1 1.28e-1 0.00754
8 bill_depth_mm sd 2.01e+0 1.94e+0 -7.07e-2 -0.0351
9 bill_depth_mm skewness 1.05e-2 -1.37e-1 -1.47e-1 -14.0
10 bill_depth_mm kurtosis -1.00e+0 -9.17e-1 8.70e-2 -0.0867
11 flipper_length… count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
12 flipper_length… mean 2.03e+2 2.01e+2 -1.97e+0 -0.00970
13 flipper_length… sd 1.44e+1 1.46e+1 1.60e-1 0.0111
14 flipper_length… skewness 1.48e-1 3.99e-1 2.51e-1 1.70
15 flipper_length… kurtosis -1.14e+0 -1.04e+0 1.08e-1 -0.0940
16 body_mass_g count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
17 body_mass_g mean 4.36e+3 4.31e+3 -5.19e+1 -0.0119
18 body_mass_g sd 8.26e+2 8.17e+2 -9.84e+0 -0.0119
19 body_mass_g skewness 2.69e-1 4.64e-1 1.95e-1 0.724
20 body_mass_g kurtosis -9.70e-1 -7.26e-1 2.44e-1 -0.251
Most commonly, weight_var
is used when synthesizing data from surveys.
Contact Aaron R. Williams with feedback or questions.