Allow util_corr_fit() to group by a variable

emcfalls commented 9 months ago

This extension will allow users to return correlation data for the numerical columns in their synthesis by a certain variable (i.e., gender, age, race, etc.). Therefore, users can assess the performance of their synthesis based how it preserves multivariate relationships for different subgroups in the population. I don't if we would want to group by multiple variables due to the complexity.

Right now, the util_corr_fit() function returns a list

return(
    list(
      correlation_original = original_lt,
      correlation_synthetic = synthetic_lt,
      correlation_difference = difference_lt,
      correlation_fit = correlation_fit,
      correlation_difference_mae = correlation_difference_mae,
      correlation_difference_rmse = correlation_difference_rmse
    )

I have two ideas for this extension

When grouping by another variable, the function would return a list of lists. Each element of the list would be the correlation list data for each subgroup of that variable Example for variable sex:

return(
    list(
      male = list(
      correlation_original_m = original_lt_m,
      correlation_synthetic_m = synthetic_lt_m,
      correlation_difference_m = difference_lt_m,
      correlation_fit_m = correlation_fit_f,
      correlation_difference_mae_m = correlation_difference_mae_m,
      correlation_difference_rmse_m = correlation_difference_rmse_m
      ),
      female = list(
      correlation_original_f = original_lt_f,
      correlation_synthetic_f = synthetic_lt_f,
      correlation_difference_f = difference_lt_f,
      correlation_fit_f = correlation_fit_f,
      correlation_difference_mae_f = correlation_difference_mae_f,
      correlation_difference_rmse_f = correlation_difference_rmse_f
      )
  )

Instead of making a list of lists, we could include all of the correlations for each subgroup in one dataframe for correlation_original, correlation_synthetic, and correlation_difference_f. correlation_difference_f, correlation_difference_mae_f, and correlation_difference_rmse_f would be lists where each element corresponds to the metric for a specific subgroup Example correlation matrix for a dataframe with sex, income (numeric), and height (numeric)

Deckart2 commented 9 months ago

Hey @emcfalls ! Thank you for the very helpful write-up and for sharing your ideas (and also the picture :) ). I have a few thoughts:

I think both are viable alternatives, but I like option 2 better and would support us doing that.
I like option 2 for two reasons. 3. First, it maintains the same API experience as when a .group_by argument is not provided, so, either way, a user would be returned a named list where the values of the list are a data.frame. Second, I can imagine this being more easy for a user to work with than nested lists (I suspect a user would just unnest them and put them in a dataframe anyway).
One other thing I think is important is that we should maintain a consistent API across functions within syntheval. This would mean that for all functions to which you (or any of us) add a group_by argument, our strategy is to return dataframes with a new column for the column containing the subgroup and row bind different dataframes that have the results for each subgroup together to make one larger dataframe (in other words so they all look like the second chart Elyse drew). I looked through all of the utility metrics, and it appears that this strategy is viable for them. For the disclosure metrics, it looks like we generally return either a single float or a named list of floats. This would mean we would have to change that API, but we will have to change it regardless for any functions that get the group_by argument. This is a bummer but is probably worth it.

One alternative to the two options presented here would be to make any functions that get a group_by argument to take another optional parameter of group (or something similar):

The function would then return the exact same results as it does now, but just analyzes the one group, and you call the function multiple times. Internally, the function would work the same as it does now, but it would just start off by doing a filter(group_by_col == group) for both the synthetic and original data.
Then the user would iterate over all the values of the group in the group_by column.
The pro of this is that it maintains an identical API. The con is that this is not very elegant and users would have to put together result dataframes.

Deckart2 commented 9 months ago

One other note as a total aside, markdown is supported on Github issues, so you can get inline code and

code chunks

with one and three tic marks, respectively.

emcfalls commented 9 months ago

@Deckart2 I agree with not changing the format of the output and keeping it consistent with the other functions that use groupby! I also like your idea of using a group parameter so instead of returning the results for all groups we just return it for one. I think that would be the easiest as far as keeping the same function output, but I'm thinking it may be tedious for users who want to look at multiple groups. I think a best of both worlds would be to have both parameters (groupby and group), but that may be overkill.

Deckart2 commented 9 months ago

Sounds right to me, and I think it could be totally okay to have a groupby and group argument. I also agree it could be tedious, but if we go this route, we could add some documentation that could show how to do it in a few lines of code.

It may be harder than this, but at its core, we would need to do something like:

Identify all relevant groups: groups <- synthetic_data$group_var |> unique()
Call function for each group: results_by_group <- map_dfr(.f = util_corr_fit, .x = groups, ..., groupby = group_var)

Anyway, this is really thoughtful, and excited to discuss it synchronously with you @emcfalls and hear @awunderground 's thoughts :)!

Deckart2 commented 9 months ago

Brief Notes from Convo with Aaron: Inspire by collect metrics from tidymodels: see documentation here - https://awunderground.github.io/data-science-for-public-policy2/11_ensembling.html Go away from list of lists and instead go for tidy data of tibbles.

UrbanInstitute / syntheval

Allow util_corr_fit() to group by a variable #58