Open emcfalls opened 9 months ago
Hey @emcfalls ! Thank you for the very helpful write-up and for sharing your ideas (and also the picture :) ). I have a few thoughts:
`group_by` argument is not provided, so, either way, a user would be returned a named list where the values of the list are a `data.frame`. Second, I can imagine this being easier for a user to work with than nested lists (I suspect a user would just unnest them and put them in a dataframe anyway).
`group_by` argument, our strategy is to return dataframes with a new column containing the subgroup and to row-bind the dataframes that hold the results for each subgroup into one larger dataframe (in other words, so they all look like the second chart Elyse drew). I looked through all of the utility metrics, and it appears that this strategy is viable for them. For the disclosure metrics, it looks like we generally return either a single float or a named list of floats. This would mean we would have to change that API, but we will have to change it regardless for any functions that get the `group_by` argument. This is a bummer but is probably worth it.
One alternative to the two options presented here would be to have any function that gets a `group_by` argument take another optional parameter, `group` (or something similar):
`filter(group_by_col == group)` for both the synthetic and original data.
One other note as a total aside: markdown is supported on GitHub issues, so you can get `inline code` and code chunks with one and three tick marks, respectively.
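As a rough illustration of the row-binding strategy above, here is a minimal sketch. It is not syntheval code: `corr_by_group()` and the toy data are made up for this example, and a single correlation stands in for a real utility metric.

```r
library(dplyr)
library(purrr)

# Toy data standing in for a synthetic or confidential dataset (made up)
toy <- tibble::tibble(
  sex = c("f", "f", "f", "m", "m", "m"),
  x   = c(1, 2, 3, 1, 2, 3),
  y   = c(2, 4, 6, 3, 2, 1)
)

# Hypothetical helper: compute one result per subgroup, tag each result with
# its subgroup in a `group` column, and row-bind everything into one data frame
corr_by_group <- function(data, group_col) {
  data |>
    split(data[[group_col]]) |>
    imap(\(df, g) tibble::tibble(group = g, correlation = cor(df$x, df$y))) |>
    bind_rows()
}

corr_by_group(toy, "sex")
```

The result is one tibble with one row per subgroup, so it can go straight into `filter()` or a plot without any unnesting.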
@Deckart2 I agree with not changing the format of the output and keeping it consistent with the other functions that use groupby! I also like your idea of using a group parameter, so instead of returning the results for all groups we just return them for one. I think that would be the easiest as far as keeping the same function output, but I'm thinking it may be tedious for users who want to look at multiple groups. I think the best of both worlds would be to have both parameters (groupby and group), but that may be overkill.
Sounds right to me, and I think it could be totally okay to have a groupby and group argument. I also agree it could be tedious, but if we go this route, we could add some documentation that could show how to do it in a few lines of code.
It may be harder than this, but at its core, we would need to do something like:
```r
# Unique group values, then one metric call per group, row-bound together
groups <- synthetic_data$group_var |> unique()
results_by_group <- purrr::map_dfr(.x = groups, .f = util_corr_fit, ..., groupby = group_var)
```
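For what it's worth, here is a self-contained version of that idea that actually runs. Everything in it is an assumption for illustration: `corr_diff()` is a made-up stand-in for a metric that takes `group_by` and `group` arguments (it is not the real `util_corr_fit()`), and the data are toy values.

```r
library(dplyr)
library(purrr)

# Made-up original and synthetic data
original <- tibble::tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 2, 3, 3, 2, 1)
)
synthetic <- tibble::tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(1, 2, 3, 1, 2, 3)
)

# Hypothetical metric: filter both datasets to one group, then compare them
corr_diff <- function(synthetic, original, group_by, group) {
  # .env$group makes sure we use the function argument, not a column
  synth_g <- filter(synthetic, .data[[group_by]] == .env$group)
  orig_g  <- filter(original, .data[[group_by]] == .env$group)
  tibble::tibble(
    group = group,
    corr_difference = cor(synth_g$x, synth_g$y) - cor(orig_g$x, orig_g$y)
  )
}

# A user who wants every group can loop over the unique values in a few lines
groups <- unique(synthetic$g)
results_by_group <- groups |>
  map(\(grp) corr_diff(synthetic, original, group_by = "g", group = grp)) |>
  bind_rows()
```

This keeps the single-group output format unchanged while still giving one row per group when a user wants all of them.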
Anyway, this is really thoughtful, and I'm excited to discuss it synchronously with you @emcfalls and hear @awunderground's thoughts :)!
Brief Notes from Convo with Aaron: Take inspiration from `collect_metrics()` in tidymodels (see documentation here: https://awunderground.github.io/data-science-for-public-policy2/11_ensembling.html). Move away from a list of lists and instead return tidy data as tibbles.
This extension will allow users to return correlation data for the numerical columns in their synthesis by a certain variable (e.g., gender, age, or race). Users can therefore assess the performance of their synthesis based on how well it preserves multivariate relationships for different subgroups in the population. I don't know if we would want to group by multiple variables, given the added complexity.
Right now, the util_corr_fit() function returns a list
I have two ideas for this extension