Review and consider applying `?dplyr_extending` for `epi_df` - Githubissues

cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology

https://cmu-delphi.github.io/epiprocess/

Other

13 stars 9 forks source link

Review and consider applying `?dplyr_extending` for `epi_df` #223

Open brookslogan opened 2 years ago

brookslogan commented 2 years ago

I believe that we implemented epi_df without the use of ?dplyr_extending. We may have:

Missing S3 implementations
Unnecessary S3 implementations
Noncompliant behavior

We should review ?dplyr_extending to see if it hints at bugs, enhancements, code simplifications, etc. (E.g., the mention of [ might have hinted that we would want a [ implementation before we discovered it on our own.)

brookslogan commented 1 year ago

Might have ran into one of these cases in #64 where grouped epi_dfs being bound together lost their epi_dfness. Implementing a dplyr_reconstruct.epi_df based on [.epi_df seems to partially resolve. But note this exacerbates #242 (which underlies #213).

May also related to #117, cmu-delphi/epipredict#51.

brookslogan commented 1 year ago

A potential issue with ?dplyr_extending guidance: arrange on grouped epi_dfs drops epi_dfness. We need to implement dplyr_row_slice as well. (We also need to implement other methods interacting with grouping: group_by, ungroup, group_modify [or maybe not; an appropriate group_data + any other required extensions of grouped_df methods (see below) might get this "for free"/"properly"]; as well as unnest; we may be missing some others.)

brookslogan commented 1 year ago

There are a few other issues with grouped epi dfs in the current dev / lcb/make_epix_slide_more_like_reframe branches; see #113, #298. Note:

Part of this may be due to missing a group_data method, alluded to in ?dplyr_extending. We'd want either/both (a) "epi_df" before "grouped_df" in the class vector to be able to dispatch to group_data.epi_df instead of group_data.grouped_df, or (b) to mess with the class of the "groups" attr generated by group_by (doesn't sound great.)
Note in comments of #113 that we want "grouped_df" before "epi_df" due to some pillar methods for grouped_df expecting it to be before more base-like classes in the class vector.

1(a) and 2 (and maybe more points) seem to conflict. One potential resolution:

Make grouped epi_dfs use the class c("grouped_edf", "grouped_df", "epi_df", "tbl_df", "tbl", "data.frame"). That's also suggested in this older thread.

brookslogan commented 1 year ago

While applying the "potential resolution" above seems to help in some aspects, it also loses epi_df-ness in a lot of other situations. The problem appears to be that grouped_df's methods don't really respect dplyr_extending for grouped_* classes in front of it. So we'd need to look at methods(,"grouped_df") and/or methods(,"grouped_{somethingelse}") and make sure to override/mirror the things we want. [2024-09-24: perhaps this wasn't/isn't the case; it seems you can get these ops to work as long as you make sure not to lose your head class after any NextMethods in dplyr_extending impls, including e.g. names<-, and this may be sufficient. Also, looking at grouped_df S3 methods might not have even worked, as there might be some non-S3 functions manually branching on grouped_df-ness.] Related: #145. [We'll need to look in some packages beyond dplyr for these, e.g., for vctrs::restore.]

[Same approach would need to be applied for rowwise edfs. Plus, in any approach, we will want to review tidyr for more methods that may need wrapped like, e.g., unnest.]

brookslogan commented 1 year ago

More recent reference: https://github.com/tidyverse/dplyr/issues/5480

Also, I'm not sure if the approaches with a separate grouped_{something} class approach were all limited to tidyverts and not tidyverse. It seems like tidyverse things like dtplyr and sf might use either (a) manual delegation rather than via NextMethod (and have custom print with as_tibble, format, [-1] to remove header line(s)) or (b) the single-class-in-front-of-grouped_df approach with some additional methods implemented (and just has the same "# A tibble: [...]" line sticking around).

brookslogan commented 1 year ago

Progress is in lcb/round-out-grouped-epi_dfs. However, it already seems to have some problems with summarize drop_last seeing time_value as an additional grouping variable when we are only grouping by geo_value, and thus grouping results by geo_value by default when outputting 1 row per computation [within epi_cor. Seems like we can't just use reclass or dplyr_reconstruct everywhere without thinking, and/or there's an issue with the current dplyr_reconstruct.grouped_edf; dplyr_reconstruct.grouped_df tries to take grouping from template (intersecting with available columns in data)]. Also needs further, more focused, tests that it actually fixes what we wanted to fix. Some other existing test(s) are broken from not dropping data.table attributes that are now being kept around by group_modify result assembly in epix_slide (and kept inside the computations by as_tibble).

brookslogan commented 1 year ago

The questions raised by how to reconstruct a grouped summarize might tie into questions about when epix_slide should output an epi_df, when geo_value might either come from "outside", when the the slide is grouped by geo_value, or the "inside", when the slide isn't grouped but the user computations output epi_dfs (with geo_values). A useful concept might be an epi_df_selection/epi_df_slice/sub_epi_df or an epi_df_group, which would be a special class we'd use, e.g., as the first or both args in group_modify computations, but only decaying if essential columns don't appear in either arg, rather than decaying when they're gone from the first arg. First, this special class would solve us having to use .keep = TRUE to keep a relevant class passed in the first arg of group_modify, and second, it might help in other summarization situations to avoid having to attempt reconstructing things against a template that might not be very similar (since summarization can alter so much). [Alternatively, there could be an epi_df_groups class output from group_data and hopefully retained until it's time to unnest, etc.]

brookslogan commented 1 year ago

Another point where things seem unclear: should we drop metadata in an as_tibble.epi_df method? This seems like it would correspond to grouped_df, rowwise_df, and other classes' behavior of dropping groupings (though the former is viewed in vctrs as a more general concept than ordinary tibbles, while epi_dfs are a narrower concept). And we have the metadata dropped when as_tibble-ing grouped epi_dfs currently anyway. I'll likely implement metadata-dropping in #311.

brookslogan commented 2 months ago

Having an epi_df_slice/epi_df_selection/sub_epi_df / epi_df_group / epi_df_view class would also allow us to disable accidental use of mean/sum on .x in slide computations when .x$value was meant instead. By simply providing an S3 method that complains. Currently we lose epi-dfness when doing slides grouped by geos, so doing the same for epi_dfs isn't going to catch much. [Though alternatively, we may be able to use output column de-duplication on unnesting (rather than errors) + do keep = TRUE for epi_slides as we do epix_slides. Then epi_df impls should catch it.]

brookslogan commented 3 weeks ago

Some notes:

Our current application of ?dplyr_extending is awkward, routing many/all core ops through a dplyr_reconstruct with all validation checks. Experimenting with a "real" application in brookslogan/dplyr.extending.test
We avoid some issues by using .keep = TRUE in group_map ops both epix_slide and epi_slide now/soon. But if a user directly uses group_modify with the default .keep = FALSE, then they could run into annoyances, so we should at least maintain a group_modify method to fix up there. But there's less need for a slice/group/view class given these adjustments.

brookslogan commented 2 weeks ago

Requiring key uniqueness makes geo/time aggregation awkward; it'd be nice if we could edf %>% group_by(geo_value = state_of(geo_value), time_value = week_of(time_value)) %>% summarize(value = sum(value)) and get an epi_df out