Open brookslogan opened 2 years ago
Might have ran into one of these cases in #64 where grouped epi_df
s being bound together lost their epi_df
ness. Implementing a dplyr_reconstruct.epi_df
based on [.epi_df
seems to partially resolve. But note this exacerbates #242 (which underlies #213).
May also related to #117, cmu-delphi/epipredict#51.
A potential issue with ?dplyr_extending
guidance: arrange
on grouped epi_df
s drops epi_df
ness. We need to implement dplyr_row_slice
as well. (We also need to implement other methods interacting with grouping: group_by
, ungroup
, group_modify
[or maybe not; an appropriate group_data
+ any other required extensions of grouped_df
methods (see below) might get this "for free"/"properly"]; as well as unnest
; we may be missing some others.)
There are a few other issues with grouped epi dfs in the current dev
/ lcb/make_epix_slide_more_like_reframe
branches; see #113, #298. Note:
group_data
method, alluded to in ?dplyr_extending
. We'd want either/both (a) "epi_df"
before "grouped_df"
in the class vector to be able to dispatch to group_data.epi_df
instead of group_data.grouped_df
, or (b) to mess with the class of the "groups"
attr generated by group_by
(doesn't sound great.)"grouped_df"
before "epi_df"
due to some pillar
methods for grouped_df
expecting it to be before more base-like classes in the class vector.1(a) and 2 (and maybe more points) seem to conflict. One potential resolution:
epi_df
s use the class c("grouped_edf", "grouped_df", "epi_df", "tbl_df", "tbl", "data.frame")
. That's also suggested in this older thread.While applying the "potential resolution" above seems to help in some aspects, it also loses epi_df
-ness in a lot of other situations. The problem appears to be that grouped_df
's methods don't really respect dplyr_extending
for grouped_*
classes in front of it. So we'd need to look at methods(,"grouped_df")
and/or methods(,"grouped_{somethingelse}")
and make sure to override/mirror the things we want. [2024-09-24: perhaps this wasn't/isn't the case; it seems you can get these ops to work as long as you make sure not to lose your head class after any NextMethods in dplyr_extending impls, including e.g. names<-
, and this may be sufficient. Also, looking at grouped_df S3 methods might not have even worked, as there might be some non-S3 functions manually branching on grouped_df-ness.] Related: #145. [We'll need to look in some packages beyond dplyr
for these, e.g., for vctrs::restore
.]
[Same approach would need to be applied for rowwise edfs. Plus, in any approach, we will want to review tidyr for more methods that may need wrapped like, e.g., unnest
.]
More recent reference: https://github.com/tidyverse/dplyr/issues/5480
Also, I'm not sure if the approaches with a separate grouped_{something}
class approach were all limited to tidyverts and not tidyverse. It seems like tidyverse things like dtplyr and sf might use either (a) manual delegation rather than via NextMethod
(and have custom print
with as_tibble
, format
, [-1]
to remove header line(s)) or (b) the single-class-in-front-of-grouped_df approach with some additional methods implemented (and just has the same "# A tibble: [...]" line sticking around).
Progress is in lcb/round-out-grouped-epi_dfs
. However, it already seems to have some problems with summarize drop_last seeing time_value as an additional grouping variable when we are only grouping by geo_value, and thus grouping results by geo_value by default when outputting 1 row per computation [within epi_cor. Seems like we can't just use reclass or dplyr_reconstruct everywhere without thinking, and/or there's an issue with the current dplyr_reconstruct.grouped_edf; dplyr_reconstruct.grouped_df tries to take grouping from template (intersecting with available columns in data)]. Also needs further, more focused, tests that it actually fixes what we wanted to fix. Some other existing test(s) are broken from not dropping data.table attributes that are now being kept around by group_modify result assembly in epix_slide (and kept inside the computations by as_tibble).
The questions raised by how to reconstruct a grouped summarize might tie into questions about when epix_slide
should output an epi_df
, when geo_value
might either come from "outside", when the the slide is grouped by geo_value
, or the "inside", when the slide isn't grouped but the user computations output epi_df
s (with geo_value
s). A useful concept might be an epi_df_selection
/epi_df_slice
/sub_epi_df
or an epi_df_group
, which would be a special class we'd use, e.g., as the first or both args in group_modify
computations, but only decaying if essential columns don't appear in either arg, rather than decaying when they're gone from the first arg. First, this special class would solve us having to use .keep = TRUE
to keep a relevant class passed in the first arg of group_modify
, and second, it might help in other summarization situations to avoid having to attempt reconstructing things against a template that might not be very similar (since summarization can alter so much). [Alternatively, there could be an epi_df_groups
class output from group_data
and hopefully retained until it's time to unnest, etc.]
Another point where things seem unclear: should we drop metadata in an as_tibble.epi_df
method? This seems like it would correspond to grouped_df
, rowwise_df
, and other classes' behavior of dropping groupings (though the former is viewed in vctrs as a more general concept than ordinary tibbles, while epi_df
s are a narrower concept). And we have the metadata dropped when as_tibble
-ing grouped epi_df
s currently anyway. I'll likely implement metadata-dropping in #311.
Having an epi_df_slice
/epi_df_selection
/sub_epi_df
/ epi_df_group
/ epi_df_view
class would also allow us to disable accidental use of mean
/sum
on .x
in slide computations when .x$value
was meant instead. By simply providing an S3 method that complains. Currently we lose epi-dfness when doing slides grouped by geos, so doing the same for epi_df
s isn't going to catch much. [Though alternatively, we may be able to use output column de-duplication on unnesting (rather than errors) + do keep = TRUE for epi_slides as we do epix_slides. Then epi_df impls should catch it.]
Some notes:
?dplyr_extending
is awkward, routing many/all core ops through a dplyr_reconstruct
with all validation checks. Experimenting with a "real" application in brookslogan/dplyr.extending.test.keep = TRUE
in group_map
ops both epix_slide
and epi_slide
now/soon. But if a user directly uses group_modify
with the default .keep = FALSE
, then they could run into annoyances, so we should at least maintain a group_modify
method to fix up there. But there's less need for a slice/group/view class given these adjustments.edf %>% group_by(geo_value = state_of(geo_value), time_value = week_of(time_value)) %>% summarize(value = sum(value))
and get an epi_df
out
I believe that we implemented
epi_df
without the use of?dplyr_extending
. We may have:We should review
?dplyr_extending
to see if it hints at bugs, enhancements, code simplifications, etc. (E.g., the mention of[
might have hinted that we would want a[
implementation before we discovered it on our own.)