Closed ryantibs closed 1 year ago
I feel like the Pro is pretty major, and the Con is pretty minor. But maybe I'm not representative.
My potential confusion:
For dplyr
, group_by()
followed by summarize()
results in an ungrouped object while it remains grouped after mutate()
. It feels to me that _slide()
is more like mutate()
then summarize()
, so maybe leaving it grouped isn't so odd? (With the caveat that all of this involves assignment to a result).
On the other hand, the grouping/not behaviour in dplyr
is one of those things that I often forget about, and I find myself erroneously assuming something is ungrouped. This has happened many times, so maybe I'm an outlier in terms of actually absorbing the dplyr
logic.
I think the drawback Ryan is pointing out is not that we need to group(....) %>% epix_slide(.....) %>% ungroup(....)
to get no groups, but rather, is this situation:
x = <ungrouped epi_archive>
z1 = x %>% epix_slide(<stuff1>)
y = x %>% group_by(grpvar) %>% epix_slide(<stuff2>)
z2 = x %>% epix_slide(<stuff1>) # different from z1!
epi_slide
is like mutate
, but epix_slide
is more like summarize
[except it still partially broadcasts]; while the former only adds columns, the latter will be dropping nonkey nongroup columns + outputting a different class result (epi_archive --> epi_df). I believe that, until recently, summarize
left things grouped anyway or wouldn't drop them completely; then it moved to .groups="drop_last"
+ a message by default; now it seems to be .groups="drop_last"
+ no message by default. So users should be used to worrying about what happens to their groups... but maybe we could provide a .groups
option with a noisy default as well?
[This con] would be [re]solved by moving away from R6+data.table for epi_archive and instead using list+lazy_dt, or having epi_archive be a non-reference-semantics wrapper on top of an R6 EpiArchive backend. We decided this move was low priority though.
Addressing/ameliorating the Con:
grouped_x = group_by(x, grpvar); grouped_x %>% epix_slide(<stuff1>); grouped_x %>% epix_slide(<stuff1>)
they'd get different answers. Based on how I've used group_by and summarize in the past, I think this would be less likely to encounter compared to situation I noted in the post above [.... but I'm not sure on second thought]. (((We could prevent this by warning or stopping if there is no group setting in epix_slide. (To use no groups they would have to do x %>% group_by() %>% epix_slide(.....)
or something like that.))))Oh, I see. Thanks Logan!
@lcbrooks @dajmcdon What is your current thinking on this?
Based on our conversation yesterday, it seems like we want to abide by the rule:
Functions applied to R6 objects should not be side-effectful. Only public member functions should be side-effectful.
So that means that we could go with something like Logan's "medium" solution: group_by()
should clone everything in the epi_archive
object EXCEPT the underlying data.table
, and return it as an grouped_epi_archive
object.
Are we OK with the fact that we don't clone the underlying data.table
? I think so, provided we record this very clearly in the documentation. Plus, we could think of this as not an "R6 special case handling", but a "data table special case handling". Data tables are not cloned, they are typically handled in memory.
Thoughts?
Thinking about this clone-based "medium" solution:
Just regular archive$clone()
might work; I think it's shallow by default (and don't know whether the deep clone feature actually recognizes data.table
s and does the special copy
operation that would be required). Eventually we may want to move to the "higher effort" solution... we might want to document that while currently, it will be pointing to the same DT as the original, this might change in the future to use something that will copy on write, so users shouldn't rely on this pointer behavior?
There's a bit of complication due to the combination of S3 and R6. A couple of decisions to make:
group_by
output S3 class c("grouped_epi_archive","R6")
, c("grouped_epi_archive", "epi_archive","R6")
, or c("epi_archive","R6")
?
epi_archive
implements and forwarding to the epi_archive
implementation. If users write an S3 implementation for an epi_archive
they will need to also write one for grouped_epi_archive
s. But we will also not accidentally automatically get non-grouped behavior for any methods that we forgot to write special grouped implementations for. group_by
output R6 class grouped_epi_archive
or epi_archive
(with grouping support built into epi_archive
)?
Top contenders are
c("grouped_epi_archive","R6")
, R6 class is "grouped_epi_archive"
c("grouped_epi_archive","epi_archive","R6")
, R6 class is "grouped_epi_archive"
c("epi_archive","R6")
, R6 class is "epi_archive"
A1 involves more S3 implementations, but could also catch some issues when we don't want to directly inherit epi_archive behavior. B1 is closest to dplyr, but might or might not confuse R6 users with the triple class. C2 might be a little confusing with group methods listed for ungrouped archives.
I can't see a clear winner here. Maybe A1 or B1 over C2.
@lcbrooks Thanks Logan for the super detailed proposal and analysis!
I'm in favor of B1. The only negative you point out is that it has a triple class, but I don't really see this as a negative to be honest.
Tagging @dajmcdon to see if he has any further thoughts, but barring that, I think to get the ball rolling, you could go ahead and implement this. Seems like it could be a good idea to address #67 in the same PR since it's a related issue.
Returning to this. I was trying to write a justification for B1 in a comment, and in doing so convinced myself to switch to A1, because looking at, e.g., the backcasting preprocessors and compactify, I think we're going to want to force ourselves&users be precise about when we're dealing with an ungrouped vs. a grouped epi_archive
. And the overhead is comparable to B1, because with B1, we still need to check every inherited function to make sure it's compatible; it's just that if we forget to implement a method in A1, we get errors about unimplemented methods, which can be worked around, and if we forget to specialize/disable in B1, we get invalid results. I'm also adding a safeguard against re-grouping a grouped epi_archive
. The epipredict
development should hopefully give us an idea of how this all works out.
I have what looks like a working implementation on lcb/grouped_epi_archive. I originally hoped to keep the group_by
argument to epix_slide
, but with a deprecation warning until we discussed what to do with it; that is what is currently implemented. However, with the group_by
function making this more like dplyr
, we have the dplyr
-based (and epi_slide
-based) expectation that the default grouping will be no groups, not the key minus version&time_value. So if we are already making a breaking change to the default epix_slide grouping, it might make sense to simultaneously make the breaking change of removing the group_by
parameter, especially since the existing approach adds a groups
parameter (mirroring summarize
's .groups
), which might get a little confusing next to group_by
. Assuming no objections, the remaining steps are:
summarize
group_by arg
(allow, but warn) ---> removegroup_by
arg being passed and raise error informing of interface change.epi_df
] add tidyselect helpers for the key columns, similar to how tidymodels has all_predictors()
group_by
... they are tidyselect rather than data masking, and don't allow definition of new columns%>% group_by
vs. $group_by
desired behaviorgrouped_epi_archive
is to print
, slide
, group_by
with .add=TRUE
, or ungroup
. We could keep it that way to indicate the only operations that are worth grouping for, or forward/implement more groupsplit-map-bind compatible methods.n
-> before
epix_slide
with epi_slide
and existing epix_slide
regarding whether grouping variables are accessible in .x
vs. put into .y
; ties into whether .x
is a valid epi_df
group_modify
to summarize
didn't include passing a dplyr::cur_group()
, so the group key isn't received. [ --- Addressed by moving back to group_modify
in epix_slide
implementation.]dplyr::cur_data_all()
includes the columns we'd need to keep to ensure the first arg to each f
call is an epi_df
, but actually doesn't keep the epi_df
class or metadata, and including these columns also differs from the current epi_slide
behavior, impacting the output class and metadata. [Might be an epi_df
S3 method issue.] -----> Make it (like) dplyr::cur_data()
, with epi_df
class if it's valid, else a tbl_df
. Check that this is consistent with epi_slide
--- it is, but it's not consistent with the old epix_slide
behavior (which used .keep=TRUE
vs. epi_slide
's .keep=FALSE
) --- a change to be documented. Might need to go back to group_modify
if getting cur_data()
takes too much work to get in right format, but this change will require some way of getting the groups
argument behavior (+ checking treatment of group_by
additional args) --- or to just not have this groups
behavior and mirror group_modify
again rather than summarize
--- but neither is compatible with the definite ungroup
-ing of old epix_slide
, which also did not match epi_slide
. [ --- Addressed by moving back to group_modify
in epix_slide
implementation and removing groups
arg.]groups
default behavior is incompatible with the behavior of epi_slide
. [But that matches with the mutate
-like vs. summarize
-like behavior of epi_slide
vs. epix_slide
.] [ --- Addressed by removing ungroup
ing in epix_slide
implementation.]group_modify
s doing the same thing for different types of f
, ...
(TODO: check for / file issue to refactor so there is no duplication, here and in epi_slide
)cur_data_all()
dropping class & metadataepi_slide
n
time steps remaining in n
-> before
, after
; maybe just here to avoid remerge. Also fix "running window", "over variables in x
". Also check in epi_slide
docs after merging n
-> before
, after
there. epix_slide
tests togethersummarize
for sliding, check whether dplyr use in f
could lead to this issue and address / document how to avoid [--- did not check; back to using group_modify
because cur_data_all()
drops epi_df
ness, unlike group_modify
first arg to computation]group_by
behavior, ~default groups
behavior,~ removed group_by
parameter, ~ways to access grouping variables in f
~[, ..cols]
-> [, cols, with=FALSE]
to silence false positive check()
note [ --- did nothing; no longer seeing this note; maybe updated checks don't trigger]groups
parameter -> .groups
, to match summarize
? or do with rest of https://github.com/cmu-delphi/epiprocess/issues/162 [no longer applies; groups
parameter has been removed]$print
for grouped_epi_archive
to print [forwarded/specialized methods, additional methods, missing methods] wrt to epi_archive
.epix_slide()
uses a group_by
to specify the grouping upfront (in"epix_group_by
function delegating to group_by
and epix_ungroup
for ungroup
? ---> No, at least not without also creating epi_group_by
and epi_ungroup
for epi_df
s.ungroup
, or match dplyr
behavior of just not renaming (would require fetching current names at integer indices rather than "new" name indices).ungroup
on non-grouping variables. [--- we match dplyr behavior of allowing non-current-grouping variables in the vars-to-"remove"-from-grouping; not sure if this is the best decision]all_rows
for epix_slide
.~ [migrated]ungroup
after slides in vignettes.groups
and ungroup
for ungrouped epi_archive
s. Good for making routines that work on both without needing a %>% group_by()
to add trivial grouping, although at the cost of making it easier to introduce bugs related to using some method exclusive to grouped or ungrouped methods and thinking it works generally because it worked on one of them.~ [hold off]Musings on group_by
vs. $group_by
:
Having $group_by
mutate the DT
with set
may not make sense, as it just wastes space with copies:
library(data.table)
DT = data.table(a = 2:6, b=3:7)
c = 1:5 # potentially uses ALTREP
d = c(1,5,2,6,3) # probably no ALTREP
old_DT_address = address(DT)
old_a_address = address(DT$a)
set(DT, , c("b", "c", "d"), list(NULL, c, d))
address(DT) == old_DT_address
#> [1] TRUE
address(DT$a) == old_a_address
#> [1] TRUE
address(DT$c) == address(c)
#> [1] FALSE
address(DT$d) == address(d)
#> [1] FALSE
Created on 2022-09-14 with reprex v2.0.2
It would stay more true to the data.table
model, though, but I'm not sure if there's a case where that is useful. I believe that the data.table
model is that generally (but not in all cases as with as.list
), if you have a unique pointer to the data.table
, it should contain unique pointers to each of the columns. So aliasing some columns would generally be forbidden (and so set
copies the columns before adding). But it seems very rare to directly mutate column contents rather than column pointers, so trying to reduce the cases where the former could be problematic at the cost of extra copies in the latter, much more common, scenario, doesn't seem that useful.
So maybe we have $group_by
break the typical data.table
model and make the user worry about aliases not only when mutating things that could be aliasing data tables, but also contents of vectors that could be aliasing data table columns. But it could still be different than group_by
; e.g., $group_by
might set the DT
field to its mutated value. There might be some debate about what's most easily flexible, especially if we don't expose the partially-aliasing limited detailed mutate. However, probably a stronger argument is to not allow this because it would be mutating one thing (original archive's DT) and returning a different thing (grouped archive), rather than returning the same thing invisibly.
--- Conclusion: ---
Having $group_by
act in the same way as group_by
seems to be the cleanest approach. (This is the current approach in the feature branch, so we can just check this item off the list.)
Should we implement
group_by()
as public method in theepi_archive
object? We could do that since it's an R6 object, and whengroup_by()
is called, we could just have it set a private field calledgroup_keys
to whatever variables are passed.The only downstream behavior this would affect is the
as_of()
andslide()
methods for theepi_archive
object. (These are its own public methods.) This would either return a groupepi_df
snapshot, or do a grouped sliding computation, respectively.Pros:
epi_archive
objectx
to do:to do a sliding computation grouped by
geo_value
. This would be precisely analogous to how we would do it on anepi_df
object withepi_slide()
. Currently we have to set the grouping variables as an argument toepix_slide()
.Cons:
epix_slide()
, the objectx
above would still be grouped. This is because R6 objects obey reference semantics. So we'd have toungroup()
it, which may be a bit counterintuitive since it's not howdplyr
verbs work on data frames or tibbles.@lcbrooks @dajmcdon @jacobbien What do you guys think?