`group_by()` for `epi_archive` objects

Closed 1 year ago

ryantibs commented 2 years ago

Should we implement group_by() as a public method in the epi_archive object? We could do that since it's an R6 object, and when group_by() is called, we could just have it set a private field called group_keys to whatever variables are passed.

The only downstream behavior this would affect is the as_of() and slide() methods for the epi_archive object. (These are its own public methods.) These would either return a grouped epi_df snapshot, or do a grouped sliding computation, respectively.



@lcbrooks @dajmcdon @jacobbien What do you guys think?

dajmcdon commented 2 years ago

I feel like the Pro is pretty major, and the Con is pretty minor. But maybe I'm not representative.

My potential confusion:

For dplyr, group_by() followed by summarize() results in an ungrouped object while it remains grouped after mutate(). It feels to me that _slide() is more like mutate() than summarize(), so maybe leaving it grouped isn't so odd? (With the caveat that all of this involves assignment to a result).

On the other hand, the grouping/not behaviour in dplyr is one of those things that I often forget about, and I find myself erroneously assuming something is ungrouped. This has happened many times, so maybe I'm an outlier in terms of actually absorbing the dplyr logic.
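For reference, the dplyr grouping behaviour under discussion can be checked with a small sketch. The tibble and its column names (geo, age, value) are made up for illustration and are not from epiprocess; `.groups` is passed explicitly so that the result does not depend on dplyr's default messaging.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative only.
df <- tibble(
  geo = c("a", "a", "b", "b"),
  age = c("young", "old", "young", "old"),
  value = c(1, 2, 3, 4)
)

# mutate() keeps the grouping.
mutated <- df %>% group_by(geo) %>% mutate(centered = value - mean(value))
is_grouped_df(mutated)
#> [1] TRUE

# summarize() with a single grouping variable peels off that (last) group,
# leaving an ungrouped result.
summarized_one <- df %>%
  group_by(geo) %>%
  summarize(total = sum(value), .groups = "drop_last")
is_grouped_df(summarized_one)
#> [1] FALSE

# With two grouping variables, "drop_last" leaves the result grouped by the first.
summarized_two <- df %>%
  group_by(geo, age) %>%
  summarize(total = sum(value), .groups = "drop_last")
group_vars(summarized_two)
#> [1] "geo"

# .groups = "drop" removes all grouping.
dropped <- df %>%
  group_by(geo, age) %>%
  summarize(total = sum(value), .groups = "drop")
is_grouped_df(dropped)
#> [1] FALSE
```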

brookslogan commented 2 years ago

I think the drawback Ryan is pointing out is not that we need to group(....) %>% epix_slide(.....) %>% ungroup(....) to get no groups, but rather, is this situation:

x = <ungrouped epi_archive>
z1 = x %>% epix_slide(<stuff1>)
y = x %>% group_by(grpvar) %>% epix_slide(<stuff2>)
z2 = x %>% epix_slide(<stuff1>) # different from z1!

epi_slide is like mutate, but epix_slide is more like summarize [except it still partially broadcasts]; while the former only adds columns, the latter will be dropping nonkey nongroup columns + outputting a different class result (epi_archive --> epi_df). I believe that, until recently, summarize left things grouped anyway or wouldn't drop them completely; then it moved to .groups="drop_last" + a message by default; now it seems to be .groups="drop_last" + no message by default. So users should be used to worrying about what happens to their groups... but maybe we could provide a .groups option with a noisy default as well?

[This con] would be [re]solved by moving away from R6+data.table for epi_archive and instead using list+lazy_dt, or having epi_archive be a non-reference-semantics wrapper on top of an R6 EpiArchive backend. We decided this move was low priority though.
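The z1 vs. z2 situation above can be reproduced with a toy R6 class. This is only a sketch: ToyArchive, group_by_in_place, and slide_description are hypothetical names invented here, standing in for an epi_archive whose group_by() sets a private group_keys field in place.

```r
library(R6)

# Hypothetical stand-in for an archive whose grouping is stored by mutation.
ToyArchive <- R6Class("ToyArchive",
  public = list(
    # Mutates the object itself and returns it invisibly.
    group_by_in_place = function(vars) {
      private$group_keys <- vars
      invisible(self)
    },
    # Stand-in for a slide computation: just reports which grouping it would use.
    slide_description = function() {
      if (length(private$group_keys) == 0L) {
        "ungrouped slide"
      } else {
        paste("slide grouped by", paste(private$group_keys, collapse = ", "))
      }
    }
  ),
  private = list(group_keys = character(0))
)

x <- ToyArchive$new()
z1 <- x$slide_description()
y <- x$group_by_in_place("grpvar")$slide_description()
z2 <- x$slide_description() # same call as z1, but x was modified in between

identical(z1, z2)
#> [1] FALSE
```

Because R6 objects have reference semantics, the grouping set while computing y leaks into the later computation of z2.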

brookslogan commented 2 years ago

Addressing/ameliorating the Con:

dajmcdon commented 2 years ago

Oh, I see. Thanks Logan!

ryantibs commented 2 years ago

@lcbrooks @dajmcdon What is your current thinking on this?

Based on our conversation yesterday, it seems like we want to abide by the rule:

Functions applied to R6 objects should not be side-effectful. Only public member functions should be side-effectful.

So that means that we could go with something like Logan's "medium" solution: group_by() should clone everything in the epi_archive object EXCEPT the underlying data.table, and return it as a grouped_epi_archive object.

Are we OK with the fact that we don't clone the underlying data.table? I think so, provided we record this very clearly in the documentation. Plus, we could think of this as not an "R6 special case handling", but a "data table special case handling". Data tables are not cloned; they are typically modified in place (by reference).


brookslogan commented 2 years ago

Thinking about this clone-based "medium" solution:

Just regular archive$clone() might work; I think it's shallow by default (and don't know whether the deep clone feature actually recognizes data.tables and does the special copy operation that would be required). Eventually we may want to move to the "higher effort" solution... we might want to document that while currently, it will be pointing to the same DT as the original, this might change in the future to use something that will copy on write, so users shouldn't rely on this pointer behavior?
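A small sketch of the pointer behavior in question, assuming a toy R6 class (ToyArchive is a hypothetical name, not the epiprocess class) with a data.table stored in a public DT field. R6's clone() defaults to deep = FALSE, so the clone's DT field refers to the same data.table as the original; an independent table needs an explicit data.table::copy().

```r
library(R6)
library(data.table)

# Hypothetical minimal archive-like class holding a data.table.
ToyArchive <- R6Class("ToyArchive",
  public = list(
    DT = NULL,
    initialize = function(DT) {
      self$DT <- DT
    }
  )
)

original <- ToyArchive$new(data.table(geo = c("a", "b"), value = c(1, 2)))

# Shallow clone: a new R6 object, but the same underlying data.table.
shallow <- original$clone()
address(original$DT) == address(shallow$DT)
#> [1] TRUE

# A by-reference update through the clone is visible through the original.
set(shallow$DT, i = 1L, j = "value", value = 99)
original$DT$value[1]
#> [1] 99

# An explicit copy() gives an independent table.
independent <- ToyArchive$new(copy(original$DT))
set(independent$DT, i = 1L, j = "value", value = -1)
original$DT$value[1]
#> [1] 99
```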

There's a bit of complication due to the combination of S3 and R6. A couple of decisions to make:

Top contenders are

A1 involves more S3 implementations, but could also catch some issues when we don't want to directly inherit epi_archive behavior. B1 is closest to dplyr, but might or might not confuse R6 users with the triple class. C2 might be a little confusing with group methods listed for ungrouped archives.

I can't see a clear winner here. Maybe A1 or B1 over C2.

ryantibs commented 2 years ago

@lcbrooks Thanks Logan for the super detailed proposal and analysis!

I'm in favor of B1. The only negative you point out is that it has a triple class, but I don't really see this as a negative to be honest.

Tagging @dajmcdon to see if he has any further thoughts, but barring that, I think to get the ball rolling, you could go ahead and implement this. Seems like it could be a good idea to address #67 in the same PR since it's a related issue.

brookslogan commented 2 years ago

Returning to this. I was trying to write a justification for B1 in a comment, and in doing so convinced myself to switch to A1, because looking at, e.g., the backcasting preprocessors and compactify, I think we're going to want to force ourselves and users to be precise about when we're dealing with an ungrouped vs. a grouped epi_archive. And the overhead is comparable to B1, because with B1, we still need to check every inherited function to make sure it's compatible; it's just that if we forget to implement a method in A1, we get errors about unimplemented methods, which can be worked around, and if we forget to specialize/disable in B1, we get invalid results. I'm also adding a safeguard against re-grouping a grouped epi_archive. The epipredict development should hopefully give us an idea of how this all works out.
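The failure modes contrasted above can be illustrated with plain S3 dispatch. This is a sketch only: the class names toy_archive and toy_grouped_archive and the generic describe are hypothetical, and it assumes (as the comment suggests) that A1 means the grouped class does not inherit from the ungrouped one, while B1 means it does.

```r
# Hypothetical generic with a method only for the ungrouped class.
describe <- function(x, ...) UseMethod("describe")
describe.toy_archive <- function(x, ...) "ungrouped computation"

# A1-style: the grouped class stands alone, so a forgotten method is an error.
a1 <- structure(list(), class = "toy_grouped_archive")
a1_result <- tryCatch(
  describe(a1),
  error = function(e) "error: method not implemented"
)
a1_result
#> [1] "error: method not implemented"

# B1-style: the grouped class inherits from the ungrouped one, so a forgotten
# specialization silently falls back to the ungrouped method.
b1 <- structure(list(), class = c("toy_grouped_archive", "toy_archive"))
b1_result <- describe(b1)
b1_result
#> [1] "ungrouped computation"
```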

brookslogan commented 1 year ago

I have what looks like a working implementation on lcb/grouped_epi_archive. I originally hoped to keep the group_by argument to epix_slide, but with a deprecation warning until we discussed what to do with it; that is what is currently implemented. However, with the group_by function making this more like dplyr, we have the dplyr-based (and epi_slide-based) expectation that the default grouping will be no groups, not the key minus version&time_value. So if we are already making a breaking change to the default epix_slide grouping, it might make sense to simultaneously make the breaking change of removing the group_by parameter, especially since the existing approach adds a groups parameter (mirroring summarize's .groups), which might get a little confusing next to group_by. Assuming no objections, the remaining steps are:

brookslogan commented 1 year ago

Musings on group_by vs. $group_by:

Having $group_by mutate the DT with set may not make sense, as it just wastes space with copies:

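library(data.table)

DT = data.table(a = 2:6, b = 3:7)
c = 1:5 # potentially uses ALTREP
d = c(1, 5, 2, 6, 3) # probably no ALTREP
old_DT_address = address(DT)
old_a_address = address(DT$a)
# Drop column b and add columns c and d, all by reference.
set(DT, , c("b", "c", "d"), list(NULL, c, d))
# The table itself and its untouched column a keep their addresses...
address(DT) == old_DT_address
#> [1] TRUE
address(DT$a) == old_a_address
#> [1] TRUE
# ...but the new columns are copies of c and d, not aliases of them.
address(DT$c) == address(c)
#> [1] FALSE
address(DT$d) == address(d)
#> [1] FALSE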

It would stay truer to the data.table model, though I'm not sure there's a case where that is useful. I believe that the data.table model is that generally (but not in all cases, as with as.list), if you have a unique pointer to the data.table, it should contain unique pointers to each of the columns. So aliasing some columns would generally be forbidden (and so set copies the columns before adding). But it seems very rare to directly mutate column contents rather than column pointers, so trying to reduce the cases where the former could be problematic at the cost of extra copies in the latter, much more common, scenario doesn't seem that useful.

So maybe we have $group_by break the typical data.table model and make the user worry about aliases not only when mutating things that could be aliasing data tables, but also contents of vectors that could be aliasing data table columns. But it could still be different than group_by; e.g., $group_by might set the DT field to its mutated value. There might be some debate about what's most easily flexible, especially if we don't expose the partially-aliasing limited detailed mutate. However, probably a stronger argument is to not allow this because it would be mutating one thing (original archive's DT) and returning a different thing (grouped archive), rather than returning the same thing invisibly.

--- Conclusion: --- Having $group_by act in the same way as group_by seems to be the cleanest approach. (This is the current approach in the feature branch, so we can just check this item off the list.)
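As a rough illustration of this conclusion, here is a sketch in which a `$group_by` method and a plain function produce the same kind of result and neither modifies the original object. All names (ToyArchive, ToyGroupedArchive, toy_group_by) are hypothetical, a data.frame stands in for the underlying table, and this is not the code in the feature branch.

```r
library(R6)

# Hypothetical grouped wrapper: records the ungrouped archive and the grouping.
ToyGroupedArchive <- R6Class("ToyGroupedArchive",
  public = list(
    ungrouped = NULL,
    vars = NULL,
    initialize = function(ungrouped, vars) {
      self$ungrouped <- ungrouped
      self$vars <- vars
    }
  )
)

# Hypothetical ungrouped archive.
ToyArchive <- R6Class("ToyArchive",
  public = list(
    DT = NULL,
    initialize = function(DT) {
      self$DT <- DT
    },
    # Returns a new grouped object; `self` is left untouched.
    group_by = function(vars) {
      ToyGroupedArchive$new(self, vars)
    }
  )
)

# Function form: simply delegates to the method, so both behave the same.
toy_group_by <- function(archive, vars) archive$group_by(vars)

archive <- ToyArchive$new(data.frame(geo = c("a", "b"), value = c(1, 2)))
g1 <- archive$group_by("geo")
g2 <- toy_group_by(archive, "geo")

identical(g1$vars, g2$vars)
#> [1] TRUE
inherits(archive, "ToyGroupedArchive") # the original is still ungrouped
#> [1] FALSE
```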