cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 9 forks source link

Consider tidyselect in `epi_slide()` #18

Closed brookslogan closed 2 years ago

brookslogan commented 2 years ago

Users may want to directly write slide_by_geo(newcol = f(oldcol), ........) rather than separately specifying the function and name. We could use rlang/tidyselect stuff to accomplish this.

transmute_scattered_ts and mutate_scattered_ts from #7 performs the time_value wrangling required to apply functions assuming that values correspond to increasing, evenly-spaced time values. One can basically just write a transmute statement and add the additional time window info requires. This might potentially help implement this revision. It would be group-by-geo followed by mutate_scattered_ts and ungroup. The mutation might be tricky as mutate_scattered_ts works on vectorized functions while slide_by_geo works on non-vectorized functions.

ryantibs commented 2 years ago

Thanks @brookslogan. That seems like it could/would be great for usability and readability! If you can get a demo working I think it'd be really nice to see.

Note: I think what you're suggesting could replace the formula option in the slide function. It'd be much cleaner. However, in certain scenarios, I could imagine that you'd still want to be able to specify the function directly, i.e., when it's a complicated function that takes the whole input data frame (row-subsetted appropriately), like in some forecasting tasks.

ryantibs commented 2 years ago

Adding more thoughts to record them, even if just for my own sake. In a helpful side conversation, @earowang suggested something similar, where we refactor things so that there aren't separate functions for pct_change(), etc., but these operate within a call to epi_slide() (just like mutate()), as in:

covid_data %>% epi_slide(cases_pct_change = pct_change(cases, n = 14))

That is, pct_change() and deriv() become helpers to be used inside a call to epi_slide(), rather than standalone functions as they are now. (Here I guess I would rename estimate_deriv() to deriv().)

It seems like we could pull this off similarly to how @earowang implemented index_by() in the tsibble package, see: https://github.com/tidyverts/tsibble/blob/7261fbb858d5d1ab82cce7a7bebeffcb08e3915f/R/index-by.R#L83-L113.

One thing I'm uncertain of is whether the aforementioned implementation leverages tidyselect functionality, or does it differently using expression evaluation tools from rlang. I didn't study it careful enough to know (nor do I know enough about these topics to be able to tell).

ryantibs commented 2 years ago

Wondering what @earowang thinks about my last comment and the example I gave there? Is that along the lines of what you were thinking?

And do you have an opinion on what the path towards implementation would be---would it be good to follow the example you gave in index_by() as I highlighted above?

earowang commented 2 years ago

Does epi_slide() by default slide by geo_value?

Can we group currently available functions into two sections: (1) table functions handling data frames and (2) vector functions?

Regarding table functions, you may like to name them more generically, e.g. slide_by(). For vector functions, you could add epi_ prefix, e.g. epi_pct_change(), which enables autocompletion for less typing.

covid_data %>% 
  slide_by(cases_pct_change = epi_pct_change(cases, n = 14))
brookslogan commented 2 years ago

I don't think epi_slide groups by geo_value anymore after #14; the user will need to group_by first.

(Regarding naming I don't have a clear opinion right now. Maybe consider potential confusion introduced by mixing epi_ and non-epi_ functions in the same package when a few existing packages have the pattern of prefixing nearly all members (e.g., stringr::str_*).)

By tidyselect I wasn't being very precise. I think it makes sense to try to mirror the behavior of transmute and/or mutate; indeed, transmute_scattered_ts and mutate_scattered_ts from #7 comments are implemented on top of one of those.

Another detail to think about: what should be the output time_values? Users may consider using epi_slide to prepare covariate and/or response data, but things get tricky; consider:

epi.tibble %>% 
  group_by(geo) %>%
  epi_slide(x1= dplyr::lag(cases, 3L), x2 = dplyr::lag(cases, 10L))

Run on daily data with no missingness, the output will not include the most recent 3 time_values, and the first three time_values will have NA output. For real-time forecasting, the dropped time_values might include the test instance. This is undesirable. But what would be the interface to specify what the time_values should be? (I can't think of anything that avoids this dropped-data situation that doesn't involve manually specifying the output time_values, typing lead&lag values multiple times and potentially making mistakes, or first going through another function that users may not know about.)

Now consider also including y = dplyr::lead(cases, 14L); now there will be some instances with non-missing ys dropped, but I think this is less of a concern since we'd normally chop off these anyway because all the covariates will probably be lagging and be NA.

ryantibs commented 2 years ago

Thanks both! I like the idea of separating out at least conceptually into table functions and vector functions.

Re naming: I'm not sure it makes sense to prefix everything by epi, since many functions in this package will be generic ones. I'd prefer to keep the prefix to the few things that are actually special to the epi_df data structure (the latest name suggested for this data structure, see #27).

Re mutate: can you look at the code I linked above (in Earo's index_by() implementation) and see whether you think it's suitable, and/or how it differs from yours? I understand you want to mimic mutate(), and I think that's a good idea, but I'm just trying to reconcile what the template should be for how to do this.

Re time values: this is a good point. But I think any desired behavior could be accomplished with some combination of setting the following two arguments:

  1. complete (should we require a complete window for sliding? see #21); and
  2. na_rm (should the underlying slide function let NA inputs propagate into an NA output?).
brookslogan commented 2 years ago

Re time values: I think this is more about preventing users from accidentally doing something they don't want rather than being able to do what they want. For example, anything involving dplyr::lag or rolling-sum/apply functions are convenient but probably buggy --- they don't behave well when there are gaps, and have the issues with the output time_values potentially chopping off desired data. Maybe we could at least detect these types of functions and warn if the user uses them. But then we probably also would want an option to silence these warnings or to specify what lag windows they actually intend to use.

brookslogan commented 2 years ago

Re index_by: I think this is not the exact semantics we want for epi_slide/slide_by; it is essentially grouping by (or adding to the existing non-index groups) the time index or some coarsening thereof. We'd want something like that for aggregating daily data into epiweek resolution, but not for computing shifts or rolling-window operations. But we can try to mimic the interface! Haven't thought much, but if we already have our own custom class(es), then making this like a modified group_by/index_by does seem more natural.