What signature should a forecaster object have?

dshemetov commented 2 years ago

Making this thread to address the broader question opened in #3 about the function signature of a forecaster object. cc @dajmcdon @brookslogan @ryantibs

Current proposals:

A combination of features, response variables, a few index args exposed, and the rest bundled into a control arg
```
prob_arx <- function(x, y, geo_value, time_value, args = prob_arx_args())
```
A plain data interface and the rest of the arguments bundled into a control arg (see Logan's full comment here)
```
function(df, args = prob_arx_args())
```

Two dimensions of variation here:

do forecasters need to be aware of the index variables, if they are already being given axis-stripped training/testing matrices?
does everything except data go into the control arg? what columns does the forecaster expect from df?

One possibility (setting aside the question of whether indexing variables need to be arguments) is that the forecaster signature in (1) could be a function inside the forecaster in signature (2). (2) would do additional parsing of the dataframe df and produce training/testing matrices x, y and hand them off to something that looks like (1).

A question to pump our design intuition: what do we want a forecaster call pattern to look like? For example, following along with the vignettes in epiprocess, we can imagine having a call that looks something like this:

library(covidcast)
library(epiprocess)
library(data.table)
library(dplyr)

forecast_date <- "2021-12-01"
dv <- covidcast_signal("doctor-visits",
                       "smoothed_adj_cli",
                       start_day = "2020-06-01",
                       end_day = "2021-12-01",
                       issues = c("2020-06-01", "2021-12-01"),
                       geo_type = "state",
                       geo_values = c("ca", "fl")) %>% 
           select(geo_value, time_value, version = issue, percent_cli = value) %>%
           as_epi_df()

forecasts <- dv %>% group_by(geo_value) %>%
                     map(some_forecaster, forecast_date = forecast_date, args = prob_arx_args())

This assumes some_forecaster to have a similar signature as when we are using evalcast, namely forecaster(df_list, forecast_date) (I'm not entirely sure if I can pipe the result of the group_by into map like that, but let's suppose that's correct).

That is one possibility. Another possibility is to have an evalcast::get_predictions-like function

forecasts <- get_predictions(dv, some_forecaster, forecast_dates = forecast_dates, args = prob_arx_args())

This function would a lot incorporate grouping and indexing logic, do argument validation, etc. There is logic here that we likely won't want to have a first-time user do on their own. Ideally, once they've put their data in the epi_df or epi_archive format, we can be confident the data has the columns and features we need to do model training and forecasting,

An implicit question here is: how much complexity do we place in the forecaster object and in the get_predictions routine. Does the forecaster assume a clean, indexed time-series, so it can focus on just doing the math/statistics or does it also need to intersect with the domain of validating data, etc.? Do we like the structure we used in evalcast::get_predictions or does that bake-in too much information in one spot and assume a one size fits all model for forecasters?

ryantibs commented 2 years ago

Thanks for writing this out! My thoughts are as follows.

Variables versus data frames

I think it makes sense to specify variables directly. I will make an argument for that. I would say the signature should be:

forecaster(x, y, key_vars, time_value, args)

where

x is a feature matrix (or data frame)
y is a response vector
key_vars is a matrix (or data frame) of key variables
time_value is a vector of time values
args is a list of control arguments

This is fairly easy (intuitive) for a user to understand. Having a model fitting function start with x and y is quite common across a lot of R packages, and everybody knows what x and y stand for here.

Meanwhile the next two arguments in the signature are required to specify the forecast task. The task is to produce one forecast for the latest time value for each unique combination of key variables . So that's how we explain their need in general.

We can allow a user to omit key_vars entirely, which is taken to mean that everything belongs to the same group, so only one forecast gets produced, corresponding to the latest time value. But in this case, we should issue a warning (or fail outright) is the latest time value is not unique.

Why not allow a single data frame to be passed?

If we allowed the user to specify just a single data frame df, then we'd have to tease out what columns of `df represent the response variable, the features, the key variables are, etc. This would either require them to rename the columns of their data frame (annoying on their part) in some way we could detect it automatically, or require a lot of unneccesary complications on our part to figure out what they were passing. It's much simpler in the way I described above, and I believe it covers all/most cases without being so flexible it's confusing.

An example

An example of how I could get this to work is as follows. Suppose I have x which is an epi_archive with variables cases, ctis_cli, and I want to fit a forecaster with cases as the target, and the other two as features. My epi_df has columns geo_value and time_value as usual. I want fit a model jointly over geo values. Then I use something like:

x %>% 
  epix_slide(fc = forecaster(x = ctis_cli, y = cases, 
                             key_vars = geo_value, 
                             time_value = time_value,
                             args = forecaster_args())

More complicated: let's suppose I had another column dv_cli that I also wanted to use as a feature, and I also had age_group that I also wanted to use as a key variable. Then I could do:

x %>% 
  epix_slide(fc = forecaster(x = tibble(ctis_cli, dv_cli), y = cases, 
                             key_vars = tibble(geo_value, age_group), 
                             time_value = time_value,
                             args = forecaster_args())

Do forecasters need to be aware of the time variable (and key variables, if there are any)?

Yes, as supported by my logic above. Otherwise there is no way to safely/generally infer the forecast task.

The only other way I could image doing it is to have the user provide x0, which is a matrix of features at which we want to produce predictions. But I think in general this would be more difficult to populate for a user than just to specify the time and key variables.

Does everything except data go into the control arg?

Yes. More precisely, in my proposal, everything except x, y, key_vars, time_value` goes into the args list.

map/get_predictions

I'm confused about this because we would always just be using epi_slide() or epix_slide() to produce predictions directly. Hence I don't see the need for anything like map() or a new get_predictions() function. (See also the next part.)

How much input validation to do?

My understanding was that, based on one of the recent zoom brainstorming sessions, we decided to modularize any forecaster internally into four parts:

pre-fitting
model fitting
model predictions
post-prediction

So certainly input validation should be provided as part of the functionality for part 1 here. Meaning, checking for time gaps, filling time gaps, etc., should all be offerings.

In my view a "forecaster" is a particular combination of 4 functions (one for each part). In any forecaster we provide, we should give the user flexibility through the forecaster args

A user could decide to turn off the pre-fitting (basically, just use the identity map for part 1) if they wanted to, by signaling this in the control args.

brookslogan commented 2 years ago

I think with the df input approach, one of the args would be a formula like y ~ x, or it could be the names "x" and "y", or just x and y which would be interpreted using quosures. (fable looks like it might be using something like the df + quosure approach.)

Defining the task gets a little messy when you think about variable transformations and combining different components. Consider a count -> rate transform:

Do this by transforming columns without renaming? Then hope that everything is built on taking in rates named as "counts". Mixing and matching transforming for one component & not-transforming for another component might require some hacks.
Do this by creating new columns with [different names]? [Task definitions] might need updated in multiple places, and making a toggle to try count training vs. rate, or log transform vs. not, might look messy.
Do this by create new columns with different names + manipulating some formula specifying the task? Toggling seems easier here, but maybe forecasters need to all have a df + formula/quosure structure.

Wanting a transform-invtransform toggle also shows that pre-fitting and post-prediction elements can be entangled; this might mean we don't want to express this as just a pipe chain from prefit to fit to predict to postpredict, but as various wrappers on top of a fitter&predictor.

ryantibs commented 2 years ago

Thanks for the comments! I think you raised two separate issues.

On the df versus variable names. Sure. If you wanted to say that they could pass a formula and a data frame, and key vars and time value, then that would be easy to accommodate. It would be equivalent to passing x, y, time value, and key vars. I personally don't like the combination of formulas and data frames at all, but I know they're common in R. So the signature could still be

forecaster(x, y, key_vars, time_value, args)

where x is meant to be a feature matrix and y a response vector, or x is a formula and y is a data frame. If that's too confusing, then we could do

forecaster(x, y, formula, df, key_vars, time_value, args)

and you could leave x and y missing if you wanted to use formula and `df.

On the other point about transformations: I didn't see this as an argument for or against any particular way of doing things (variables vs data frames), just a general point to be aware of---is that right? I think it's a decent point but I think I see a way to handle it. Let me try to get a working example and see if you like it.

brookslogan commented 2 years ago

The transformation thing might matter because it impacts whether or not someone writing a new forecaster HAS to use a formula interface or some other manipulable way of representing a forecasting task --- this might rule out the forecaster(x, y, ....) approach, unless x and y are turned into quosures.

Do forecasters have to have an object representing the their task?

Yes: then layering transformations on top is easier and combining forecasters less error-prone
No: then the structure of a forecast is simpler and writing one is simpler

dshemetov commented 2 years ago

Forecaster function signatures

EDIT: Updating this section to explain the confusion I had to others that might have a similar misconception.

I do agree @ryantibs that

forecaster(df, args)

is too ambiguous and is too permissive.

I was confused by the apparent contradiction between the statement that x and y are matrices/dataframes here

forecaster(x, y, key_vars, time_value, args)

but below they appear to be either variable names or something more ambiguous

x %>% 
  epix_slide(fc = forecaster(x = ctis_cli, y = cases, 
                             key_vars = geo_value, 
                             time_value = time_value,
                             args = forecaster_args())

To clarify, this is an instance of a tidy evaluation idiom in epix_slide, where ctis_cli and cases refer to variables inside the environment of the x epi_df. So forecaster still accepts what it claims to accept, while allowing more readability in the code above.

Is sliding universal for all forecasters?

A follow-up comment and relating to the question of why need get_predictions: I am not convinced that epi_slide can cover all the possible forecaster designs. Specifically, it assumes a past-dependency structure for the forecaster, such as (1) that any forecast is based on a fixed window of length n into the past, (2) that adjacent forecasts don't depend on each other (iterative or residual training forecasters). My claim is that once we go down the path of trying to adjust epi_slide to cover forecasters that don't fit these assumptions, epi_slide will begin to look like a get_predictions function. My evidence for this is that this issue already came up in a brainstorm recently:

A problem: epi_slide() and epix_slide(), while sliding, do not have access to the past computations! Can get around calling them twice. So it's more clumsy than a for loop in that way.

All that said, I do agree that sliding is a key computation pattern and covers AR-like models (which is all (most?) of our hospitalization forecasters). The points I want to make are: (a) a general get_predictions function may be inevitable, (b) our design should recognize the strengths of epi_slide and not try to cover cases it can't cover.

brookslogan commented 2 years ago

Function signatures

We probably need to investigate fable to see if we want to be compatible with it. Its approach looks something like (formula + args) --> , and (df, ) -> forecasts

Gaps in epi_slide

Interesting points. Current get_predictions doesn't seem to handle this either? Maybe inspires an epi_slide_reduce variant?

Kalman + backfill seems too complicated to me to think about for current development.

ryantibs commented 2 years ago

Quick note on epi_slide()/epix_slide(): setting n = Inf in either gives you access to all-past, which is an allowable value for that argument. So it's not really limited to a fixed-length past.

The lack of knowledge of "what happened" in past slide computations is real though. That said, I believe it should be possible to fix this, at least in epix_slide(). Should probably create an issue for this.

ryantibs commented 2 years ago

Also, just for my own notes: I think we cleared up Dmitry's confusion on the first point in his last message on our call. (It was about tidy evaluation which is nonstandard and takes some getting used to.)

And yes I think sliding is fairly universal, as per my last message, but we need to allow access to past computations to make it truly universal.

So at this point I don't see many unanswered/unresolved questions about the design. I think we arrived at this.

Common structure: forecaster(x, y, key_vars, time_value, args), where x = feature matrix, y = response vector.
If we want to allow them to use formulas and data frames, then we can detect and accept x = formula and y = data frame.
The formula allows for variable transformations on-the-fly, as long as they are applied to features only (not the response). That is one of its strengths.
A forecaster is a stack of four functions: pre-fitting, fitting, prediction, post-prediction, each of which come with options. We expose some of those options to the args list in the forecaster (not all; that would be too much probably).
A really advanced user could define their own “stack” manually to get the forecaster to be just the they want it.

dajmcdon commented 2 years ago

@ryantibs In your "An example" above, what is the internal behavior of the slide function? I'm guessing that in the first case (the simple one where all the 4 inputs are variable names), the forecaster actually "sees" vectors (maybe named?) while in the more complicated version, it "sees" tibbles for x and key_vars? Or does it see a tibble in all cases?

ryantibs commented 2 years ago

What the forecaster sees: yes, should be all vectors in the first case. And in the second case, tibbles for x and key_vars.

dajmcdon commented 2 years ago

Decision: sticking with

forecaster(x, y, key_vars, time_value, args)

cmu-delphi / epipredict