epiforecasts / EpiNow2

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow2/dev/
Other
114 stars 32 forks source link

reported_cases_opts() #346

Open seabbs opened 1 year ago

seabbs commented 1 year ago

Currently, all inputs except for reported_cases are managed via a helper function (i.e delay_opts()) that enables specifying options. It would make sense to standardized reported_cases to be in line with this approach. This would also make sense as currently there are several data processing steps (i.e to deal with missing dates) that occur inside package code that are not well surfaced to the user. Putting these in a reported_cases_opts() function would help resolve this and bring the user closer to the data being used for model fitting which is generally a good idea.

sbfnk commented 9 months ago

Perhaps data_opts() would be better if it's to have preprocessing etc.? I think "reported cases" is a bit too specific a name for the argument anyway (could be hospitalisations etc.).

sbfnk commented 8 months ago

A downside of this approach is that, if we want a common data_opts() for all the three models contained in the package (which I think we do) then we can't use this to document requirements of the data (estimate_truncation and estimate_secondary: data frames with specific columns; estimate_truncation: list) and will instead have to point to the specific function documentation.

An alternative solution would be to keep the reported_cases/primary/obs argument(s) as first argument and add preprocessing_opts() for filtering zeros etc. This would also have a potential benefit of being slightly easier to integrate in a pipe.

sbfnk commented 7 months ago

Trying to separate the steps involved in closing the issue if going with the approach suggested in the previous comment:

Once this issue is closed we can then make the data column argument more flexible, addressing https://github.com/epiforecasts/EpiNow2/issues/505

jamesmbaazam commented 7 months ago
  • [ ] rename first argument of all three models to data

Makes sense to me. Would bullet 4 require us to deprecate the current names?

sbfnk commented 7 months ago

Yes, I think so.

jamesmbaazam commented 7 months ago

I can work on this if it's good to go.

sbfnk commented 7 months ago

Yes, it should be good to go - I think ideally addressing each of the bullet points in sequence using separate PRs.

jamesmbaazam commented 6 months ago

I might be wrong but the proposed preprocessing_opts() function is just create_clean_reported_cases() right?.

sbfnk commented 6 months ago

I might be wrong but the proposed preprocessing_opts() function is just create_clean_reported_cases() right?.

Good question. There are two options:

  1. we create a preprocessing function that would create clean data that can then passed to the relevant functions
  2. we create a preprocessing_opts() function that collates the arguments (as with the other ..._opts() functions) which are passed to create_clean_reported_cases() within the relevant functions. This would add a preprocessing (or other?) argument to those functions.

Option (2) is might be the easier one as it doesn't require updating any internal logic (e.g. where some internal processing is done before calling create_clean_reported_cases(). It's perhaps also more in line with the existing function/argument structure. But I'm open to arguments for (1).

seabbs commented 6 months ago
  1. Is definitely more in line with other bits of the logic but I much prefer the idea of making the data preprocessing apparent and accessible to people (i.e 1.)
sbfnk commented 6 months ago

If going down that route we probably want to rename it to create_clean_data() or preprocess_data() in line with the other changes triggered by this issue.

jamesmbaazam commented 6 months ago
  1. Is definitely more in line with other bits of the logic but I much prefer the idea of making the data preprocessing apparent and accessible to people (i.e 1.)

Two options seem to be apparent here:

jamesmbaazam commented 6 months ago

If going down that route we probably want to rename it to create_clean_data() or preprocess_data() in line with the other changes triggered by this issue.

In particular, the zero_threshold cleaning step (https://github.com/epiforecasts/EpiNow2/blob/main/R/create.R#L60-L77) deserves being in a separate function.

sbfnk commented 6 months ago

I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.

jamesmbaazam commented 6 months ago

I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.

Agreed. They're not user-facing so not a priority for this release.

sbfnk commented 6 months ago

The suggestions here could also help address #547 and #640

sbfnk commented 5 months ago

Proposed set up for data handling / cleaning would be to

In principle the horizon stuff could also be separate in an add_horizon() function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future argument in rt_opts(). The present could be inferred as the last known data point.

sbfnk commented 5 months ago

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see https://github.com/epiforecasts/EpiNow2/issues/505

seabbs commented 5 months ago

an add_horizon() fun

I like this idea a lot

sbfnk commented 5 months ago

@jamesmbaazam what do you think?

jamesmbaazam commented 5 months ago

I like all the points.

In principle the horizon stuff could also be separate in an add_horizon() function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future argument in rt_opts(). The present could be inferred as the last known data point.

I think the forecasting stuff should probably be done internally using forecast_opts() rather than being a preprocessing step. It removes one step in the model setup.

Maybe, frequency -> interval?? (Unless I don't understand what frequency means here).

add a clean_data() or similar function that does the other steps mentioned above, i.e. filtering leading zeroes and handling the zero threshold. This would be in line with https://github.com/epiforecasts/EpiNow2/issues/346#issuecomment-2082406912

Maybe, more specifically, handle_zero_cases()?

jamesmbaazam commented 5 months ago

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see #505

This seems to suit the use case for linelist but not suggesting we take it on as a dependency.

seabbs commented 5 months ago

We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see

I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.

Or is the proposal you then track their column name through the code base? Again I am a bit sceptical of the value added here to most users.

sbfnk commented 5 months ago

I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.

I generally agree, just flagging that if ever going ahead with the suggestion in https://github.com/epiforecasts/EpiNow2/discussions/371#discussion-4891223 we might need a way to point out which column in a passed data frame corresponds to which observation model (though that'll likely look different from a data_opts() option).

Somewhere we might also want users to specify what the data represent, i.e. https://github.com/epiforecasts/EpiNow2/issues/505

sbfnk commented 3 months ago

It seems we broadly have two options here:

obs |>
  rename(value = confirm) |>
  filter_leading_zeroes() |>
  apply_zero_threshold(threshold = 10) |>
  add_horizon(n = 3, frequency = 7) |>
  estimate_infections()

or

obs |>
  estimate_infections(
    data = data_opts(col = "confirm", zero_threshold = 10),
    forecast = forecast_opts(n = 3, frequency = 7)
  )
jamesmbaazam commented 3 months ago

I think the second cannot be piped that way as by definition, the pipe passes the data to the first argument. So it will rather be

  estimate_infections(
    data = data_opts(obs, col = "confirm", zero_threshold = 10),
    forecast = forecast_opts(n = 3, frequency = 7)
  )

Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.

sbfnk commented 3 months ago

If the case data set becomes part of data_opts yes, but I don't think it has to be (though of course it might make sense for it to do so).

sbfnk commented 3 months ago

Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.

That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections() they have no way of accessing that information.

jamesmbaazam commented 3 months ago

If the case data set becomes part of data_opts yes, but I don't think it has to be (though of course it might make sense for it to do so).

We already have a data argument so if we are going by the original second option, it will have to be named something else, then the piping will work.

That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections() they have no way of accessing that information.

Valid point. Alternatively, we could improve the logging and messaging in the current setup to report all of this.