Open seabbs opened 1 year ago
Perhaps data_opts()
would be better if it's to have preprocessing etc.? I think "reported cases" is a bit too specific a name for the argument anyway (could be hospitalisations etc.).
A downside of this approach is that, if we want a common data_opts()
for all the three models contained in the package (which I think we do) then we can't use this to document requirements of the data (estimate_truncation
and estimate_secondary
: data frames with specific columns; estimate_truncation
: list) and will instead have to point to the specific function documentation.
An alternative solution would be to keep the reported_cases
/primary
/obs
argument(s) as first argument and add preprocessing_opts()
for filtering zeros etc. This would also have a potential benefit of being slightly easier to integrate in a pipe.
Trying to separate the steps involved in closing the issue if going with the approach suggested in the previous comment:
filter_leading_zeros
and zero_threshold
to estimate_secondary
and estimate_truncation
as explicit argumentspreprocessing_opts()
(or a shorter name?) function and call this from all modelsdata
Once this issue is closed we can then make the data column argument more flexible, addressing https://github.com/epiforecasts/EpiNow2/issues/505
- [ ] rename first argument of all three models to
data
Makes sense to me. Would bullet 4 require us to deprecate the current names?
Yes, I think so.
I can work on this if it's good to go.
Yes, it should be good to go - I think ideally addressing each of the bullet points in sequence using separate PRs.
I might be wrong but the proposed preprocessing_opts()
function is just create_clean_reported_cases()
right?.
I might be wrong but the proposed
preprocessing_opts()
function is justcreate_clean_reported_cases()
right?.
Good question. There are two options:
preprocessing_opts()
function that collates the arguments (as with the other ..._opts()
functions) which are passed to create_clean_reported_cases()
within the relevant functions. This would add a preprocessing
(or other?) argument to those functions.Option (2) is might be the easier one as it doesn't require updating any internal logic (e.g. where some internal processing is done before calling create_clean_reported_cases()
. It's perhaps also more in line with the existing function/argument structure. But I'm open to arguments for (1).
If going down that route we probably want to rename it to create_clean_data()
or preprocess_data()
in line with the other changes triggered by this issue.
- Is definitely more in line with other bits of the logic but I much prefer the idea of making the data preprocessing apparent and accessible to people (i.e 1.)
Two options seem to be apparent here:
data
argument can only accept objects run through preprocessing_opts()
raw_data %>% preprocessing_opts() %>% estimate_*()
.If going down that route we probably want to rename it to
create_clean_data()
orpreprocess_data()
in line with the other changes triggered by this issue.
create_clean_reported_cases()
into:
add_horizon()
- completes the dates and adds the horizon
windowadd_breakpoints()
filter_leading_zeros()
handle_zero_threshold()
In particular, the zero_threshold
cleaning step (https://github.com/epiforecasts/EpiNow2/blob/main/R/create.R#L60-L77) deserves being in a separate function.
I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.
I really like those suggestions. I think they might take a bit more thinking though so would suggest to push them into a future release in order to get 1.5.0 out asap.
Agreed. They're not user-facing so not a priority for this release.
The suggestions here could also help address #547 and #640
Proposed set up for data handling / cleaning would be to
horizon
argument from estimate_infections
etc.forecast
argument with function forecast_opts
with arguments horizon
, frequency
, and maybe future
(from rt_opts()
) - internally this will add future dates to the passed data frame with NA
values at a given frequency (which will ensure it works with the set up for accumulating incidence if desired)clean_data()
or similar function that does the other steps mentioned above, i.e. filtering leading zeroes and handling the zero threshold. This would be in line with https://github.com/epiforecasts/EpiNow2/issues/346#issuecomment-2082406912In principle the horizon stuff could also be separate in an add_horizon()
function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future
argument in rt_opts()
. The present could be inferred as the last known data point.
We probably still want a data_opts()
function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see https://github.com/epiforecasts/EpiNow2/issues/505
an add_horizon() fun
I like this idea a lot
@jamesmbaazam what do you think?
I like all the points.
In principle the horizon stuff could also be separate in an add_horizon() function but perhaps making settings about forecasts a data manipulating function is confusing? At the same time there's a certain elegance to it, i.e. future dates are just another form of missing data and the model doesn't actually need to know when the present is except for plotting and the future argument in rt_opts(). The present could be inferred as the last known data point.
I think the forecasting stuff should probably be done internally using forecast_opts()
rather than being a preprocessing step. It removes one step in the model setup.
Maybe, frequency
-> interval
?? (Unless I don't understand what frequency
means here).
add a
clean_data()
or similar function that does the other steps mentioned above, i.e. filtering leading zeroes and handling the zero threshold. This would be in line with https://github.com/epiforecasts/EpiNow2/issues/346#issuecomment-2082406912
Maybe, more specifically, handle_zero_cases()
?
We probably still want a
data_opts()
function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see #505
This seems to suit the use case for linelist but not suggesting we take it on as a dependency.
We probably still want a data_opts() function where users can specify which column in the data frame indicates the date and which the data to fit to (if not the standard names), see
I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.
Or is the proposal you then track their column name through the code base? Again I am a bit sceptical of the value added here to most users.
I'm generally pretty sceptical of this as an idea. If users are going to get useful things out of the package they likely have thoughts on how to change the name of a variable already.
I generally agree, just flagging that if ever going ahead with the suggestion in https://github.com/epiforecasts/EpiNow2/discussions/371#discussion-4891223 we might need a way to point out which column in a passed data frame corresponds to which observation model (though that'll likely look different from a data_opts()
option).
Somewhere we might also want users to specify what the data represent, i.e. https://github.com/epiforecasts/EpiNow2/issues/505
It seems we broadly have two options here:
obs |>
rename(value = confirm) |>
filter_leading_zeroes() |>
apply_zero_threshold(threshold = 10) |>
add_horizon(n = 3, frequency = 7) |>
estimate_infections()
or
obs |>
estimate_infections(
data = data_opts(col = "confirm", zero_threshold = 10),
forecast = forecast_opts(n = 3, frequency = 7)
)
I think the second cannot be piped that way as by definition, the pipe passes the data to the first argument. So it will rather be
estimate_infections(
data = data_opts(obs, col = "confirm", zero_threshold = 10),
forecast = forecast_opts(n = 3, frequency = 7)
)
Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.
If the case data set becomes part of data_opts
yes, but I don't think it has to be (though of course it might make sense for it to do so).
Until now, users have not had to do any data cleaning using exported functions here, so I am more inclined to vote for the second option.
That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections()
they have no way of accessing that information.
If the case data set becomes part of
data_opts
yes, but I don't think it has to be (though of course it might make sense for it to do so).
We already have a data
argument so if we are going by the original second option, it will have to be named something else, then the piping will work.
That is a valid point. A counterpoint would be that with the explicit functions users can actually see what happens (e.g. which values get filtered out / changed, or which dates will be forecast) whereas if it's all internal to estimate_infections() they have no way of accessing that information.
Valid point. Alternatively, we could improve the logging and messaging in the current setup to report all of this.
Currently, all inputs except for
reported_cases
are managed via a helper function (i.edelay_opts()
) that enables specifying options. It would make sense to standardizedreported_cases
to be in line with this approach. This would also make sense as currently there are several data processing steps (i.e to deal with missing dates) that occur inside package code that are not well surfaced to the user. Putting these in areported_cases_opts()
function would help resolve this and bring the user closer to the data being used for model fitting which is generally a good idea.