Fit to multiple time series and fold in secondary model

sbfnk commented 1 year ago

It might be nice to add functionality to have multiple observation time series assuming that they are all reported from the same infection proces with a delay and scale. This would potentially lead to more informed reproduction number / incidence estimation, but also enable doing what is currently done in estimate_secondary.stan within the existing model in estimate_infections.stan (e.g. interpreting the ratio of obs_scale between observed time series of cases and deaths as CFR) without treating the primary observation time series as fixed.

sbfnk commented 1 year ago

If https://github.com/epiforecasts/EpiNow2/pull/307 is eventually merged one could then perhaps also have covariates or a GP prior on e.g. the scaling factor.

sbfnk commented 1 year ago

Proposed interface for specifying multiple observation processes (building on #363):

Multiple delays can already be named with

delay = c(incubation_period = dist_spec(...), reporting_delay = dist_spec(...))

Observation processes could then similarly be concatenated

obs = obs_opts(c(hospitalisations = obs_spec(...), deaths = obs_spec(...)

where obs_spec includes some of the existing obs_opts, i.e. distribution family, week effect etc. In addition it would have something like delays = c("incubation_period", "reporting_delay") referring to the relevant specified delays by name. The scaling option should probably be incorporated into dist_spec so they can be appropriately chained. Ideally this would also include some of the existing observation model of estimate_secondary, e.g. incidence/cumulative. It could also include the option of not specifying a data set, in which case it wouldn't constrain the model but still generated (e.g. if one wanted a forecast of a secondary outcome from a known delay/scaling but where no observation exists, or explore scenarios).

seabbs commented 1 year ago

In addition it would have something like delays = c("incubation_period", "reporting_delay") referring to the relevant specified delays by name.

This part seems key. The alternative would be to make delay a list of lists but with each top level entry being named after the observation model. The upside of this is making it clear when the model is specified how the modules connect but the big obvious downside is some code would need to change. Essentially there would need to be a wrapper (or a list method) for stacking multiple distributions that have been combined with c and some new processing code to handle this.

I am happy with the proposal of identifying delays in the observation module. Some thought is needed about what this would mean for identifying when delays are constant and when they are not and how you could do this if keeping the current processing of delays (i.e a list with features that describe that list vs a list of lists that has these features).

It could also include the option of not specifying a data set, in which case it wouldn't constrain the model but still generated

This is a nice idea but I think should be pushed to its own issue/feature vs trying to implement in the first pass version?

sbfnk commented 1 year ago

This part seems key. The alternative would be to make delay a list of lists but with each top level entry being named after the observation model.

The issue I see with that is that we want uncertain delays to be constrained by all of the observations that are delayed by them, e.g. an incubation period would enter hospitalisations as well as deaths and should be constrained by both (and jointly enter the two), so it seems to me that some way of specifying delays separately and them referencing is needed.

sbfnk commented 1 year ago

This is a nice idea but I think should be pushed to its own issue/feature vs trying to implement in the first pass version?

I agree.

seabbs commented 1 year ago

The issue I see with that is that we want uncertain delays to be constrained by all of the observations that are delayed by them, e.g. an incubation period would enter hospitalisations as well as deaths and should be constrained by both (and jointly enter the two), so it seems to me that some way of specifying delays separately and them referencing is needed.

True but I suppose it depends on onerous repeating delays is vs having to define them in one place and link to them from another. As I suggested in the PR comment the other approach would be nest everything the observation model and so build a model made of observation modules each of which is self-contained. That woudl still lead to repetition of the delays but would have the benefit of I think being easy for the user to understand. Obviously that is a major rewrite/reformulation.

sbfnk commented 1 year ago

True but I suppose it depends on onerous repeating delays is vs having to define them in one place and link to them from another. As I suggested in the PR comment the other approach would be nest everything the observation model and so build a model made of observation modules each of which is self-contained. That woudl still lead to repetition of the delays but would have the benefit of I think being easy for the user to understand. Obviously that is a major rewrite/reformulation.

Is your suggestion that we have e.g.

incubation_period <- dist(mean = 3, sd = 2, mean_mean = 2, mean_sd = 1)
onset_to_admission <- dist(mean = 2, sd = 1)
onset_to_deah <- dist(mean = 5, sd = 3)

with observation modeuls defined as

obs = obs_opts(
    hospitalisations = obs_spec(delays = c(incubation_period, onset_to_admission), ...), 
    deaths = obs_spec(delays = c(incubation_period, onset_to_death), ...)
)

in which case, how would we ensure that the two observation modules are using the same incubation period?

seabbs commented 1 year ago

Yes that is what I had in mind (or something similar (potentially passing obs = c(hospitalisations = obs_opts(delays ...), deaths = obs_opts(delays..). where we can provide observation methods for c and print as you have done for the delays. I like this because then each observation specification is a complete model in some sense (aside from the generation of infections which we are assuming is shared across observations).

I think we just wouldn't because in the parallel case they might not want to. As we are treating them as independent anyway I am not sure this would be a big problem? Just a little inefficient.

This all gets more complicated when you think about where you want to apply scaling (i.e the CFR). Is it applied at infection (which is easiest) is it applied to cases post infection or elsewhere? All of those wash out when it is static but are different if the scaling varies over time.

sbfnk commented 1 year ago

I think we just wouldn't because in the parallel case they might not want to. As we are treating them as independent anyway I am not sure this would be a big problem? Just a little inefficient.

I think it would be a problem if we ended up e.g. with two different posterior distributions for incubation period parameters.

This all gets more complicated when you think about where you want to apply scaling (i.e the CFR). Is it applied at infection (which is easiest) is it applied to cases post infection or elsewhere?

In my mind scaling could be applied with each delay as well as at the time of observation.

seabbs commented 1 year ago

I think it would be a problem if we ended up e.g. with two different posterior distributions for incubation period parameters.

In the dist specification we could assign them unique ids (and let the user tag them manually if they want). This would let us track and combine in our preprocessing to prevent this.

seabbs commented 1 year ago

In my mind scaling could be applied with each delay as well as at the time of observation.

So this would preclude combining static delays and require a rewrite of how we do convolution if wanting time-varying scaling. If not wanting time-varying scaling then it doesn't matter where in the chain it goes (I believe).

epiforecasts / EpiNow2

Fit to multiple time series and fold in secondary model #313