epiforecasts / EpiNow2

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow2/dev/

Proposed interface for accumulation / missingness #839

Closed sbfnk closed 4 weeks ago

sbfnk commented 1 month ago

Following discussion in #547, here's a proposed design for the future accumulation / missingness interface. As proposed, this would also accommodate changes in reporting. As the default behaviour would be unchanged, it would require fairly minimal deprecation. All comments welcome.

Supporting missing data

We want to support reporting patterns that include accumulation (i.e. batch reporting of data from multiple dates, for example weekly) and missingness (dates which are lacking reports) in incidence and prevalence data.

Proposed interface

Missing data

Any date between the minimum and maximum date in the data that is either absent, or present with an NA value (in the count column, currently called confirm), is interpreted as missing and ignored in the likelihood. All other data points are used in the likelihood. This matches the current default behaviour, introduced in version 1.5.0.

Accumulation

If instead modelled values on these days should be accumulated onto the next reporting date, the passed data.frame must have an additional logical column, accumulate. If accumulate is TRUE, the modelled value of the observed latent variable on that day is added to any existing accumulation and stored for later observation. If accumulate is FALSE, the modelled value is added to any stored accumulated value before potentially being used in the likelihood on that day (if the data point is not NA); the stored accumulated value is then reset to zero.

Example

date confirm accumulate
2024-10-23 NA TRUE
2024-10-24 10 FALSE
2024-10-26 NA TRUE
2024-10-27 NA TRUE
2024-10-28 17 FALSE

The likelihood is evaluated on two days, 24 October and 28 October. On 24 October the data (10) is compared to (modelled value on 23 October) + (modelled value on 24 October). On 28 October the data (17) is compared to (modelled value on 26 October) + (modelled value on 27 October) + (modelled value on 28 October).
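The semantics above can be sketched in Python (a minimal illustration only: the actual interface is an R data.frame passed to EpiNow2, and the function and variable names here are hypothetical):

```python
def accumulate_observations(rows):
    """Return (date, observed, accumulated_model_value) tuples for the
    days on which the likelihood would be evaluated.

    `rows` is a list of (date, confirm, accumulate, model_value) tuples,
    where `confirm` is the observed count (None standing in for NA) and
    `model_value` is an assumed modelled latent value on that day."""
    stored = 0.0
    evaluations = []
    for day, confirm, accumulate, model_value in rows:
        if accumulate:
            # accumulate the modelled value for a later reporting date
            stored += model_value
        else:
            # add the stored accumulation to today's modelled value
            total = stored + model_value
            if confirm is not None:
                # likelihood evaluated against the accumulated total
                evaluations.append((day, confirm, total))
            # reset the accumulator after (potential) observation
            stored = 0.0
    return evaluations

# the example table above, with made-up modelled values 4..8
rows = [
    ("2024-10-23", None, True, 4.0),
    ("2024-10-24", 10, False, 5.0),
    ("2024-10-26", None, True, 6.0),
    ("2024-10-27", None, True, 7.0),
    ("2024-10-28", 17, False, 8.0),
]
# evaluated on 24 Oct (10 vs 4 + 5) and 28 Oct (17 vs 6 + 7 + 8)
```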

Helper functions

A helper function, fill_missing_dates(), can be used to convert weekly data to the required format, i.e.

date confirm
2024-10-24 10
2024-10-31 17
2024-11-07 11

can be converted with fill_missing_dates(missing = "accumulate", initial = 7) to

date confirm accumulate
2024-10-18 NA TRUE
2024-10-19 NA TRUE
2024-10-20 NA TRUE
2024-10-21 NA TRUE
2024-10-22 NA TRUE
2024-10-23 NA TRUE
2024-10-24 10 FALSE
2024-10-25 NA TRUE
2024-10-26 NA TRUE
2024-10-27 NA TRUE
2024-10-28 NA TRUE
2024-10-29 NA TRUE
2024-10-30 NA TRUE
2024-10-31 17 FALSE
2024-11-01 NA TRUE
2024-11-02 NA TRUE
2024-11-03 NA TRUE
2024-11-04 NA TRUE
2024-11-05 NA TRUE
2024-11-06 NA TRUE
2024-11-07 11 FALSE

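The described behaviour of the helper can be sketched as follows (a hypothetical Python mock-up of the proposed R helper, not the EpiNow2 implementation; the argument names mirror the call above):

```python
from datetime import date, timedelta

def fill_missing_dates(data, missing="accumulate", initial=7):
    """Expand (date, confirm) pairs to daily rows of
    (date, confirm, accumulate), filling missing days, plus
    `initial` - 1 days before the first observation, with
    confirm=None and accumulate=True when missing="accumulate"."""
    observed = dict(data)
    first = min(observed) - timedelta(days=initial - 1)
    last = max(observed)
    out, day = [], first
    while day <= last:
        confirm = observed.get(day)
        # mark filled-in days for accumulation onto the next report
        out.append((day, confirm,
                    confirm is None if missing == "accumulate" else False))
        day += timedelta(days=1)
    return out

weekly = [(date(2024, 10, 24), 10),
          (date(2024, 10, 31), 17),
          (date(2024, 11, 7), 11)]
filled = fill_missing_dates(weekly, missing="accumulate", initial=7)
# 21 daily rows from 2024-10-18 to 2024-11-07, as in the table above
```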
seabbs commented 1 month ago

This looks really nice from reading the pitch but...

My high-level concern is the proposed interaction between missingness and accumulation. The current design says that something is accumulated until an observation occurs, which seems potentially error-prone to me. Why did you discount the alternative where these are separable? The reason I find it concerning is that it seems very easy to make a mistake or to get unexpected behaviour. My impression is that having missing data in your weekly example would look like this:

date confirm accumulate
2024-10-18 NA 1
2024-10-19 NA 1
2024-10-20 NA 1
2024-10-21 NA 1
2024-10-22 NA 1
2024-10-23 NA 1
2024-10-24 NA 0
2024-10-25 NA 1
2024-10-26 NA 1
2024-10-27 NA 1
2024-10-28 NA 1
2024-10-29 NA 1
2024-10-30 NA 1
2024-10-31 17 1
2024-11-01 NA 1
2024-11-02 NA 1
2024-11-03 NA 1
2024-11-04 NA 1
2024-11-05 NA 1
2024-11-06 NA 1
2024-11-07 11 1

(see the 24th) but would this reset the accumulation counter as you might expect? From the outline, the mechanism to reset accumulation is implicit and, I think, occurs when data is present?

A more restrictive separable approach would be:

Example

date confirm accumulate
2024-10-23 NA 1
2024-10-24 10 0
2024-10-26 NA 1
2024-10-27 NA 1
2024-10-28 17 0

The likelihood is evaluated on two days, 24 October and 28 October. On 24 October the data (10) is compared to (modelled value on 23 October) + (modelled value on 24 October). On 28 October the data (17) is compared to (modelled value on 26 October) + (modelled value on 27 October) + (modelled value on 28 October).

Whilst removing the interaction, this design switches to accumulate being logical. It loses partial ascertainment and the ability to skip accumulation (i.e. adding a delay effect, as you did on the 27th). I am unclear as to the utility of those, but happy to hear? You could relax this by making accumulate numeric, and so allow for partial ascertainment.

sbfnk commented 1 month ago

Whilst removing the interaction, this design switches to accumulate being logical. It loses partial ascertainment and the ability to skip accumulation (i.e. adding a delay effect, as you did on the 27th). I am unclear as to the utility of those, but happy to hear? You could relax this by making accumulate numeric, and so allow for partial ascertainment.

This is where I started but I think in this case we couldn't support partial ascertainment on the day that we reset (as we'd need accumulate == 0). But perhaps that's clearer as a separate feature/column anyway, and accumulate should just be binary.

seabbs commented 1 month ago

Okay, makes sense. Whether we want to support partial ascertainment is the big question I have here. I am unclear how it interacts with other places, so the answer is maybe for me? It could let you specify time-varying ascertainment kind of for free if you were careful, which seems appealing.

If we do, is there a way of addressing my concern and example within the framework you've outlined? If we can do that, it just leaves the potentially tricky interaction, but with good helpers that won't be so bad.

sbfnk commented 1 month ago

I think these things are perhaps best separated. We could easily support partial ascertainment here, but this could be a separate numeric column. Accumulate could then be TRUE/FALSE, with TRUE accumulating and FALSE resetting. Whether the likelihood is evaluated then depends only on whether the data is NA or not, i.e. it's evaluated at all non-missing data points.
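This final variant, where accumulate only controls carrying forward vs resetting and NA-ness alone decides whether the likelihood is evaluated, can be sketched like so (illustrative Python only; names are hypothetical and the modelled values are made up):

```python
def evaluate_days(rows):
    """Sketch of the separated semantics: accumulate=True adds the
    day's modelled value to a running total; accumulate=False resets
    the total after that day; the likelihood is evaluated at every
    day whose observation (`confirm`) is not None (i.e. not NA),
    against the running total including that day."""
    stored = 0.0
    evaluations = []
    for day, confirm, accumulate, model_value in rows:
        total = stored + model_value
        if confirm is not None:
            # evaluated at all non-missing data points
            evaluations.append((day, confirm, total))
        # TRUE carries the total forward, FALSE resets it
        stored = total if accumulate else 0.0
    return evaluations

rows2 = [
    ("2024-10-23", None, True, 4.0),
    ("2024-10-24", 10, False, 5.0),
    ("2024-10-26", None, True, 6.0),
    ("2024-10-27", None, True, 7.0),
    ("2024-10-28", 17, False, 8.0),
]
```

Note that under these semantics a row with confirm NA and accumulate FALSE discards that day's accumulation without evaluating the likelihood, which makes the reset explicit rather than implicit.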

seabbs commented 1 month ago

We could easily support partial ascertainment here, but this could be a separate numeric column

Yes agree

jamesmbaazam commented 1 month ago

I agree with making accumulate binary as it's easier to interpret and without nuance.

sbfnk commented 4 weeks ago

If anyone has an idea for a better name for this column than accumulate, now is the time to suggest it. Technically, if accumulate == FALSE the value is still accumulated (i.e. added to what was previously stored); it's just that the stored value is reset to zero after the accumulated value is used in the likelihood. We could instead have a reset column with the opposite logic, but I don't think that's clearer.

sbfnk commented 4 weeks ago

If people are happy with the latest version (now updated in the PR comment) please approve (or request more changes) and we can create an issue to implement.