This looks really nice from reading the pitch but...
My high-level concern is the proposed interaction between missingness and accumulation. The current design says that something is accumulated until an observation occurs, which seems potentially error prone to me. Why did you discount the alternative where these are separable? The reason I find it concerning is that it seems very easy to make a mistake or to get unexpected behaviour. My impression is that having missing data in your weekly example would look like this:
date | confirm | accumulate |
---|---|---|
2024-10-18 | NA | 1 |
2024-10-19 | NA | 1 |
2024-10-20 | NA | 1 |
2024-10-21 | NA | 1 |
2024-10-22 | NA | 1 |
2024-10-23 | NA | 1 |
2024-10-24 | NA | 0 |
2024-10-25 | NA | 1 |
2024-10-26 | NA | 1 |
2024-10-27 | NA | 1 |
2024-10-28 | NA | 1 |
2024-10-29 | NA | 1 |
2024-10-30 | NA | 1 |
2024-10-31 | 17 | 1 |
2024-11-01 | NA | 1 |
2024-11-02 | NA | 1 |
2024-11-03 | NA | 1 |
2024-11-04 | NA | 1 |
2024-11-05 | NA | 1 |
2024-11-06 | NA | 1 |
2024-11-07 | 11 | 1 |
(see the 24th) but would this reset the accumulation counter as you might expect? From the outline the mechanism to reset accumulation is implicit and I think occurs when data is present?
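To make this concrete, here is one possible reading of the current rule as a rough R sketch (entirely my own illustration: the `modelled` values are placeholders and the weighting of the modelled value by `accumulate` is my guess at the numeric semantics):

```r
# one possible reading of "accumulated until an observation occurs"
dat <- data.frame(
  date       = seq(as.Date("2024-10-18"), as.Date("2024-10-31"), by = "day"),
  confirm    = c(rep(NA, 13), 17),
  accumulate = c(rep(1, 6), 0, rep(1, 7))
)
modelled <- rep(1.5, nrow(dat)) # stand-in for the daily latent values

accumulated <- 0
for (i in seq_len(nrow(dat))) {
  # each day's modelled value is added, weighted by accumulate (0 = skip) ...
  accumulated <- accumulated + dat$accumulate[i] * modelled[i]
  if (!is.na(dat$confirm[i])) {
    # ... but the reset is implicit and only happens when data is present
    message(dat$date[i], ": compare ", dat$confirm[i],
            " with accumulated ", round(accumulated, 1))
    accumulated <- 0
  }
}
```

Under this reading the 0 on the 24th only drops that day's contribution; it never resets the accumulator, which is the kind of surprise I'm worried about.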
A more restrictive separable approach would be:
date | confirm | accumulate |
---|---|---|
2024-10-23 | NA | 1 |
2024-10-24 | 10 | 0 |
2024-10-26 | NA | 1 |
2024-10-27 | NA | 1 |
2024-10-28 | 17 | 0 |
The likelihood is evaluated on two days, 24 October and 28 October. On 24 October the data (10) is compared to (modelled value on 23 October) + (modelled value on 24 October). On 28 October the data (17) is compared to (modelled value on 26 October) + (modelled value on 27 October) + (modelled value on 28 October).
Whilst removing the interaction, this design switches to `accumulate` being logical. It loses the partial ascertainment and the ability to skip accumulation (i.e. adding a delay effect as you did on the 27th). I am unclear as to the utility of those but happy to hear? You could relax this to make `accumulate` numeric and so allow for partial ascertainment.
> Whilst removing the interaction, this design switches to `accumulate` being logical. It loses the partial ascertainment and the ability to skip accumulation (i.e. adding a delay effect as you did on the 27th). I am unclear as to the utility of those but happy to hear? You could relax this to make `accumulate` numeric and so allow for partial ascertainment.
This is where I started but I think in this case we couldn't support partial ascertainment on the day that we reset (as we'd need `accumulate == 0`). But perhaps that's clearer as a separate feature/column anyway, and `accumulate` should just be binary.
Okay, makes sense. Whether we want to support partial ascertainment is the big question I have here. I am unclear how it interacts with other places, so the answer is maybe for me? It could let you specify time-varying ascertainment kind of for free if you were careful, which seems appealing.
If we do, is there a way of addressing my concern and example within the framework you've outlined? If we can do that, it just leaves the potentially tricky interaction, but with good helpers that won't be so bad.
I think these things are perhaps best separated. We could easily support partial ascertainment here but this could be a separate numeric column. `accumulate` could then be TRUE/FALSE, with TRUE accumulating and FALSE resetting. Whether the likelihood is evaluated then depends only on whether the data is `NA` or not, i.e. it's evaluated at all non-missing data points.
> We could easily support partial ascertainment here but this could be a separate numeric column
Yes agree
I agree with making `accumulate` binary as it's easier to interpret and without nuance.
If anyone has an idea for a better name for this column than `accumulate`, now is the time to suggest it. Technically, if `accumulate == FALSE` the value is still accumulated (i.e. added to what was accumulated previously), just that it is reset to zero after the accumulated value has been used in the likelihood. We could have a `reset` column instead of `accumulate` with the opposite logic, but I don't think that's clearer.
If people are happy with the latest version (now updated in the PR comment) please approve (or request more changes) and we can create an issue to implement.
Following discussion in #547, here's a proposed design for the future accumulation / missingness interface. As proposed, this would also accommodate changes in reporting. As default behaviour would be unchanged, it would require fairly minimal deprecation. All comments welcome.
Supporting missing data
We want to support reporting patterns that include accumulation (i.e. batch reporting of data from multiple dates, for example weekly) and missingness (dates which are lacking reports) in incidence and prevalence data.
Proposed interface
Missing data
Any dates between the minimum date and maximum date in the data that are either absent, or present with an `NA` value (in the column currently called `confirm`), are interpreted as missing and ignored in the likelihood. All other data points are used in the likelihood. This matches the current default behaviour, introduced in version 1.5.0.
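As a minimal sketch of this rule (my own illustration with placeholder numbers, not package code), absent dates behave the same as explicit `NA` rows and neither contributes to the likelihood:

```r
# which data points enter the likelihood: only rows with a non-missing confirm
# (here the 25th is absent entirely and the 23rd is an explicit NA)
dat <- data.frame(
  date    = as.Date(c("2024-10-23", "2024-10-24", "2024-10-26")),
  confirm = c(NA, 10, 12)
)
dat[!is.na(dat$confirm), ]
#>         date confirm
#> 2 2024-10-24      10
#> 3 2024-10-26      12
```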
Accumulation

If instead modelled values on these days should be accumulated onto the next reporting date, the passed `data.frame` must have an additional logical column, `accumulate`. If `accumulate` is TRUE then the modelled value of the observed latent variable on that day is added to any existing accumulation and stored for later observation. If `accumulate` is FALSE the modelled value is added to any stored accumulated variables before potentially being used in the likelihood on that day (if not `NA`). Subsequently the stored accumulated variable is reset to zero.

Example
date | confirm | accumulate |
---|---|---|
2024-10-23 | NA | TRUE |
2024-10-24 | 10 | FALSE |
2024-10-26 | NA | TRUE |
2024-10-27 | NA | TRUE |
2024-10-28 | 17 | FALSE |

The likelihood is evaluated on two days, 24 October and 28 October. On 24 October the data (10) is compared to (modelled value on 23 October) + (modelled value on 24 October). On 28 October the data (17) is compared to (modelled value on 26 October) + (modelled value on 27 October) + (modelled value on 28 October).
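As a rough sketch of this rule applied to the example (illustration only; the `modelled` values below are made-up placeholders for the daily latent series, not output of the package):

```r
obs <- data.frame(
  date       = as.Date(c("2024-10-23", "2024-10-24", "2024-10-26",
                         "2024-10-27", "2024-10-28")),
  confirm    = c(NA, 10, NA, NA, 17),
  accumulate = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)
modelled <- c(4.1, 5.3, 4.8, 5.0, 5.6) # placeholder daily modelled values

stored <- 0
for (i in seq_len(nrow(obs))) {
  # the day's modelled value is always added to the stored accumulation
  stored <- stored + modelled[i]
  if (!obs$accumulate[i]) {
    if (!is.na(obs$confirm[i])) {
      # likelihood contribution: data vs accumulated modelled value
      message(obs$date[i], ": compare ", obs$confirm[i],
              " with ", round(stored, 1))
    }
    stored <- 0 # reset after a non-accumulating day
  }
}
#> 2024-10-24: compare 10 with 9.4
#> 2024-10-28: compare 17 with 15.4
```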
Helper functions
A helper function, `fill_missing_dates()`, can be used to convert weekly data to the required format, e.g. with `fill_missing_dates(missing = "accumulate", initial = 7)`.
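Since the helper is only proposed here, a very rough sketch of what it might do (the argument semantics are my assumptions, in particular `initial` being the number of days accumulated into the first data point; this is not the proposed implementation):

```r
fill_missing_dates <- function(data, missing = "accumulate", initial = 1) {
  stopifnot(missing == "accumulate") # only this behaviour is sketched here
  # extend the date range so the first report accumulates `initial` days
  all_dates <- seq(min(data$date) - (initial - 1), max(data$date), by = "day")
  filled <- merge(data.frame(date = all_dates), data, all.x = TRUE)
  # reporting days (non-missing confirm) reset; everything else accumulates
  filled$accumulate <- is.na(filled$confirm)
  filled
}

# example: weekly totals reported on two consecutive Thursdays
weekly <- data.frame(
  date    = as.Date(c("2024-10-31", "2024-11-07")),
  confirm = c(17, 11)
)
fill_missing_dates(weekly, missing = "accumulate", initial = 7)
```

This yields daily rows from 2024-10-25 to 2024-11-07 with `NA` for `confirm` and `accumulate == TRUE` everywhere except the two reporting dates.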