epiforecasts / EpiNow2

Estimate Realtime Case Counts and Time-varying Epidemiological Parameters
https://epiforecasts.io/EpiNow2/dev/

Distinguish NA (missing) from NA (accumulated) #547

Open sbfnk opened 7 months ago

sbfnk commented 7 months ago

Enabling this would, I think, require some sort of indicator that marks dates explicitly as missing vs. NA.

I think this would be my preferred option as it would be more general but I also think it can be addressed in its own review as it would be a superset of this PR.

My thought on how that would work is to have a new variable (accumulate) that indicates which days should be summed.

Originally posted by @seabbs in https://github.com/epiforecasts/EpiNow2/pull/534#pullrequestreview-1880071190
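To make the idea of an accumulate variable concrete, here is a minimal sketch of how such a flag could drive summation. It is not the EpiNow2 implementation: the function name `accumulate_latent` and the convention that a flagged day is carried forward into the next unflagged day are assumptions made purely for illustration.

```python
def accumulate_latent(latent, accumulate):
    """Sum daily values forward into the next day whose flag is False.

    Hypothetical helper: `latent` holds modelled daily values and
    `accumulate` holds one boolean per day. A True flag means "nothing is
    reported on this day; carry its value into the next reported day".
    """
    if len(latent) != len(accumulate):
        raise ValueError("latent and accumulate must have the same length")
    reported = []
    running = 0
    for value, carry in zip(latent, accumulate):
        running += value
        if carry:
            reported.append(None)  # no report on this day
        else:
            reported.append(running)  # report covers all carried days
            running = 0
    return reported
```

Under this assumed convention, weekly data would be six flagged days followed by one unflagged day, and a genuinely missing report would stay a separate concept that the flag never touches.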

sbfnk commented 7 months ago

My thought on how that would work is to have a new variable (accumulate) that indicates which days should be summed.

Another option would be to distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs, which might make preprocessing easier, though it would potentially also be easier to inadvertently get wrong.
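As a rough illustration of the explicit vs. implicit distinction, the sketch below labels every day in the observed date range. The function name `classify_dates` and the three labels are hypothetical, and `None` stands in for NA.

```python
from datetime import date, timedelta


def classify_dates(rows):
    """Label each day between the first and last date in `rows`.

    `rows` is a list of (date, value) pairs; a value of None plays the
    role of NA. Labels (all assumed names): "observed" for a reported
    value, "missing" for an explicit NA, "accumulate" for an absent date.
    """
    by_date = dict(rows)
    day, end = min(by_date), max(by_date)
    status = {}
    while day <= end:
        if day not in by_date:
            status[day] = "accumulate"  # implicit NA: date absent
        elif by_date[day] is None:
            status[day] = "missing"  # explicit NA: date present, no value
        else:
            status[day] = "observed"
        day += timedelta(days=1)
    return status
```

The risk mentioned above shows up directly here: a row dropped by accident during preprocessing is indistinguishable from a day that was meant to be accumulated.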

seabbs commented 7 months ago

Yeah, potentially, but I also think that could be a bit dangerous. I would instead suggest making a helper function that maps from that structure to the less dangerous explicit version for those that clearly want that.
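A helper of that kind might look roughly like the sketch below. Everything in it is an assumption for illustration: the name `make_explicit`, the record layout with `date`, `confirm` and `accumulate` keys, and the wording of the warning.

```python
import warnings
from datetime import date, timedelta


def make_explicit(rows):
    """Map implicit (absent-date) NAs to an explicit accumulate flag.

    `rows` is a list of (date, value) pairs with None standing in for NA.
    Dates already present keep their value, NA included, and get
    accumulate=False; absent dates are added with accumulate=True.
    """
    by_date = dict(rows)
    day, end = min(by_date), max(by_date)
    out, added = [], 0
    while day <= end:
        if day in by_date:
            out.append({"date": day, "confirm": by_date[day], "accumulate": False})
        else:
            out.append({"date": day, "confirm": None, "accumulate": True})
            added += 1
        day += timedelta(days=1)
    if added:
        # Say what was done so an accidental gap does not pass silently.
        warnings.warn(f"Added {added} absent date(s) flagged accumulate=True.")
    return out
```

With the conversion isolated in one place, the rest of the pipeline would only ever see the explicit form, and users who prefer the two-column input opt in by calling the helper themselves.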

seabbs commented 7 months ago

and if going that way I'd suggest that becomes a dependent issue

sbfnk commented 3 days ago

With appropriate warning messages as suggested in #771, I think this is the best option, as it can take all information from a 2-column data frame as before:

distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs

seabbs commented 3 days ago

I don't think I agree. I think there should be one way of handling missing data (as missing) and it can throw a warning if creating missing dates saying what it is doing.

I think overloading NAs like we have done for accumulation is confusing and dangerous and would much prefer a separate feature describing this.

Something I think we want to be aware of is non-standard schemes. These could be 1. non-constant reporting and 2. repeated reporting (some counts are reported twice as aggregates of different dates).

I haven't really seen the latter and it's quite an edge case, so it's unclear to me whether we really want to support it or not.

sbfnk commented 3 days ago

I'm open to suggestions and acknowledge there are dangers in overloading interpretations. My ideal would be one in which it's fairly straightforward (and safe) to handle the common cases of daily/weekly data on incidence/prevalence and missingness that could correspond to zeroes or missed reports.

jamesmbaazam commented 15 hours ago

With appropriate warning messages as suggested in #771, I think this is the best option, as it can take all information from a 2-column data frame as before:

distinguish between explicit (date exists in the data, value is NA meaning missing) vs. implicit (date doesn't exist in the data meaning accumulate) NAs

This can be checked with the test_data_complete() function introduced in #774, once that is merged.