cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 8 forks source link

Potential design issue with epix_merge COMPACTIFY #515

Open rnayebi21 opened 2 months ago

rnayebi21 commented 2 months ago

Was talking with @dshemetov about an example I came up with for epix_merge() see below:

s1 <- tibble(
  geo_value = c("ca", "ca", "ca"),
  time_value = as.Date(c("2024-08-01","2024-08-02","2024-08-03")),
  version = as.Date(c("2024-08-01","2024-08-02","2024-08-03")),
  signal1 = c("XA", "XB", "XC")
)

s2 <- tibble(
  geo_value = c("ca", "ca", "ca"),
  time_value = as.Date(c("2024-08-01","2024-08-01", "2024-08-02")),
  version = as.Date(c("2024-08-02","2024-08-04","2024-08-03")),
  signal2 = c("YA", "YB", "YC")
)

s1 <- s1 %>% as_epi_archive()
s2 <- s2 %>% as_epi_archive()

merged <- epix_merge(s1, s2, sync="locf", compactify = FALSE)
merged[["DT"]]
geo_value time_value version signal1 signal2
ca 2024-08-01 2024-08-01 XA NA
ca 2024-08-01 2024-08-02 XA YA
ca 2024-08-01 2024-08-04 XA YB
ca 2024-08-02 2024-08-02 XB NA
ca 2024-08-02 2024-08-03 XB YC
ca 2024-08-03 2024-08-03 XC NA

here with COMPACTIFY = FALSE, we were wondering if there should be another version for 08-03-2024, for time value 08-01-2024. Where both signals are null?

However, if this was the case then for a given time value we would have a version for every date between the first version and last version, which could scale up very quickly. So I'm unsure if we would want to do this.

dshemetov commented 2 months ago

To clarify: the result can be seen as a mixed-compacted archive because there is no 20240803 version row for time_value 20240801, despite the values XA and YA already being recorded for both signals. So there's a consistency question as to what a compact and uncompact epi-archive implies. Does compact imply fully compact? Does non-compact imply all possible versions being listed? Should the outcome of a compactify = FALSE merge return a fully uncompact archive?

brookslogan commented 2 months ago

Archive DT's currently always represent updates --- these are like our DB tables, where they could represent diffs or they could represent re-recording the same value for various reasons. We initially had this more explicitly highlighted via naming, e.g. updates_dt or something like that. If the user wants to represent a removal, they currently have to manually put in a revision to NA and/or add a column tracking whether an associated signal value has been removed.

ryantibs(?) and I were discussing the possibility of making two different constructors: e.g., epi_archive_from_snapshots [which would insert some sort of "deletion" updates where appropriate] vs. epi_archive_from_{diffs/updates}, and we could maybe also similar extraction functions to sort of reverse these two operations. [Though we can't fully reverse compactification with the current format, since if there's a version completely identical to the last one, we omit any reference to it. This allows for a simpler archive format but might have motivated the perhaps-not-so-great version cadence inference stuff in epix_slide_ref_time_values_default.]

brookslogan commented 2 months ago

So in the current design: