Potential design issue with epix_merge COMPACTIFY

rnayebi21 commented 2 months ago

Was talking with @dshemetov about an example I came up with for epix_merge() see below:

s1 <- tibble(
  geo_value = c("ca", "ca", "ca"),
  time_value = as.Date(c("2024-08-01","2024-08-02","2024-08-03")),
  version = as.Date(c("2024-08-01","2024-08-02","2024-08-03")),
  signal1 = c("XA", "XB", "XC")
)

s2 <- tibble(
  geo_value = c("ca", "ca", "ca"),
  time_value = as.Date(c("2024-08-01","2024-08-01", "2024-08-02")),
  version = as.Date(c("2024-08-02","2024-08-04","2024-08-03")),
  signal2 = c("YA", "YB", "YC")
)

s1 <- s1 %>% as_epi_archive()
s2 <- s2 %>% as_epi_archive()

merged <- epix_merge(s1, s2, sync="locf", compactify = FALSE)
merged[["DT"]]

geo_value	time_value	version	signal1	signal2
ca	2024-08-01	2024-08-01	XA	NA
ca	2024-08-01	2024-08-02	XA	YA
ca	2024-08-01	2024-08-04	XA	YB
ca	2024-08-02	2024-08-02	XB	NA
ca	2024-08-02	2024-08-03	XB	YC
ca	2024-08-03	2024-08-03	XC	NA

here with COMPACTIFY = FALSE, we were wondering if there should be another version for 08-03-2024, for time value 08-01-2024. Where both signals are null?

However, if this was the case then for a given time value we would have a version for every date between the first version and last version, which could scale up very quickly. So I'm unsure if we would want to do this.

dshemetov commented 2 months ago

To clarify: the result can be seen as a mixed-compacted archive because there is no 20240803 version row for time_value 20240801, despite the values XA and YA already being recorded for both signals. So there's a consistency question as to what a compact and uncompact epi-archive implies. Does compact imply fully compact? Does non-compact imply all possible versions being listed? Should the outcome of a compactify = FALSE merge return a fully uncompact archive?

brookslogan commented 2 months ago

Archive DT's currently always represent updates --- these are like our DB tables, where they could represent diffs or they could represent re-recording the same value for various reasons. We initially had this more explicitly highlighted via naming, e.g. updates_dt or something like that. If the user wants to represent a removal, they currently have to manually put in a revision to NA and/or add a column tracking whether an associated signal value has been removed.

ryantibs(?) and I were discussing the possibility of making two different constructors: e.g., epi_archive_from_snapshots [which would insert some sort of "deletion" updates where appropriate] vs. epi_archive_from_{diffs/updates}, and we could maybe also similar extraction functions to sort of reverse these two operations. [Though we can't fully reverse compactification with the current format, since if there's a version completely identical to the last one, we omit any reference to it. This allows for a simpler archive format but might have motivated the perhaps-not-so-great version cadence inference stuff in epix_slide_ref_time_values_default.]

brookslogan commented 2 months ago

So in the current design:

"compact" / compactify=TRUE does imply fully compact (as far as I can tell, at least; things seem potentially trickier with the the numerical tolerance options David added, but AFAIK one "epsilon-compactify" step does produce a fully "epsilon-compact" archive)
Non-compact does NOT imply all possible versions / full snapshots being listed.
compactify = FALSE should not return a fully uncompact archive

cmu-delphi / epiprocess

Potential design issue with epix_merge COMPACTIFY #515