brookslogan opened this issue 2 years ago
Tentatively marking this as P2, unless epipredict development reveals that it is a major issue.
Partially addressed in https://github.com/cmu-delphi/epiprocess/pull/386. Some additional performance ideas here and here.
Not sure where to post this, but this seems like a close enough issue. Coming from a discussion with @brookslogan and @dsweber2: we're looking for a "sparse archive double slide" function. What we mean is constructing features for every `as_of` in an archive, but using a smarter approach than repeatedly calling `epix_as_of`, applying `epi_slide`, converting back to an archive, and compactifying. The faster approach would compactify as you go.
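A sketch of what such a compactify-as-you-go double slide could look like. This is illustrative Python, not the real implementation (which would operate on the archive's `data.table` in R); the function name and data layout here are hypothetical, and it computes a simple trailing mean as the stand-in feature:

```python
def sparse_double_slide(updates, window=2):
    """Compute a trailing-window mean for every version of an archive of
    compactified updates, compactifying the output as we go.

    updates: dict version -> {time_value: value}, recording only cells
    that changed in that version (integer time_values for simplicity).
    Returns: dict version -> {time_value: feature}, containing only the
    feature values that differ from the previous version's.
    """
    snapshot = {}      # current as_of state, carried between versions
    prev_feats = {}    # last reported feature value per time_value
    out = {}
    for version in sorted(updates):
        changed = updates[version]
        snapshot.update(changed)  # apply the diff; no full re-filtering
        # Only windows touching a changed cell can produce a new feature.
        dirty = {t + k for t in changed for k in range(window)}
        diff = {}
        for t in sorted(dirty):
            vals = [snapshot[u] for u in range(t - window + 1, t + 1)
                    if u in snapshot]
            if not vals:
                continue
            feat = sum(vals) / len(vals)  # mean over available values
            if prev_feats.get(t) != feat:  # compactify as we go
                diff[t] = feat
                prev_feats[t] = feat
        if diff:
            out[version] = diff
    return out
```

The point of the sketch: each version's work is proportional to the size of its update (plus the affected windows), not to the size of the full snapshot, which is the hoped-for win over repeated `epix_as_of` + `epi_slide` + re-compactify.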
Currently, `epix_slide` and `as_of` have reasonable performance for pseudoprospective analysis of some state-level forecasters, such as those predicting 23 quantiles using `rq`. However, for more efficient forecasters (using `rq.fit`, `lm`, or `lm.fit`) or simpler operations such as 7-day averaging, the overhead becomes noticeable or can even dwarf the actual operation's computation time. Some ideas for improvements:

- Instead of a single `data.table`, stratify by some/all non-version key columns, or by chunks of non-version key column values. Drop parts of the key in these table chunks if that helps with memory/speed. Filtering based on key-without-version values can be done in another data structure that holds all these table chunks together. The goal would be to (a) avoid any searching for the non-version part of the key that might be done for every non-version key value as part of `as_of`, and (b) avoid extra work in the version search caused by operations on non-version parts of the key. Stratifying too finely could be counterproductive due to (a) overhead from looping over chunks in R and FFI (if looping is done in R), and (b) reduced spatial locality.
- In `$slide`, do something smarter than looking up each requested snapshot separately: apply updates to get from one snapshot to the next; or start searching for the next version of each observation from where the previous one was located; or check whether the previous version of some value will still be there in the next version, and only perform lookups for those that will change. Using a different key ordering and/or a secondary index might help here. [And/or some tricks with `as.list`, the `sorted` attribute, and `setDT`. But these tricks might not be necessary; if there is an efficient issue-set partitioning function / application of `by` & `.SD`, the DT could be split into parts with issue <= the min(ref_time_value), min(ref_time_value) < issue <= second-lowest(ref_time_value), etc., each part smashed into a simplified update using rolling joins, and then the slide would simply apply the updates sequentially... or some modification of this to handle the simultaneously rolling `time_value` window.]
- For `all_versions=TRUE`, consider performing the partition above and iteratively appending to get the appropriate DTs, rather than filtering the original. Or figure out some way to avoid forming a separate DT at all, like a filter-view subclass.
- Use a coarser (non-version) key in the `DT`; then, if an update changes one value with a particular non-version key, it must re-report all unchanged values with the same non-version key as well. For example, if the DT contains full snapshots for every version, then the non-version key could be empty. Or the non-version key could be just the `time_value`, if every geo's data for a time value is re-reported whenever any geo's data for that time value has been revised. This reduces the key size and maybe the number of lookups required, but will probably necessitate some changes to how `as_of`, etc., are implemented.
- Speed up construction of the `epi_df` in `$as_of`.
- Record `version_lag` rather than `version`, transforming some computations as required:
  - Use `version_lag` as the first part of the key, allowing fast filters based on `before`, but requiring the use of `unique` rather than potentially using rolling joins (although an index could be used to speed these up).
  - Or keep the current key, adding a `version_lag` column and storing an index for it, to speed up the `before` filter.

Any performance improvements here should be planned together with considerations of alternatives to storing data in RAM. E.g., if RAM storage is abandoned, then some of these potential improvements may not apply, or would need to be implemented in a different way. [They should also be coordinated with any move to S3 + dtplyr; completing these changes in parallel would likely involve substantial merge conflicts.]
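To make the coarser-key idea concrete, here is an illustrative Python sketch (the actual archive is an R `data.table`; the function name and layout are hypothetical) of re-keying updates by `time_value` only, re-reporting every geo whenever any geo's value for that time value is revised:

```python
def coarsen_by_time_value(updates):
    """updates: dict version -> {(geo, time_value): value}, compactified
    per cell. Returns dict version -> {time_value: {geo: value}}: if any
    geo's value for a time_value changes in a version, every geo's
    current value for that time_value is re-reported.
    """
    snapshot = {}  # current (geo, time_value) -> value state
    out = {}
    for version in sorted(updates):
        snapshot.update(updates[version])
        touched = {t for (_geo, t) in updates[version]}
        out[version] = {
            t: {g: v for (g, tt), v in snapshot.items() if tt == t}
            for t in touched
        }
    return out
```

With this layout, an `as_of`-style lookup only needs to search on (`time_value`, version), not on geo, at the cost of storing re-reported, unchanged geo values.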
[Consider also replacing `tibble::as_tibble()` with `setDF` in `as_of`, although this may need checks that the original DT or DT columns aren't aliased in exceptional circumstances.]