brookslogan opened this issue 2 years ago
Tentatively marking this as P2, unless epipredict development reveals that it is a major issue.
Partially addressed in https://github.com/cmu-delphi/epiprocess/pull/386. Some additional performance ideas here and here.
Not sure where to post this, but this seems like a close enough issue. Coming from a discussion with @brookslogan and @dsweber2: we're looking for a "sparse archive double slide" function. What we mean is constructing features for every `as_of` in an archive, but using a smarter approach than repeatedly calling `epix_as_of`, applying `epi_slide`, converting back to an archive, and compactifying. The faster approach would compactify as you go.
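A sketch of what such a compactify-as-you-go double slide could look like. This is illustrative Python, not the real implementation (which would operate on the archive's `data.table` in R); the function name and data layout here are hypothetical, and it computes a simple trailing mean as the stand-in feature:

```python
def sparse_double_slide(updates, window=2):
    """Compute a trailing-window mean for every version of an archive of
    compactified updates, compactifying the output as we go.

    updates: dict version -> {time_value: value}, recording only cells
    that changed in that version (integer time_values for simplicity).
    Returns: dict version -> {time_value: feature}, containing only the
    feature values that differ from the previous version's.
    """
    snapshot = {}      # current as_of state, carried between versions
    prev_feats = {}    # last reported feature value per time_value
    out = {}
    for version in sorted(updates):
        changed = updates[version]
        snapshot.update(changed)  # apply the diff; no full re-filtering
        # Only windows touching a changed cell can produce a new feature.
        dirty = {t + k for t in changed for k in range(window)}
        diff = {}
        for t in sorted(dirty):
            vals = [snapshot[u] for u in range(t - window + 1, t + 1)
                    if u in snapshot]
            if not vals:
                continue
            feat = sum(vals) / len(vals)  # mean over available values
            if prev_feats.get(t) != feat:  # compactify as we go
                diff[t] = feat
                prev_feats[t] = feat
        if diff:
            out[version] = diff
    return out
```

The point of the sketch: each version's work is proportional to the size of its update (plus the affected windows), not to the size of the full snapshot, which is the hoped-for win over repeated `epix_as_of` + `epi_slide` + re-compactify.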
Currently, `epix_slide` and `as_of` have reasonable performance for pseudoprospective analysis of some state-level forecasters, such as those predicting 23 quantiles using `rq`. However, for more efficient forecasters (using `rq.fit`, `lm`, or `lm.fit`) or simpler operations such as 7-day averaging, the overhead becomes noticeable or can even dwarf the actual operation's computation time. Some ideas for improvements:

- Instead of a single `data.table`, stratify by some/all non-version key columns, or by chunks of non-version key column values. Drop parts of the key in these table chunks if that helps with memory/speed. Filtering based on key-without-version values can be done in another data structure that holds all these table chunks together. The goal would be to (a) avoid any searching for the non-version part of the key that might be done for every non-version key value as part of `as_of`, and (b) avoid extra work in the version search caused by operations on non-version parts of the key. Stratifying too finely could be counterproductive due to (a) overhead from looping over chunks in R and FFI (if looping is done in R), and (b) reduced spatial locality.
- In `$slide`, do something smarter than looking up each requested snapshot separately: apply updates to get from one snapshot to the next; or start searching for the next version of each observation from where the previous one was located; or check whether the previous version of some value will still be there in the next version, and only perform lookups for those that will change. Using a different key ordering and/or a secondary index might help here. [And/or some tricks with `as.list`, the `sorted` attribute, and `setDT`. But these tricks might not be necessary; if there is an efficient issue-set partitioning function / application of `by` & `.SD`, the DT could be split into parts with issue <= the min(ref_time_value), min(ref_time_value) < issue <= second-lowest(ref_time_value), etc., each part smashed into a simplified update using rolling joins, and then the slide would simply apply the updates sequentially... or some modification of this to handle the simultaneously rolling `time_value` window.]
- For `all_versions=TRUE`, consider performing the partition above and iteratively appending to get the appropriate DTs, rather than filtering the original. Or figure out some way to avoid forming a separate DT at all, like a filter-view subclass.
- Use a coarser (non-version) key in the `DT`; then, if an update changes one value with a particular non-version key, it must re-report all unchanged values with the same non-version key as well. For example, if the DT contains full snapshots for every version, then the non-version key could be empty. Or the non-version key could be just the `time_value`, if every geo's data for a time value is re-reported whenever any geo's data for that time value has been revised. This reduces the key size and maybe the number of lookups required, but will probably necessitate some changes to how `as_of`, etc., are implemented.
- Speed up construction of the `epi_df` in `$as_of`.
- Record `version_lag` rather than `version`, transforming some computations as required:
  - Use `version_lag` as the first part of the key, allowing fast filters based on `before`, but requiring the use of `unique` rather than potentially using rolling joins (although an index could be used to speed these up).
  - Or keep the current key, adding a `version_lag` column and storing an index for it, to speed up the `before` filter.

Any performance improvements here should be planned together with considerations of alternatives to storing data in RAM. E.g., if RAM storage is abandoned, then some of these potential improvements may not apply, or would need to be implemented in a different way. [They should also be coordinated with any move to S3 + dtplyr; completing these changes in parallel would likely involve substantial merge conflicts.]
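To make the coarser-key idea concrete, here is an illustrative Python sketch (the actual archive is an R `data.table`; the function name and layout are hypothetical) of re-keying updates by `time_value` only, re-reporting every geo whenever any geo's value for that time value is revised:

```python
def coarsen_by_time_value(updates):
    """updates: dict version -> {(geo, time_value): value}, compactified
    per cell. Returns dict version -> {time_value: {geo: value}}: if any
    geo's value for a time_value changes in a version, every geo's
    current value for that time_value is re-reported.
    """
    snapshot = {}  # current (geo, time_value) -> value state
    out = {}
    for version in sorted(updates):
        snapshot.update(updates[version])
        touched = {t for (_geo, t) in updates[version]}
        out[version] = {
            t: {g: v for (g, tt), v in snapshot.items() if tt == t}
            for t in touched
        }
    return out
```

With this layout, an `as_of`-style lookup only needs to search on (`time_value`, version), not on geo, at the cost of storing re-reported, unchanged geo values.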
[Consider also replacing `tibble::as_tibble()` with `setDF` in `as_of`, although this may need checks that the original DT or DT columns aren't aliased in exceptional circumstances.]