Problem
Match prediction is an intensive process.
It is also deterministic: the same input will always produce the same output.
Caching calculation results could therefore save on computation.
Input for match prediction:
subject data: HLA phenotype & assigned HF frequency set (determined by subject eth/reg codes)
allowed loci
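Because the calculation is deterministic in exactly these inputs, a cache key must be derived from all of them, and from nothing else. A minimal sketch of such a key (the field names here are illustrative, not the real data model):

```python
import hashlib
import json

def imputation_cache_key(hla_phenotype: dict, hf_set_id: str, allowed_loci: list) -> str:
    """Build a deterministic cache key from the match prediction inputs.

    Field names are hypothetical; the point is that the key must cover
    every input that affects the output, and must be order-insensitive
    where the input is logically a set (e.g. allowed loci).
    """
    payload = json.dumps(
        {
            "hla": hla_phenotype,
            "hfSetId": hf_set_id,
            "loci": sorted(allowed_loci),  # loci form a set; sort for determinism
        },
        sort_keys=True,  # dict ordering must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on the assigned HF set ID (rather than the raw eth/reg codes) also gives a natural hook for invalidation when an HF set is re-uploaded.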
Discussion
Stages at which caching could be applied:
match probability response
patient-donor genotype pair matching
subject imputation
Match probability response
Inputs: patient subject data, donor subject data, allowed loci
Output: match probability response
Considerations:
This would be the biggest saving in terms of computation.
However, it is unclear how often cached responses would actually be reused, given the variability of patient-donor combinations.
I.e., we could end up caching lots of results but recalling very few of them from the cache!
Patient-donor genotype pair (PDGP) matching
Inputs: single patient genotype, single donor genotype, allowed loci
Output: match counts per PDGP and per locus of the pair
Considerations:
This step is essentially string matching
Recall rate for individual PDGPs would be significant
However, PDGP sets from a single patient-donor combo can get very large: genotype lists are truncated to 2000 entries each, so a single combo can yield up to 2000 x 2000 = 4 million pairs
So amount of cached data could balloon from just a single search!
Further, 10s of millions of lookups per search is not ideal when using a distributed cache.
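The ballooning risk above follows directly from the genotype list cap; a quick sanity check of the arithmetic:

```python
GENOTYPE_LIST_CAP = 2000  # per-subject cap from the optimisation strategy

def pdgp_count(patient_genotypes: int, donor_genotypes: int) -> int:
    """Number of patient-donor genotype pairs to match: the cross product
    of the two truncated genotype lists."""
    return patient_genotypes * donor_genotypes

# Worst case for a single patient-donor combination:
assert pdgp_count(GENOTYPE_LIST_CAP, GENOTYPE_LIST_CAP) == 4_000_000
```

Multiplied across a search with thousands of donors, caching at this granularity would write tens of billions of tiny entries, which is what makes the distributed-cache round trips per lookup a concern.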
Subject imputation
Inputs: subject data (patient or donor), allowed loci
Output: truncated list of genotypes with their likelihoods
Considerations:
Number of genotypes in the list can range from 0 ("inexplicable") to 2000 (capped as part of optimisation strategy)
Number of lookups per search would range from 0 (no matching donors), through 2 (one patient & one donor), to thousands ("worst case" search with thousands of distinct donor phenotypes)
At the very least, this would halve the number of imputations: currently the patient is re-imputed during every patient-donor match probability request, even though the patient data within a search never changes!
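The halving claim can be illustrated with a simple memoisation sketch (names are hypothetical, not the real API; the cached function stands in for the expensive genotype expansion):

```python
from functools import lru_cache

imputation_runs = 0  # counts how many times the expensive step actually runs

@lru_cache(maxsize=None)
def impute(subject_key: str) -> tuple:
    """Stand-in for the expensive imputation step, memoised per subject."""
    global imputation_runs
    imputation_runs += 1
    return (subject_key,)  # placeholder for a genotype list

def match_probability(patient_key: str, donor_key: str):
    # Each match probability request needs both subjects' imputation results.
    return impute(patient_key), impute(donor_key)

for donor in ["d1", "d2", "d3"]:
    match_probability("patient", donor)

# Without caching: 6 imputations (the patient re-imputed for every donor).
# With caching: 4 (the patient once, plus each distinct donor once).
assert imputation_runs == 4
```

With many donors per search, the saving approaches a strict halving, and repeated donor phenotypes across searches would push it further.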
Cache Invalidation/Expiry
Memory management
Cache memory is limited so cached data should have TTL
Each entry should have a countdown to expiry that is reset every time the data is accessed; that way, results from rarer phenotypes will expire quickly and not take up space in the cache
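This "countdown reset on access" behaviour is sliding expiration; a minimal sketch of the idea (real cache libraries provide this out of the box, e.g. sliding expiry options in .NET's in-memory cache):

```python
import time

class SlidingTtlCache:
    """Minimal sketch of a cache with sliding expiry: every read resets the
    entry's countdown, so rarely-recalled phenotypes fall out first."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, last_access_time)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, last_access = entry
        if time.monotonic() - last_access > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        self._store[key] = (value, time.monotonic())  # reset the countdown
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())
```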
Could possibly persist imputation results to the db, as SQL Server storage is relatively cheap, and the table can be indexed by a unique cache key, which should mean performant lookups
Could store imputation result as simple JSON blob
AddOrGetFromCache logic would then be:
If subject is in cache, then get results
If subject not in cache, then check db:
If db has results, then push to cache and return results.
Else, run subject imputation, persist results to db, push to cache, and return results.
May also want a TTL on db entries to prevent bloating of the table - it would be much longer than the cache expiry, and expired rows could be removed by a cron job.
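The AddOrGetFromCache fallthrough above can be sketched as follows (`cache` and `db` here are any dict-like stores, and all names are illustrative):

```python
def add_or_get_from_cache(subject_key, cache, db, run_imputation):
    """Sketch of the cache -> db -> compute fallthrough.

    `run_imputation` is the expensive calculation; it only runs when both
    the cache and the db miss. A db hit is promoted back into the cache.
    """
    result = cache.get(subject_key)
    if result is not None:
        return result  # cache hit
    result = db.get(subject_key)
    if result is not None:
        cache[subject_key] = result  # db hit: push to cache for next time
        return result
    result = run_imputation(subject_key)  # miss everywhere: compute
    db[subject_key] = result     # persist for future searches
    cache[subject_key] = result  # and serve hot for this search
    return result
```

Note this sketch ignores concurrency: two simultaneous misses for the same subject would both run the imputation, which is wasteful but harmless given the calculation is deterministic.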
HF Set Upload
Upload of a new HF set for a given population should invalidate the relevant subset of subject imputation results (both in cache and db layers)
But this is only true if the contents of the new HF set actually differ from the previously active one
HF set import should be extended with such a check to prevent unnecessary cache invalidation
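One way to implement that check is an order-insensitive checksum over the set's contents, compared at import time (a sketch under assumed shapes; the real import would hash the parsed haplotype/frequency records):

```python
import hashlib

def hf_set_checksum(haplotype_frequencies: dict) -> str:
    """Order-insensitive checksum of an HF set's contents.
    Keys are haplotype names, values are frequencies (hypothetical shape)."""
    lines = sorted(f"{hap}:{freq!r}" for hap, freq in haplotype_frequencies.items())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

def should_invalidate(previous_set: dict, new_set: dict) -> bool:
    """Only invalidate cached imputation results for this population if the
    new HF set's contents actually differ from the previously active one."""
    return hf_set_checksum(previous_set) != hf_set_checksum(new_set)
```

Storing the checksum alongside the active HF set would make the comparison a single lookup rather than a full re-hash of the old set.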
Conclusion
The first implementation of the solution will persist imputation results to an existing data store, most likely the shared SQL Server db.
Redis can reduce latency but is expensive, and our primary goal is to save cost, not necessarily reduce search times (though caching should achieve that as well).
We can revisit the use of a distributed cache for match prediction once we have a clearer idea of how much data is being generated from live search and its rate of recall.
Implementation:
Can reduce the number of chars to cache by storing a pair of Haplotype IDs to represent each genotype, rather than the HLA itself.
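A rough illustration of the saving (the HLA strings and ID values below are made-up examples, not real data):

```python
# A genotype as a pair of full HLA haplotype strings (hypothetical example):
hla_genotype = (
    "A*01:01~B*08:01~C*07:01~DQB1*02:01~DRB1*03:01",
    "A*02:01~B*44:02~C*05:01~DQB1*03:01~DRB1*04:01",
)

# The same genotype as a pair of haplotype row IDs (hypothetical values):
id_genotype = (10342, 87120)

hla_chars = sum(len(h) for h in hla_genotype)          # ~90 characters
id_chars = sum(len(str(i)) for i in id_genotype)       # ~10 characters
assert id_chars < hla_chars  # the ID pair is an order of magnitude smaller
```

The IDs would need to be stable references into the active HF set's haplotype table, so this representation is only valid for as long as that HF set remains active, which ties in with the invalidation-on-upload logic above.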