Problem
Match prediction is an intensive process.
It is also deterministic: the same input will always produce the same output.
Caching calculation results could therefore save on computation.
Input for match prediction:
subject data: HLA phenotype & assigned HF frequency set (determined by subject eth/reg codes)
allowed loci
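Because the calculation is deterministic in exactly these inputs, a cache key must be derived from all of them, and from nothing else. A minimal sketch of such a key (the field names here are illustrative, not the real data model):

```python
import hashlib
import json

def imputation_cache_key(hla_phenotype: dict, hf_set_id: str, allowed_loci: list) -> str:
    """Build a deterministic cache key from the match prediction inputs.

    Field names are hypothetical; the point is that the key must cover
    every input that affects the output, and must be order-insensitive
    where the input is logically a set (e.g. allowed loci).
    """
    payload = json.dumps(
        {
            "hla": hla_phenotype,
            "hfSetId": hf_set_id,
            "loci": sorted(allowed_loci),  # loci form a set; sort for determinism
        },
        sort_keys=True,  # dict ordering must not change the key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on the assigned HF set ID (rather than the raw eth/reg codes) also gives a natural hook for invalidation when an HF set is re-uploaded.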
Discussion
Stages at which caching could be applied:
match probability response
patient-donor genotype pair matching
subject imputation
Match probability response
Inputs: patient subject data, donor subject data, allowed loci
Output: match probability response
Considerations:
This would be the biggest saving in terms of computation.
However, it is unclear how often cached responses would actually be reused, given the variability of patient-donor combinations.
I.e., we could end up caching lots of results but recalling very few of them from the cache!
Patient-donor genotype pair (PDGP) matching
Inputs: single patient genotype, single donor genotype, allowed loci
Output: match counts per PDGP and per locus of the pair
Considerations:
This step is essentially string matching
Recall rate for individual PDGPs would be significant
However, PDGP sets from a single patient-donor combo can get very large: genotype lists are truncated to 2000 entries each, so a single combo can yield up to 2000 x 2000 = 4 million pairs
So amount of cached data could balloon from just a single search!
Further, 10s of millions of lookups per search is not ideal when using a distributed cache.
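The ballooning risk above follows directly from the genotype list cap; a quick sanity check of the arithmetic:

```python
GENOTYPE_LIST_CAP = 2000  # per-subject cap from the optimisation strategy

def pdgp_count(patient_genotypes: int, donor_genotypes: int) -> int:
    """Number of patient-donor genotype pairs to match: the cross product
    of the two truncated genotype lists."""
    return patient_genotypes * donor_genotypes

# Worst case for a single patient-donor combination:
assert pdgp_count(GENOTYPE_LIST_CAP, GENOTYPE_LIST_CAP) == 4_000_000
```

Multiplied across a search with thousands of donors, caching at this granularity would write tens of billions of tiny entries, which is what makes the distributed-cache round trips per lookup a concern.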
Subject imputation
Inputs: subject data (patient or donor), allowed loci
Output: truncated list of genotypes with their likelihoods
Considerations:
Number of genotypes in the list can range from 0 ("inexplicable") to 2000 (capped as part of optimisation strategy)
Number of lookups per search would range from 0 (no matching donors), through 2 (one patient & one donor), to thousands ("worst case" search with thousands of distinct donor phenotypes)
At the very least, this would halve the number of imputations: currently the patient is re-imputed during every patient-donor match probability request, even though the patient data within a search never changes!
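The halving claim can be illustrated with a simple memoisation sketch (names are hypothetical, not the real API; the cached function stands in for the expensive genotype expansion):

```python
from functools import lru_cache

imputation_runs = 0  # counts how many times the expensive step actually runs

@lru_cache(maxsize=None)
def impute(subject_key: str) -> tuple:
    """Stand-in for the expensive imputation step, memoised per subject."""
    global imputation_runs
    imputation_runs += 1
    return (subject_key,)  # placeholder for a genotype list

def match_probability(patient_key: str, donor_key: str):
    # Each match probability request needs both subjects' imputation results.
    return impute(patient_key), impute(donor_key)

for donor in ["d1", "d2", "d3"]:
    match_probability("patient", donor)

# Without caching: 6 imputations (the patient re-imputed for every donor).
# With caching: 4 (the patient once, plus each distinct donor once).
assert imputation_runs == 4
```

With many donors per search, the saving approaches a strict halving, and repeated donor phenotypes across searches would push it further.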
Cache Invalidation/Expiry
Memory management
Cache memory is limited so cached data should have TTL
Each entry should have a countdown to expiry that is reset every time the data is accessed; that way, results from rarer phenotypes will expire quickly and not take up space in the cache
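This "countdown reset on access" behaviour is sliding expiration; a minimal sketch of the idea (real cache libraries provide this out of the box, e.g. sliding expiry options in .NET's in-memory cache):

```python
import time

class SlidingTtlCache:
    """Minimal sketch of a cache with sliding expiry: every read resets the
    entry's countdown, so rarely-recalled phenotypes fall out first."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, last_access_time)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, last_access = entry
        if time.monotonic() - last_access > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        self._store[key] = (value, time.monotonic())  # reset the countdown
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())
```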
Could possibly persist imputation results to the db, as SQL Server storage is relatively cheap, and the table can be indexed by a unique cache key, which should mean performant lookups
Could store imputation result as simple JSON blob
AddOrGetFromCache logic would then be:
If subject is in cache, then get results
If subject not in cache, then check db:
If db has results, then push to cache and return results.
Else, run subject imputation, persist results to db, push to cache, and return results.
May also want a TTL on db entries to prevent bloating of the table - it would be much longer than the cache expiry, and expired rows could be removed by a cron job.
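The AddOrGetFromCache fallthrough above can be sketched as follows (`cache` and `db` here are any dict-like stores, and all names are illustrative):

```python
def add_or_get_from_cache(subject_key, cache, db, run_imputation):
    """Sketch of the cache -> db -> compute fallthrough.

    `run_imputation` is the expensive calculation; it only runs when both
    the cache and the db miss. A db hit is promoted back into the cache.
    """
    result = cache.get(subject_key)
    if result is not None:
        return result  # cache hit
    result = db.get(subject_key)
    if result is not None:
        cache[subject_key] = result  # db hit: push to cache for next time
        return result
    result = run_imputation(subject_key)  # miss everywhere: compute
    db[subject_key] = result     # persist for future searches
    cache[subject_key] = result  # and serve hot for this search
    return result
```

Note this sketch ignores concurrency: two simultaneous misses for the same subject would both run the imputation, which is wasteful but harmless given the calculation is deterministic.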
HF Set Upload
Upload of a new HF set for a given population should invalidate the relevant subset of subject imputation results (both in cache and db layers)
But this is only true if the contents of the new HF set actually differ from the previously active one
HF set import should be extended with such a check to prevent unnecessary cache invalidation
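One way to implement that check is an order-insensitive checksum over the set's contents, compared at import time (a sketch under assumed shapes; the real import would hash the parsed haplotype/frequency records):

```python
import hashlib

def hf_set_checksum(haplotype_frequencies: dict) -> str:
    """Order-insensitive checksum of an HF set's contents.
    Keys are haplotype names, values are frequencies (hypothetical shape)."""
    lines = sorted(f"{hap}:{freq!r}" for hap, freq in haplotype_frequencies.items())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

def should_invalidate(previous_set: dict, new_set: dict) -> bool:
    """Only invalidate cached imputation results for this population if the
    new HF set's contents actually differ from the previously active one."""
    return hf_set_checksum(previous_set) != hf_set_checksum(new_set)
```

Storing the checksum alongside the active HF set would make the comparison a single lookup rather than a full re-hash of the old set.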
Conclusion
The first implementation of the solution will persist imputation results to an existing data store, most likely the shared SQL Server db.
Redis can reduce latency but is expensive, and our primary goal is to save cost, not necessarily reduce search times (though caching should achieve that as well).
We can revisit the use of a distributed cache for match prediction once we have a clearer idea of how much data is being generated from live search and its rate of recall.
Implementation:
Can reduce the number of chars to cache by storing a pair of Haplotype IDs to represent each genotype, rather than the HLA itself.
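A rough illustration of the saving (the HLA strings and ID values below are made-up examples, not real data):

```python
# A genotype as a pair of full HLA haplotype strings (hypothetical example):
hla_genotype = (
    "A*01:01~B*08:01~C*07:01~DQB1*02:01~DRB1*03:01",
    "A*02:01~B*44:02~C*05:01~DQB1*03:01~DRB1*04:01",
)

# The same genotype as a pair of haplotype row IDs (hypothetical values):
id_genotype = (10342, 87120)

hla_chars = sum(len(h) for h in hla_genotype)          # ~90 characters
id_chars = sum(len(str(i)) for i in id_genotype)       # ~10 characters
assert id_chars < hla_chars  # the ID pair is an order of magnitude smaller
```

The IDs would need to be stable references into the active HF set's haplotype table, so this representation is only valid for as long as that HF set remains active, which ties in with the invalidation-on-upload logic above.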