Open benbelow opened 2 years ago
If we wanted an accurate assessment of difficulty, we would need to do the first step of match prediction (genotype expansion).
A less accurate (but cheaper) assessment would be to use the HLA name categoriser to count how many alleles are in the subject phenotype. We could also count the number of typed loci, as missing loci add more complexity to match prediction than low-res typings do. These two counts could be used in combination as the basis for "smart" batching.
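For illustration, a minimal sketch (not ATLAS code) of deriving those two counts from a donor phenotype - here `categorise_hla_name` stands in for the existing HLA name categoriser, and counting allele-level *loci* (so both counts top out at 5) is an assumption:

```python
from typing import Callable, Dict, Optional, Tuple

# locus name -> (typing at position 1, typing at position 2); None = untyped
Phenotype = Dict[str, Tuple[Optional[str], Optional[str]]]


def typing_counts(
    phenotype: Phenotype,
    categorise_hla_name: Callable[[str], str],
) -> Tuple[int, int]:
    """Return (typed_locus_count, allele_level_locus_count) as a cheap difficulty proxy."""
    typed_locus_count = 0
    allele_level_locus_count = 0
    for typings in phenotype.values():
        present = [t for t in typings if t]
        if not present:
            continue  # untyped loci add the most ambiguity downstream
        typed_locus_count += 1
        # A locus only counts as allele-level if every present typing is a single allele.
        if all(categorise_hla_name(t) == "Allele" for t in present):
            allele_level_locus_count += 1
    return typed_locus_count, allele_level_locus_count
```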
Typings with an allele count and typed locus count of 5 are likely to be "high res" (i.e., they map to only one small g group at each position) - it would make sense to put all such high-res typings in one batch, as they require relatively little processing.
The worst donors (typed count of 3) could be split into one donor per batch, to maximise use of horizontal scaling.
All remaining donors could be batched as-is, or in another "smart" way (e.g., ensuring an even distribution of donors with a typed count of 4). But it might be worth just doing the simplest approach as a spike first, to see if it makes any difference.
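A rough sketch of how those three rules could combine, assuming the `typing_counts` proxy above, a hypothetical donor object with a `phenotype` attribute, and the configured batch size:

```python
from itertools import islice


def smart_batches(donors, categorise_hla_name, batch_size):
    high_res, worst, remaining = [], [], []
    for donor in donors:
        typed, allele_level = typing_counts(donor.phenotype, categorise_hla_name)
        if typed == 5 and allele_level == 5:
            high_res.append(donor)    # likely one small g group per position: cheap
        elif typed == 3:
            worst.append(donor)       # most ambiguous: isolate for horizontal scaling
        else:
            remaining.append(donor)

    def chunk(items):
        it = iter(items)
        while batch := list(islice(it, batch_size)):
            yield batch

    batches = list(chunk(high_res))       # all cheap donors packed together
    batches.extend([d] for d in worst)    # one donor per batch
    batches.extend(chunk(remaining))      # everything else batched as-is
    return batches
```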
Donors are batched in sizes according to the MPA batch size config.
Donors can also vary a lot in how difficult they are to run MPA on - very well-typed donors will run significantly more quickly than very ambiguous ones.
The theory here is that if several ambiguous donors are batched together, we don't benefit from parallelisation of those donors - whereas if we batched low-res donors with high-res ones, we could ensure that we're making the most of our parallelisation.
This issue warrants some investigation - smart batching like this will require some way of checking the "difficulty" of a donor up front, and we need to be very careful that coming up with a method of assigning "difficulty" does not slow the algorithm down by more than the gains from more efficient batching (which is a very real possibility!)
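Purely as an illustration of the theory above (not a recommendation): one cheap way to mix difficulties within batches would be to sort donors by the proxy score and deal them out round-robin, e.g.:

```python
import math


def balanced_batches(donors, difficulty, batch_size):
    """difficulty: any cheap scoring function (e.g. derived from typing_counts above)."""
    ordered = sorted(donors, key=difficulty, reverse=True)  # hardest first
    batch_count = max(1, math.ceil(len(ordered) / batch_size))
    batches = [[] for _ in range(batch_count)]
    for i, donor in enumerate(ordered):
        batches[i % batch_count].append(donor)  # round-robin spreads difficulty evenly
    return batches
```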