Extract Cohort optimizations [VS-493] [VS-1516]

mcovarr commented 3 days ago

Integration test run here.

Follows the breadcrumbs in VS-493 to drop filter info and sites that fall outside the variant locations of the samples being extracted.
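To make the idea concrete, here is a minimal sketch of the pruning described above. It is not the code on this branch: `FilterRow`, `prune_filter_rows`, `percent_dropped`, and the integer `location` encoding are all hypothetical names chosen for illustration. It assumes filter rows can be keyed by a single location value that is comparable with the variant locations of the extracted samples.

```python
from typing import Iterable, List, NamedTuple


class FilterRow(NamedTuple):
    """Hypothetical stand-in for one row of filter info."""
    location: int  # assumed: a single integer encoding of contig + position
    payload: str


def prune_filter_rows(filter_rows: Iterable[FilterRow],
                      variant_locations: Iterable[int]) -> List[FilterRow]:
    """Keep only filter rows at locations where the extracted samples have variants."""
    keep = set(variant_locations)  # duplicates across samples collapse here
    return [row for row in filter_rows if row.location in keep]


def percent_dropped(before: int, after: int) -> float:
    """Percentage of rows removed by pruning; 0.0 when there was nothing to prune."""
    return 100.0 * (before - after) / before if before else 0.0


# Toy data: four filter rows, but the extracted samples only vary at two locations.
rows = [FilterRow(100, "a"), FilterRow(200, "b"), FilterRow(300, "c"), FilterRow(400, "d")]
sample_variant_locations = [200, 400, 400]

kept = prune_filter_rows(rows, sample_variant_locations)
print(len(kept), percent_dropped(len(rows), len(kept)))
```

The smaller the extracted subcohort relative to the full callset, the fewer locations end up in `keep`, so a larger share of filter rows can be dropped.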

Test runs here:

Findings:

This is the biggest non-AoU dataset we have, but it is decidedly on the small side for measuring the effects of changes like this.
For the particular runs linked above, the code on this branch dropped ~85% of unnecessary filter info.
Presumably because there is less work being done, these extracts run more quickly than the baseline code.
It is questionable whether these changes would help with "small subsets". In the "small subset" use cases we extract all samples, but only the variant data over a specified interval list. I don't think the current logic is informed by interval lists; we should look into this further.

koncheto-broad commented 3 days ago

The correctness comparisons I mentioned are between your subcohort BGE extracts from the WGS 3k callset pulling from ah_var_store and from this branch. In theory you can also compare memory usage between them and document it, to see whether the code affects sub-region extracts as well as subcohort extracts (although those results will not gate merging this PR).

mcovarr commented 2 days ago

I did the BGE correctness comparisons mentioned above and everything tied out perfectly wrt ah_var_store, while dropping > 99% of filter set info and > 98% of filter set sites. The runtimes of these extracts are even shorter than they were for the WGS dataset, so the graphs are not going to be terribly informative. I'm thinking of reaching out to see if we can run this code against a larger AoU dataset after the break.

broadinstitute / gatk #9055