broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other

Extract Cohort optimizations [VS-493] [VS-1516] #9055

Open mcovarr opened 3 days ago

mcovarr commented 3 days ago

Integration test run here.

Follows the breadcrumbs in VS-493 to drop filter set info and filter set sites that fall outside the variant locations of the samples being extracted.
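For readers unfamiliar with the idea, here is a rough sketch of the pruning described above. It is not the code in this PR: the function names, the `(sample_id, location)` row shape, and the dict-of-records layout are all hypothetical, chosen only to illustrate keeping filter records at locations where an extracted sample has a variant.

```python
from typing import Dict, Iterable, Set, Tuple


def variant_locations(
    variant_rows: Iterable[Tuple[str, int]], sample_ids: Set[str]
) -> Set[int]:
    """Collect locations where at least one extracted sample has a variant.

    `variant_rows` is a hypothetical stream of (sample_id, location) pairs.
    """
    return {loc for sample_id, loc in variant_rows if sample_id in sample_ids}


def prune_by_location(records: Dict[int, str], keep: Set[int]) -> Dict[int, str]:
    """Keep only the records whose location is in `keep`."""
    return {loc: value for loc, value in records.items() if loc in keep}


# Toy data: three samples, of which only s1 and s2 are being extracted.
variant_rows = [("s1", 100), ("s2", 200), ("s3", 300)]
extracted = {"s1", "s2"}

# Hypothetical filter records keyed by location.
filter_info = {100: "PASS", 200: "LOW_QUAL", 300: "PASS", 400: "PASS"}

keep = variant_locations(variant_rows, extracted)
pruned = prune_by_location(filter_info, keep)
dropped_fraction = 1 - len(pruned) / len(filter_info)
```

In this toy case half of the filter records are dropped. For a small subcohort of a large callset the fraction dropped can be much larger, since most locations carry no variant for the extracted samples.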

Test runs here:

Findings:

koncheto-broad commented 3 days ago

The correctness comparisons I mentioned are between your subcohort BGE extracts from the WGS 3k callset pulling from ah_var_store and the same extracts run from this branch. In theory you can also compare memory usage between them and document it, to see whether the code affects sub-region extracts as well as subcohort extracts (although those results will not gate merging this PR).

mcovarr commented 2 days ago

I did the BGE correctness comparisons mentioned above and everything tied out perfectly with respect to ah_var_store, while dropping > 99% of filter set info and > 98% of filter set sites. The runtimes of these extracts are even shorter than they were for the WGS dataset, so the graphs are not going to be terribly informative. I'm thinking of reaching out to see if we can run this code against a larger AoU dataset after the break.