broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

Reimplement ParseGenotypes in GenotypeComplexVariants #614

mwalker174 closed 3 weeks ago

mwalker174 commented 8 months ago

This PR is a critical optimization to the ParseGenotypes task: it reimplements process_posthoc_cpx_depth_regenotyping.py with greatly accelerated computations.

The previous version of the script relied on many repeated quadratic (O(N^2)) operations, which caused it to grind to a halt in problematic regions of a 98,000-sample call set, with shards taking more than a week to complete. With this change, a call-cached run of the same workflow on chr2 took <24 hr and cost $2.73, requiring <15 GB of memory per shard in ParseGenotypes.
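For readers unfamiliar with this class of fix, here is a minimal Python sketch of the general pattern: replacing a repeated linear scan with an index built once. It is not taken from the script. The function names, the `(variant_id, sample, copy_number)` tuple layout, and the example data are all hypothetical and only illustrate why the quadratic form scales poorly.

```python
from collections import defaultdict

def group_quadratic(variant_ids, records):
    """Rescan every record for every variant ID: O(N * M) comparisons."""
    return {
        vid: [rec for rec in records if rec[0] == vid]
        for vid in variant_ids
    }

def group_indexed(variant_ids, records):
    """Build a hash index once, then look up each ID: O(N + M)."""
    by_vid = defaultdict(list)
    for rec in records:
        by_vid[rec[0]].append(rec)
    # .get() avoids inserting empty lists for IDs with no records
    return {vid: by_vid.get(vid, []) for vid in variant_ids}

# Hypothetical records: (variant_id, sample, copy_number)
records = [
    ("cpx_1", "sampleA", 2),
    ("cpx_2", "sampleA", 3),
    ("cpx_1", "sampleB", 1),
]
variant_ids = ["cpx_1", "cpx_2", "cpx_3"]

# Both forms return the same grouping; only the cost differs.
assert group_quadratic(variant_ids, records) == group_indexed(variant_ids, records)
```

Both functions return identical results, which matches the goal stated in this PR of preserving exact functionality while reducing run time.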

In a few places, I've pointed out some possible bugs with #TODOs. In the interest of preserving exact functionality, I did not attempt to address these (they mostly seemed minor), but they should be revisited in the future.

An issue where two UNRESOLVED filter statuses were being applied to some records was also corrected.
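If the problem was a repeated value in the FILTER field (an assumption; the description does not say how the fix was made), one simple guard is to de-duplicate the filter list while preserving order. The sketch below uses a hypothetical helper, `unique_filters`, and plain Python lists rather than any particular VCF library.

```python
def unique_filters(filters):
    """Drop repeated FILTER values, keeping first-seen order."""
    # dict preserves insertion order, so fromkeys() de-duplicates in order
    return list(dict.fromkeys(filters))

# Hypothetical FILTER values for one record
filters = ["UNRESOLVED", "HIGH_SR_BACKGROUND", "UNRESOLVED"]
cleaned = unique_filters(filters)
assert cleaned == ["UNRESOLVED", "HIGH_SR_BACKGROUND"]
```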

The new implementation was tested on the 1KGP reference panel and produced identical output through CleanVcf, with the exception of one record reflecting a rare edge case that previously resulted in a false positive large CPX CNV (relating to an INS record with END<<POS).