PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License

Avoid holding alignment file handles open. #173

Closed: timothymillar closed this issue 5 months ago

timothymillar commented 6 months ago

This is essentially a bug introduced in v0.9.0 (#169), where alignment file handles are kept open between loci to minimize the CPU overhead of opening CRAM files (which involves validating contigs). Holding the files open substantially improved speed, but it was a bit shortsighted: we don't want the number of samples in an analysis to be limited by the number of file handles available on a system.
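As a rough illustration of the file-handle constraint, the sketch below compares a number of alignment files against the per-process open-file limit. It assumes a Unix-like system (the standard-library resource module is not available on Windows). The function name fits_in_fd_limit and the reserved margin of 16 are hypothetical and not part of MCHap.

```python
import resource


def fits_in_fd_limit(n_files, soft_limit, reserved=16):
    """Return True if n_files handles could be held open at once.

    `reserved` is an arbitrary margin (an assumption) for stdin/stdout,
    the reference genome, and similar. Index files opened alongside each
    alignment file are not counted, so this is an optimistic estimate.
    """
    if soft_limit == resource.RLIM_INFINITY:
        # No limit is configured for this process
        return True
    return n_files + reserved <= soft_limit


# Soft and hard limits on open file descriptors for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
print("5000 open alignment files fit:", fits_in_fd_limit(5000, soft))
```

On many Linux systems the default soft limit is in the low thousands, so holding one handle per sample can fail once an analysis reaches 1000s of samples.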

This also introduced a massive memory overhead, particularly in the find-snvs tool. find-snvs uses pysam's pileup (AlignmentFile.pileup), which apparently can hold quite a bit of state in memory until the handle is closed (probably in the C code, so it is not cleared by the Python GC). In one example this appears to be 7-15 MB per CRAM file, which quickly gets into the GBs when running over 1000s of samples (roughly 7-15 GB per 1000 samples held open).

The sweet spot here seems to be closing the alignment files ASAP and reopening them when needed, while always passing an explicit reference genome path. The explicit path seems to avoid some of the CRAM overhead (presumably this overhead comes from validating the linked reference path stored in the CRAM header).
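A minimal sketch of that open-late, close-early pattern follows. It is not the code from the actual fix. The function names (open_with_pysam, read_counts_by_locus) and the opener parameter are hypothetical, and the sketch assumes pysam auto-detects the file format when no mode is given. The pysam calls used are AlignmentFile with its reference_filename argument, and AlignmentFile.fetch.

```python
def open_with_pysam(path, reference_path):
    """Open an alignment file with an explicit reference genome path."""
    # Imported lazily so the sketch can be loaded without pysam installed
    import pysam

    return pysam.AlignmentFile(path, reference_filename=reference_path)


def read_counts_by_locus(alignment_paths, loci, reference_path,
                         opener=open_with_pysam):
    """Count reads overlapping each locus, one open handle at a time.

    `loci` is an iterable of (contig, start, stop) tuples. Each file is
    reopened for every locus and closed as soon as its reads are consumed,
    so at most one alignment handle is open at any moment.
    """
    counts = {}
    for contig, start, stop in loci:
        for path in alignment_paths:
            # The context manager closes the handle on exit, which also
            # releases any state held by the underlying C library
            with opener(path, reference_path) as handle:
                n_reads = sum(1 for _ in handle.fetch(contig, start, stop))
            counts[(path, contig, start, stop)] = n_reads
    return counts
```

This trades repeated open calls for a bounded number of open handles and bounded memory. The opener parameter only exists so the pattern can be exercised with a stand-in object.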

timothymillar commented 5 months ago

Fixed in #175