ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

Avoid reading all genome's sequences when possible #257

glennhickey closed 2 years ago

glennhickey commented 2 years ago

hal2maf's memory woes, at least on the big Zoonomia alignment, came from an unexpected place: whenever Hdf5Genome::getSequenceBySite() is called (which happens every time a column iterator touches a genome), it accesses _sequencePosCache. If the cache is empty, it is first loaded into memory.

The problem is that the whole alignment contains so many sequences (hundreds of millions) that, even though each cache entry is just a few pointers, the resulting indexes cost tens of gigabytes, which limits the number of hal2maf processes that can run in parallel. Since the column iterator can touch every genome pretty much right away, the problem shows up immediately.
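To make the behaviour described above concrete, here is a minimal sketch of a lazily built position index. It is not HAL's code: ToyGenome, SeqRecord, makeToyGenome() and cachedEntries() are hypothetical names invented for illustration, and the real Hdf5Genome cache may be organised differently. The sketch only shows why a single lookup by site can end up indexing every sequence in a genome.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one sequence (contig) record of a genome.
struct SeqRecord {
    std::string name;
    std::uint64_t start;   // first genome-wide position covered by the record
    std::uint64_t length;  // number of bases in the record
};

// Toy genome with a lazily built position index (an assumption, not HAL's class).
class ToyGenome {
public:
    explicit ToyGenome(std::vector<SeqRecord> records)
        : _records(std::move(records)) {}

    // Returns the record containing `site`, or nullptr if no record covers it.
    const SeqRecord* getSequenceBySite(std::uint64_t site) {
        if (_posCache.empty()) {
            loadPosCache();  // the first lookup pays for indexing every record
        }
        auto it = _posCache.upper_bound(site);  // first entry starting after site
        if (it == _posCache.begin()) {
            return nullptr;
        }
        --it;  // last entry starting at or before site
        const SeqRecord& rec = _records[it->second];
        return (site < rec.start + rec.length) ? &rec : nullptr;
    }

    // Number of entries currently held by the index.
    std::size_t cachedEntries() const { return _posCache.size(); }

private:
    // Builds one index entry per record, regardless of which site was requested.
    void loadPosCache() {
        for (std::size_t i = 0; i < _records.size(); ++i) {
            assert(_records[i].length > 0);
            _posCache.emplace(_records[i].start, i);
        }
    }

    std::vector<SeqRecord> _records;                    // stands in for on-disk data
    std::map<std::uint64_t, std::size_t> _posCache;     // start position -> record index
};

// Builds a toy genome of n back-to-back contigs, 10 bases each.
inline ToyGenome makeToyGenome(std::size_t n) {
    std::vector<SeqRecord> records;
    records.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        records.push_back({"contig" + std::to_string(i), i * 10, 10});
    }
    return ToyGenome(std::move(records));
}
```

In this toy model, calling getSequenceBySite() once on a genome built with makeToyGenome(1000) leaves cachedEntries() at 1000, even though only one contig was needed. Scaled up to every genome in a large alignment, that is the pattern the comment above describes.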

The good news is that hal2maf only ever touches this cache via getSequenceBySite() (as opposed to other methods that may rely on the name cache). So we can help things substantially with the following heuristic:

On this command

/usr/bin/time -v ./hal2maf  241-mammalian-2020v2.0.hal  test.maf  --refGenome Homo_sapiens --refSequence chr20 --start 50000000 --length 100000 --noDupes --noAncestors 2> log

these changes reduce memory consumption from 18.5 G to 0.76 G.

Apparently, loading all these tiny contigs also had an impact on running time, which dropped from 23 minutes to 3 minutes.