hal2maf's memory woes, at least on the big Zoonomia alignment, came from an unexpected place: whenever Hdf5Genome::getSequenceBySite() is called (which is every time a column iterator touches a genome), it accesses _sequencePosCache. If the cache is empty, it is loaded into memory.
The problem is that there are so many sequences in the whole alignment (hundreds of millions) that even though each one is just a few pointers, the resulting indexes cost tens of gigabytes, limiting the number of hal2maf processes that can be run in parallel. Since the column iterator can touch every genome pretty much right away, this causes immediate problems.
The good news is that hal2maf only ever touches this cache via getSequenceBySite() (as opposed to other methods that may rely on the name cache). So we can help things substantially with the following heuristic:
if there are fewer than 1000 sequences (Hdf5Genome::maxPosCache), just fill the cache as before
otherwise, check the cache: if the sequence is there, great; if not, binary search the external array, add the sequence to the cache once it's found, and return it.
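
The heuristic above can be sketched roughly as follows. This is a minimal standalone illustration, not the actual HAL code: `SeqRecord`, `PosCacheSketch`, and `cacheSize()` are hypothetical names, the in-memory `std::vector` stands in for the external array, and the sketch assumes sequences are sorted by start, non-overlapping, and of nonzero length. Only `getSequenceBySite` and `maxPosCache` are names taken from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of the external sequence array.
struct SeqRecord {
    std::int64_t start;   // first genome position covered by the sequence
    std::int64_t length;  // number of bases (assumed > 0 in this sketch)
    std::string name;
};

class PosCacheSketch {
public:
    PosCacheSketch(std::vector<SeqRecord> records, std::size_t maxPosCache = 1000)
        : _records(std::move(records)), _maxPosCache(maxPosCache) {
        // Assumption: the array is sorted by start position.
        assert(std::is_sorted(_records.begin(), _records.end(),
                              [](const SeqRecord& a, const SeqRecord& b) {
                                  return a.start < b.start;
                              }));
    }

    // Returns the sequence containing pos, or nullptr if there is none.
    const SeqRecord* getSequenceBySite(std::int64_t pos) {
        const bool small = _records.size() < _maxPosCache;
        if (small && _cache.empty()) {
            // Few sequences: fill the whole cache up front, as before.
            for (std::size_t i = 0; i < _records.size(); ++i) {
                _cache[_records[i].start + _records[i].length] = i;
            }
        }
        // The cache is keyed by exclusive end position, so upper_bound(pos)
        // finds the first cached sequence that ends after pos.
        auto hit = _cache.upper_bound(pos);
        if (hit != _cache.end() && pos >= _records[hit->second].start) {
            return &_records[hit->second];
        }
        if (small) {
            return nullptr;  // everything is cached, so a miss is final
        }
        // Many sequences: binary search the array, then remember the result.
        auto rec = std::upper_bound(
            _records.begin(), _records.end(), pos,
            [](std::int64_t p, const SeqRecord& r) { return p < r.start + r.length; });
        if (rec == _records.end() || pos < rec->start) {
            return nullptr;
        }
        _cache[rec->start + rec->length] =
            static_cast<std::size_t>(rec - _records.begin());
        return &*rec;
    }

    std::size_t cacheSize() const { return _cache.size(); }

private:
    std::vector<SeqRecord> _records;
    std::size_t _maxPosCache;
    std::map<std::int64_t, std::size_t> _cache;  // end position -> array index
};
```

With this shape, a genome with many tiny contigs only pays for the sequences that are actually visited, while small genomes keep the old fill-everything behaviour.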
On this command, these changes reduce memory consumption from 18.5 G to 0.76 G. Apparently loading all these tiny contigs also had an impact on running time, as that was reduced from 23m to 3m.