Avoid reading all genome's sequences when possible

hal2maf's memory woes, at least on the big Zoonomia alignment, came from an unexpected place: whenever Hdf5Genome::getSequenceBySite() is called (which is every time a column iterator touches a genome), it accesses _sequencePosCache. If the cache is empty, it is loaded into memory with an entry for every sequence in that genome.

The problem is that there are so many sequences in the whole alignment (hundreds of millions) that, even though each one is just a few pointers, the resulting indexes cost tens of gigabytes, which limits the number of hal2maf processes that can run in parallel. Since the column iterator can touch every genome pretty much right away, this causes immediate problems.

The good news is that hal2maf only ever touches this cache via getSequenceBySite() (as opposed to other methods that may rely on the name cache). So we can help things substantially with the following heuristic:

if there are fewer than 1000 sequences (Hdf5Genome::maxPosCache), just fill the cache as before
otherwise, check the cache: if the sequence is there, great; if not, binary search the external array, add the sequence to the cache once it's found, and return it.
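
The heuristic above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `SeqRecord`, `LazySequenceIndex`, and `cachedCount()` are hypothetical names, a sorted `std::vector` stands in for the external HDF5 array, and sequences are assumed to be non-empty and non-overlapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one row of the external (on-disk) sequence array.
struct SeqRecord {
    std::uint64_t start;   // first genome coordinate covered by the sequence
    std::uint64_t length;  // number of bases (assumed > 0)
    std::string name;
};

// Hypothetical illustration of the heuristic; not the actual Hdf5Genome class.
class LazySequenceIndex {
public:
    LazySequenceIndex(std::vector<SeqRecord> external, std::size_t maxPosCache)
        : _external(std::move(external)), _maxPosCache(maxPosCache) {
        // The binary search below requires records sorted by start position.
        const bool sorted = std::is_sorted(
            _external.begin(), _external.end(),
            [](const SeqRecord& a, const SeqRecord& b) { return a.start < b.start; });
        assert(sorted);
        (void)sorted;
    }

    // Returns the sequence containing `site`, or nullptr if no sequence does.
    const SeqRecord* getSequenceBySite(std::uint64_t site) {
        const bool small = _external.size() < _maxPosCache;
        if (small && !_filled) {
            // Small genome: fill the whole cache once, as before.
            for (std::size_t i = 0; i < _external.size(); ++i) {
                _cache.emplace(_external[i].start, i);
            }
            _filled = true;
        }
        if (const SeqRecord* hit = lookupCache(site)) {
            return hit;
        }
        if (small) {
            return nullptr;  // cache is complete, so a miss is final
        }
        // Large genome: binary search the external array for the last
        // record whose start is <= site.
        auto it = std::upper_bound(
            _external.begin(), _external.end(), site,
            [](std::uint64_t s, const SeqRecord& r) { return s < r.start; });
        if (it == _external.begin()) {
            return nullptr;
        }
        --it;
        if (site >= it->start + it->length) {
            return nullptr;
        }
        // Remember only the sequence that was actually requested.
        _cache.emplace(it->start, static_cast<std::size_t>(it - _external.begin()));
        return &*it;
    }

    std::size_t cachedCount() const { return _cache.size(); }

private:
    const SeqRecord* lookupCache(std::uint64_t site) const {
        auto it = _cache.upper_bound(site);  // first cached start > site
        if (it == _cache.begin()) {
            return nullptr;
        }
        --it;
        const SeqRecord& rec = _external[it->second];
        return site < rec.start + rec.length ? &rec : nullptr;
    }

    std::vector<SeqRecord> _external;              // stand-in for the HDF5 array
    std::size_t _maxPosCache;                      // threshold for eager loading
    std::map<std::uint64_t, std::size_t> _cache;   // start position -> array index
    bool _filled = false;
};
```

With this shape, a genome with many sequences only ever caches the sequences a query actually lands on, so a region-restricted hal2maf run would keep a handful of entries per genome rather than one per contig.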

On this command:

/usr/bin/time -v ./hal2maf  241-mammalian-2020v2.0.hal  test.maf  --refGenome Homo_sapiens --refSequence chr20 --start 50000000 --length 100000 --noDupes --noAncestors 2> log

these changes reduce memory consumption from 18.5 G to 0.76 G.

Loading all these tiny contigs apparently also had an impact on running time, which dropped from 23 minutes to 3 minutes.

ComparativeGenomicsToolkit / hal

Avoid reading all genome's sequences when possible #257