ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format

Adjust default HDF5 caching behaviour #256

Closed glennhickey closed 2 years ago

glennhickey commented 2 years ago

For anything but huge files, --inMemory is usually the only sensible option.

But... today I learned that for huge files (200+ genomes including ancestors) HAL quietly disables caching altogether, which renders them virtually unusable.

So this PR stops disabling the cache, and also substantially decreases the default cache sizes so that it blows up less badly on giant files.

What's funny is that the comment about disabling the cache goes ... (well over 17GB for just running halStats on a 250-genome alignment) ...

and with the cache disabled, hal2maf regularly uses exactly 17G on just such files. This goes up to 17.5G with the new caching parameters enabled.

So there remains a bit of a mystery here: where is the 17G actually coming from?

glennhickey commented 2 years ago

Huh, according to valgrind, most of the memory used in hal2maf is coming from Hdf5Genome's sequence cache.

Every time a genome gets touched by a column iterator, it does a sequence-by-position lookup, which triggers filling the entire sequence cache for that genome. A single column can touch pretty much every genome in the alignment, and that adds up to more than 100,000,000 sequences (contigs) worth in the big mammals alignment. What's worse, the Hdf5Sequence* pointers are treated as persistent by some client code, so I don't think they could be safely erased by an LRU strategy.

I think the current system isn't unreasonable in and of itself, but it becomes a bottleneck when trying to run many hal2maf processes on one big file. The alternative, indexing the sequences on disk with HDF5, is too big a refactor to consider now. But for cactus-hal2maf there may be a way to reduce memory by only filling bits of the sequence cache at a time.