ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Use smaller distance index for --hapl, and decouple from --giraffe #1424

Open glennhickey opened 5 days ago

glennhickey commented 5 days ago

Previously, you needed to run --hapl --giraffe clip to make the new haplotype subsampling index. This is because the .dist index generated by --giraffe is a requirement for making the .hapl index.

But making the .dist index of the clip graph in this way (as opposed to the filter graph) could take loads of time and memory. And, I just found out, vg haplotypes doesn't actually need a full distance index: it can get by on a top-level index constructed with vg index --snarl-limit 1.

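For the hprc-v1.1-mc-chm13 distance index, the savings are substantial: the top-level index takes about half the wall-clock time and less than a third of the peak memory of the default index.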

vg index hprc-v1.1-mc-chm13.xg -j top.dist --snarl-limit 1
        Elapsed (wall clock) time (h:mm:ss or m:ss): 57:22.61
        Maximum resident set size (kbytes): 79057736

vg index hprc-v1.1-mc-chm13.xg -j default.dist
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:50
        Maximum resident set size (kbytes): 258998352
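
To make the comparison concrete, here is a small sketch that turns the two timing reports above into ratios. The numbers are copied from the output above; the helper name `parse_elapsed` is made up for illustration, and the GiB conversion assumes the reported "kbytes" are units of 1024 bytes.

```python
def parse_elapsed(text):
    """Convert an elapsed string (h:mm:ss or m:ss[.ff]) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# Figures copied from the two runs above.
top = {"elapsed": "57:22.61", "max_rss_kb": 79057736}    # --snarl-limit 1
full = {"elapsed": "1:54:50", "max_rss_kb": 258998352}   # default index

time_ratio = parse_elapsed(top["elapsed"]) / parse_elapsed(full["elapsed"])
mem_ratio = top["max_rss_kb"] / full["max_rss_kb"]

KIB_PER_GIB = 1024 ** 2  # assumption: "kbytes" means KiB
print(f"wall clock: {time_ratio:.2f}x of the default index")
print(f"peak RSS:   {top['max_rss_kb'] / KIB_PER_GIB:.1f} GiB vs "
      f"{full['max_rss_kb'] / KIB_PER_GIB:.1f} GiB ({mem_ratio:.2f}x)")
```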

This PR allows you to run --hapl without --giraffe. In this case, only the top-level distance index is created; it is used to make the .hapl index and then thrown away. This removes a major memory bottleneck, especially on large, diverse graphs.
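
For readers who want to reproduce the idea by hand, the following is a rough sketch of the sequence described above, not the code in this PR. It only builds and prints the command lines; nothing is run. The `vg index` invocation is the one shown earlier. The `vg gbwt` and `vg haplotypes` steps and their flags (`-r`, `-Z`, `-d`, `-H`) are assumptions based on the vg haplotype-sampling documentation and should be checked against your vg version. The function name and the `.top.dist` file name are hypothetical.

```python
import shlex

def hapl_index_commands(gbz):
    """Return argv lists for a top-level .dist -> .hapl build (nothing is executed)."""
    base = gbz[:-len(".gbz")] if gbz.endswith(".gbz") else gbz
    dist = base + ".top.dist"  # hypothetical name for the temporary index
    ri = base + ".ri"
    hapl = base + ".hapl"
    return [
        # Top-level distance index only, as in the timing comparison above.
        ["vg", "index", gbz, "-j", dist, "--snarl-limit", "1"],
        # Assumption: vg haplotypes also needs an r-index built from the GBZ.
        ["vg", "gbwt", "-r", ri, "-Z", gbz],
        # Assumption: -d / -r / -H select the distance index, r-index, and output.
        ["vg", "haplotypes", "-d", dist, "-r", ri, "-H", hapl, gbz],
        # The top-level index is only an intermediate, so discard it afterwards.
        ["rm", "-f", dist],
    ]

for cmd in hapl_index_commands("graph.gbz"):
    print(shlex.join(cmd))
```

Returning the commands as lists rather than running them keeps the sketch safe to try on a machine without vg installed; each list can be passed to `subprocess.run` once the flags have been confirmed.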