ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Turn abpoa seeding off by default #1325

Closed glennhickey closed 7 months ago

glennhickey commented 7 months ago

abPOA has an option to speed up alignment by using a minimizer-based seeding strategy to find anchors. But in rare cases it can crash or, more worryingly, output a completely incorrect result -- which of the two happens seems to be system-dependent.

This PR turns it off by default in the configuration file. I ran two tests with Cactus v2.8.0, each with seeding on and off: chr10 from the chm13-based v1.1 HPRC pangenome, and Anc08 (mouse/rat + outgroups) from the Zoonomia 10-way test.

The cactus_consolidated running times are:

Test    Seeding   Cons time (s)   BAR time (s)   RAM (GB)
----------------------------------------------------------
Anc08   Yes       36,543          20,478         406
Anc08   No        45,153          26,025         406
chr10   Yes       14,413          12,871         88
chr10   No        13,588          12,109         87
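
To make the comparison concrete, here is a small sketch that subtracts the seeding-on times from the seeding-off times in the table above. The dictionary layout and the function name `seconds_saved_by_seeding` are made up for illustration; only the numbers come from the table.

```python
# Running times in seconds, copied from the table above.
# "cons" is the Cons time column and "bar" is the BAR time column.
TIMES = {
    "Anc08": {"on": {"cons": 36543, "bar": 20478},
              "off": {"cons": 45153, "bar": 26025}},
    "chr10": {"on": {"cons": 14413, "bar": 12871},
              "off": {"cons": 13588, "bar": 12109}},
}


def seconds_saved_by_seeding(test, stage):
    """Seeding-off time minus seeding-on time (positive: seeding was faster)."""
    return TIMES[test]["off"][stage] - TIMES[test]["on"][stage]


if __name__ == "__main__":
    for test in TIMES:
        for stage in ("cons", "bar"):
            saved = seconds_saved_by_seeding(test, stage)
            print(f"{test} {stage}: {saved:+d} s ({saved / 60:+.0f} min)")
```

For Anc08 the differences are 8,610 s for the Cons time column and 5,547 s for the BAR column. For chr10 both differences are negative (-825 s and -762 s), meaning the run with seeding on was the slower one.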

Coverage stats were unaffected for chr10, but for Anc08 they are a bit different -- turning seeding off increases the coverage by 650kb (though rat self coverage goes down).

Seeding On
Rat, 2870182909, 190817798, 23945244, 6914529, 2285777, 892402, 371914, 76304, 0
Mouse, 1784934899, 13658429, 3662436, 1037931, 392764, 144415, 64406, 32223, 7431
Anc08, 2, 1849546954, 4210, 0, 106111161

Seeding Off
Rat, 2870182909, 187215704, 22936994, 6522026, 2267026, 928386, 433536, 90816, 0
Mouse, 1785594575, 13866817, 3701876, 1080785, 408748, 151560, 65168, 29518, 5820
Anc08, 2, 1845657012, 4133, 0, 106126779
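
The two coverage listings can be compared column by column with a short script. This is only a sketch: the post does not say what each column means, so the code does a purely positional subtraction (seeding off minus seeding on), and the helper names `parse_rows` and `column_deltas` are made up for illustration.

```python
# Coverage rows copied verbatim from the two listings above.
SEEDING_ON = """\
Rat, 2870182909, 190817798, 23945244, 6914529, 2285777, 892402, 371914, 76304, 0
Mouse, 1784934899, 13658429, 3662436, 1037931, 392764, 144415, 64406, 32223, 7431
Anc08, 2, 1849546954, 4210, 0, 106111161
"""

SEEDING_OFF = """\
Rat, 2870182909, 187215704, 22936994, 6522026, 2267026, 928386, 433536, 90816, 0
Mouse, 1785594575, 13866817, 3701876, 1080785, 408748, 151560, 65168, 29518, 5820
Anc08, 2, 1845657012, 4133, 0, 106126779
"""


def parse_rows(block):
    """Map each row label to its list of integer columns."""
    rows = {}
    for line in block.strip().splitlines():
        name, *values = [field.strip() for field in line.split(",")]
        rows[name] = [int(v) for v in values]
    return rows


def column_deltas(on_block, off_block):
    """Per-column difference (seeding off minus seeding on) for each row."""
    on, off = parse_rows(on_block), parse_rows(off_block)
    return {name: [b - a for a, b in zip(on[name], off[name])] for name in on}


if __name__ == "__main__":
    for name, deltas in column_deltas(SEEDING_ON, SEEDING_OFF).items():
        print(name, deltas)
```

Under this positional reading, the first Mouse column grows by 659,676 with seeding off, while the second Rat column shrinks by 3,602,094.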

Saving 100 minutes on the mouse/rat alignment doesn't seem worth keeping seeding on. I think I'd initially enabled it to keep cloud costs down on large pangenome alignments even if accuracy was a bit lower, but on the chr10 test seeding only slows things down.