ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
526 stars 111 forks source link

[experimental] option to estimate alignment score parameters from input pangenome data #1511

Closed glennhickey closed 1 week ago

glennhickey commented 3 weeks ago

All multiple alignment done by cactus (in BAR) use the scoring matrix from the config. This is currently set to

partialOrderAlignmentSubMatrix="91 -114 -61 -123 -100 -114 100 -125 -61 -100 -61 -125 100 -114 -100 -123 -61 -114 91 -100 -100 -100 -100 -100 100"
partialOrderAlignmentGapOpenPenalty1="400"
partialOrderAlignmentGapExtensionPenalty1="30"          
partialOrderAlignmentGapOpenPenalty2="1200"

These are derived from the defaults from lastz which cites it has the "HOXD70" matrix from Chiaromonte F, Yap VB, Miller W (2002). Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing 7:115-126.

But this matrix is very forgiving for substitutions, as you might expect for one designed to align very diverged genomes. For example, you can have a net-positive score by alternating between matches and transversions.

The upshot is that we can see cactus aligning right through all sorts of crazy regions, producing extremely messy graphs, wrong-looking graphs to the detriment of downstream aligning.

So this PR adds the --lastTrain option to cactus-pangenome (and cactus-minigraph). This option will use last-train from the last software package to estimate scores from the data (using the reference genome and most diverged genome to it from the input). These scores will be saved in last's .train format, and then imported into the cactus config before alignment. (cactus-align gets a --scoresFile option to take in the file too).

Since last only trains one gap model and abPOA wants two (and seems to crash if you don't give it one), a scaling factor (defaulting to 3 in the config) is used to create a long gap model that's more expensive to open but cheaper to extend.

The models learned using some test data are radically different than the HOX70 matrix, especially in that the penalize mismatches much more, which is evidence that the HOX70 isn't appropriate for most pangenome data.

I'd like to eventually incorporate something like this in progressive cactus but for now it's only for pangenomes. And it needs some more testing before it can be merged...