These are derived from the defaults from lastz which cites it has the "HOXD70" matrix from Chiaromonte F, Yap VB, Miller W (2002). Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing 7:115-126.
But this matrix is very forgiving for substitutions, as you might expect for one designed to align very diverged genomes. For example, you can have a net-positive score by alternating between matches and transversions.
The upshot is that we can see cactus aligning right through all sorts of crazy regions, producing extremely messy graphs, wrong-looking graphs to the detriment of downstream aligning.
So this PR adds the --lastTrain option to cactus-pangenome (and cactus-minigraph). This option will use last-train from the last software package to estimate scores from the data (using the reference genome and most diverged genome to it from the input). These scores will be saved in last's .train format, and then imported into the cactus config before alignment. (cactus-align gets a --scoresFile option to take in the file too).
Since last only trains one gap model and abPOA wants two (and seems to crash if you don't give it one), a scaling factor (defaulting to 3 in the config) is used to create a long gap model that's more expensive to open but cheaper to extend.
The models learned using some test data are radically different than the HOX70 matrix, especially in that the penalize mismatches much more, which is evidence that the HOX70 isn't appropriate for most pangenome data.
I'd like to eventually incorporate something like this in progressive cactus but for now it's only for pangenomes. And it needs some more testing before it can be merged...
All multiple alignment done by cactus (in BAR) use the scoring matrix from the config. This is currently set to
These are derived from the defaults from lastz which cites it has the "HOXD70" matrix from
Chiaromonte F, Yap VB, Miller W (2002). Scoring pairwise genomic sequence alignments. Pacific Symposium on Biocomputing 7:115-126.
But this matrix is very forgiving for substitutions, as you might expect for one designed to align very diverged genomes. For example, you can have a net-positive score by alternating between matches and transversions.
The upshot is that we can see cactus aligning right through all sorts of crazy regions, producing extremely messy graphs, wrong-looking graphs to the detriment of downstream aligning.
So this PR adds the
--lastTrain
option tocactus-pangenome
(andcactus-minigraph
). This option will uselast-train
from the last software package to estimate scores from the data (using the reference genome and most diverged genome to it from the input). These scores will be saved in last's.train
format, and then imported into the cactus config before alignment. (cactus-align
gets a--scoresFile
option to take in the file too).Since
last
only trains one gap model andabPOA
wants two (and seems to crash if you don't give it one), a scaling factor (defaulting to 3 in the config) is used to create a long gap model that's more expensive to open but cheaper to extend.The models learned using some test data are radically different than the HOX70 matrix, especially in that the penalize mismatches much more, which is evidence that the HOX70 isn't appropriate for most pangenome data.
I'd like to eventually incorporate something like this in progressive cactus but for now it's only for pangenomes. And it needs some more testing before it can be merged...