Open sivico26 opened 1 year ago
Cactus is aligning multiple tandem repeats together as paralogs. This touches on two issues:
1) If multiple copies are in every genome, then Cactus should leave them apart. It tries to do this already using outgroup information, but it is not perfect. This is something we are aware of and are working on (but it's tricky in the general case).
2) Even if tandem repeats are collapsed, MAF export should have a way to sensibly chain them if possible. The present logic is lacking in this department. I fixed a very similar issue here for the HAL browser. Something similar in `hal2maf` or `taffy` would be useful.
Things you can try now, as I unfortunately can't give a timeline for fixing the above issues:
- use `--dupeMode single` or at the very least `--filterGapCausingDupes` with `hal2maf`. The latter will prune some problem paralogs, leading to much cleaner alignments. The former will guarantee you at most one copy per line.
- try using `--root` to specify a subtree for `cactus-hal2maf` with the hope that an outgroup will help keep paralogs apart.
- try `halSynteny` and work with the alignment chains (probably far from ideal).
- try increasing the chain size for Cactus, for example by setting `deannealingRounds="2 32 512 2048"` in the config.xml. This is finicky though, and you may trade substantial coverage for more contiguous alignments.
Thank you for the suggestions Glenn, I have a couple of questions.
The statistics I reported above are already using `--filterGapCausingDupes`. My initial thinking was to use `--dupeMode all` to get all the paralogs and decide how to handle them downstream (use some criterion to tell the orthologs from the paralogs, or simply not use those regions, etc.). How confident can I be in `cactus`'s ability to sort out the orthologs when using `--dupeMode single`? Am I being too cautious here? Please remember that my purpose is to do some phylogenomic inference.
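For the "decide downstream" route with `--dupeMode all`, here is a minimal Python sketch of the simplest criterion mentioned above: keep only the blocks in which every genome appears exactly once. It assumes the MAF `src` field is written as `genome.contig` (as in `ccar.contig_11`); the function names are hypothetical and not part of Cactus or taffy.

```python
from collections import Counter
from typing import Iterator, List


def iter_maf_blocks(maf_text: str) -> Iterator[List[str]]:
    """Yield each alignment block as the list of its 's' (sequence) lines."""
    rows: List[str] = []
    for line in maf_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # An 'a' line opens a new block; emit the previous one, if any.
            if rows:
                yield rows
            rows = []
        elif fields[0] == "s":
            rows.append(line)
    if rows:
        yield rows


def genome_copy_counts(rows: List[str]) -> Counter:
    """Count rows per genome in one block.

    Assumption: the src field is 'genome.contig', so the genome name is
    everything before the first dot.
    """
    return Counter(row.split()[1].split(".", 1)[0] for row in rows)


def single_copy_blocks(maf_text: str) -> Iterator[List[str]]:
    """Yield only the blocks where no genome contributes more than one row."""
    for rows in iter_maf_blocks(maf_text):
        if max(genome_copy_counts(rows).values()) == 1:
            yield rows
```

Dropping every block with a repeated genome is conservative: it also discards blocks whose "duplicates" are really one fragmented locus, so the counts it gives are an upper bound on true paralogy.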
I will certainly try `--root` if that helps! Can you specify multiple outgroups? If you can, does the order matter? In my case both `iabu` and `cgla` are outgroups, but I think the assemblies for `cgla` are better, so I would prioritize it if that is possible. So, would `--root cgla,iabu` work?
While preparing the data to send you, I noticed that exporting the MAF files with `--raw` produces strikingly clean but severely fragmented alignments. I was tempted to export the clean alignments and implement a merger myself. Then I realized that I would be reinventing the wheel, since this is what `taffy` is doing and the problem is far from trivial, right? I would be better off playing with parameterizations than attempting a sketchy implementation. What steps does `taffy` (or more precisely `cactus-hal2maf`) follow in its postprocessing?
Finally, when you mention modifying the config.xml file, I found the following in my build:
cactus-bin-v2.6.4/src/cactus/cactus_progressive_config.xml
cactus-bin-v2.6.4/cactus_env/lib/python3.10/site-packages/cactus/cactus_progressive_config.xml
cactus-bin-v2.6.4/build/lib/cactus/cactus_progressive_config.xml
Which one should I modify? The one in `lib/` or the one under `cactus_env`? My money is on the latter, but I just want to make sure. Do I have to reinstall `cactus` for the change to take effect, or does rewriting the file suffice?
Thanks again for your time.
`--dupeMode single` should be fine. It takes the row that best aligns to the reference. It doesn't chain though, so you can get false rearrangements between blocks.
Export with `--raw`, then try different `taffy` invocations yourself. The ones `cactus-hal2maf` used should be in your log, and are also touched on in the taffy docs. As shown by your other issues, you need to avoid using `taffy add-gap-bases` and `taffy norm -d` together until the next release.
The one under `cactus_env`. I recommend copying it somewhere before editing, then passing in the edited copy with `--configFile modified-config.xml`.
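The copy-then-edit step can be scripted. The following is a minimal sketch in Python, not an official Cactus utility: it assumes only that `deannealingRounds` is an XML attribute somewhere in the config (it does not assume which element carries it), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def write_modified_config(src_path: str, dst_path: str, rounds: str) -> int:
    """Write a copy of a Cactus config with a new deannealingRounds value.

    The original file at src_path is left untouched. Returns the number of
    elements whose attribute was updated.
    """
    # insert_comments=True keeps XML comments, which ElementTree would
    # otherwise drop when parsing (requires Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    tree = ET.parse(src_path, parser=parser)

    updated = 0
    for elem in tree.iter():
        # Assumption: any element carrying this attribute should be updated.
        if "deannealingRounds" in elem.attrib:
            elem.set("deannealingRounds", rounds)
            updated += 1
    if updated == 0:
        raise ValueError(f"no deannealingRounds attribute found in {src_path}")

    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return updated
```

For example, `write_modified_config("cactus_progressive_config.xml", "modified-config.xml", "2 32 512 2048")` would produce the file to pass with `--configFile modified-config.xml`. It is worth diffing the two files afterwards to confirm that nothing else changed.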
Super! Thanks for the recommendations!
Hi again,
As promised, continuing with the discussion started in #1201,
`cactus` (or `taffy`?) sometimes breaks some sequences within the same alignment block for no apparent reason, making it look like the affected taxa have several paralogs/duplicates. A closer inspection reveals that it is the same locus, only fragmented. Take, for instance, the following alignment:
Here, `ccar` and `cdan` have multiple duplicates, but if you look closely, some of these sequences are contiguous. Take, for instance, `ccar.contig_11`: on the third row, if you sum its start position and its size, you get the start position of the previous row (49710 + 297 = 50007). The same applies to that row: 50007 + 88 = 50095. They are also on the same strand. Finally, if you look at the alignment block, they follow one another without overlapping. In short, they are contiguous.
So, correcting these "broken" sequences is what I referred to as intra-merging in the previous issue. After running intra-merging, this alignment block looks like this:
After doing this, we went from what seemed to be 4 paralogs for `ccar` and 3 for `cdan` to the real 2 for each in this alignment block. The alignment block is also more coherent and homogeneous now. This is what I meant when I said that `cactus`/`taffy` should not be doing this. What exactly is going wrong, I am not sure; I can only imagine some heuristics are failing.
Finally, this issue is quite pervasive in my data. To give you an idea, take the same loci I discussed in the previous issue. My raw alignments go from:
Before intra-merging, to:
So I reduce the share of alignments with duplicates from more than 1/3 to ~1/5 just by intra-merging. To be fair, I am running a slightly more sophisticated version, where I also introduce gaps if the sequence positions do not match exactly but are close enough (say, within 10 bp), so these numbers are rather a ceiling. Still, I can assure you that in most cases I do not have to introduce gaps because the matches are exact, so the numbers are not far off either.
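To make the rule concrete, here is a minimal Python sketch of intra-merging as described above. It is an illustration under stated assumptions, not the script actually used for these numbers: two rows of a block are merged when they share source and strand, the second starts where the first ends (or within `max_gap` bases of it), and their aligned columns do not overlap. `MafRow`, `can_merge`, `merge_pair` and `intra_merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MafRow:
    src: str     # e.g. "ccar.contig_11"
    start: int   # start coordinate on the given strand
    size: int    # number of sequence bases covered by the row
    strand: str  # "+" or "-"
    text: str    # aligned text; "-" marks a gap column


def _column_span(row: MafRow) -> Tuple[int, int]:
    """Indices of the first and last non-gap columns (row must not be all gaps)."""
    cols = [i for i, ch in enumerate(row.text) if ch != "-"]
    return cols[0], cols[-1]


def can_merge(a: MafRow, b: MafRow, max_gap: int = 0) -> bool:
    """True if b continues a: same source and strand, b starts at most
    max_gap bases after a ends, and a's columns all precede b's."""
    if a.src != b.src or a.strand != b.strand or a.size == 0 or b.size == 0:
        return False
    jump = b.start - (a.start + a.size)
    if not 0 <= jump <= max_gap:
        return False
    return _column_span(a)[1] < _column_span(b)[0]


def merge_pair(a: MafRow, b: MafRow) -> MafRow:
    """Combine two mergeable rows column by column."""
    text = "".join(x if x != "-" else y for x, y in zip(a.text, b.text))
    # The size spans from a's start to b's end. If bases were skipped
    # (jump > 0) they are absent from `text` and would have to be inserted
    # separately for the row to stay valid MAF.
    return MafRow(a.src, a.start, b.start + b.size - a.start, a.strand, text)


def intra_merge(rows: List[MafRow], max_gap: int = 0) -> List[MafRow]:
    """Merge contiguous fragments within one block.

    Rows are sorted by (src, strand, start), so the original row order of
    the block is not preserved.
    """
    merged: List[MafRow] = []
    for row in sorted(rows, key=lambda r: (r.src, r.strand, r.start)):
        if merged and can_merge(merged[-1], row, max_gap):
            merged[-1] = merge_pair(merged[-1], row)
        else:
            merged.append(row)
    return merged
```

With the coordinates quoted earlier, the `ccar.contig_11` rows starting at 49710 (size 297) and 50007 (size 88) merge into a single row starting at 49710 with size 385, which ends exactly at 50095, where the next row begins. Calling `intra_merge(rows, max_gap=10)` corresponds to the more permissive variant.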
Anyway, I hope this gives a better idea of what I meant. Cheers
Originally posted by @sivico26 in https://github.com/ComparativeGenomicsToolkit/cactus/issues/1201#issuecomment-1779156228