ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

cactus breaks contiguous `s` lines in MAF creating artificial duplicates #1202

Open sivico26 opened 10 months ago

sivico26 commented 10 months ago

Hi again,

As promised, continuing the discussion started in #1201: cactus (or taffy?) sometimes splits a sequence into several rows within the same alignment block for no apparent reason, making it look like the affected taxa have several paralogs/duplicates. A closer inspection reveals that it is the same locus, just fragmented. Take, for instance, the following alignment:

a 
s ref.Cexcelsa_scaf_6:24819913-24982272                201   533 +          162359 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTGAATGAAAA
s caes.contig_2                                       5513   533 +           37599 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cang.contig_2                                      11207   531 +          155896 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGTGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTG--TTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s ccar.contig_11                                     50095   145 -           59783 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TGAGGTATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACACATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s ccar.contig_11                                     50007    88 -           59783 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCG-----------------------------------------------------------------------------------------------------------------------------------------------------------------
s ccar.contig_11                                     49710   297 -           59783 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACCAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s ccar.contig_11                                     15097   526 +           59783 ACCTTAGCTTTAGATTCAATAAAGTCATCAAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCATAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTCAATCTATATATTATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTATGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAAAAAGAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTTGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATA-------------CGATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTT-GCAGAGGTAAATGAAAA
s cdan.contig_3                                       5258   525 +           10084 ACCTTAGCTTTAGATTCAATAAAGTCATCAAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCATAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTCAATCTATATATTATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTATGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAAAA-GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATA-------------CGATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTT-GCAGAGGTAAATGAAAA
s cdan.contig_2                                       8699   236 -          120241 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cdan.contig_2                                       8401   298 -          120241 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATGAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAA---GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s cexc.contig_9                                       6541   533 -           42475 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGTGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cgla.contig_3                                       7673   531 +           84094 ACCTTAGCTTTAGATTCAATAAAGTCATCCAAACTCAGCGGCCGCAGGCCACAGGACGCATCCGCTCTTGTACCCTATAGAGACGAACAAGAAAAAA--GCATAAGGGCAAAAAAGCCGTGTAAGCGACAAAGTCT-ATGTCCT-A-AATTCAATTTATTTACT---GTACCTTTTGTTCTTCCTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAA-TACAAAA--CCCACAAAAAAA----GAATTAAAGATGAAGAA--AA-GTACATACAAGATGGGATC---G-GAAAACGAATACTAACCTTGAGATCGCTACCGGAATAACCTTCGGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCTGTTTCCAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCGACATATATCCTA--GAGTTCACAGGTAATAAACATCCTTTCAGTAAAGCGAGGAGTCAC------TGCTAAAA-CAGACAA-TCAATCTTCACAGACGCGAATGAAAA
s cgro6.contig_1                                      9851   533 -           68811 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACTGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cgro7.contig_1                                     10812   533 -           35331 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACTGGATTCGAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s coff.contig_8                                       7688   532 -           81550 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATAAAAAAACCCCGCAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGT-TTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cpol.contig_17                                       876   533 -           42026 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTA---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cpyr.contig_1                                       8691   533 -           69638 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTGAATGAAAA
s ctat.contig_1                                      11288   533 +           66057 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCAGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCGAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s iabu.contig_2                                       4456   547 -           14315 ACCTTAGCTTTAGATTCAATGAAGTCATCCAAACTCAGCGGCCGCAGGCCACAAGACGCATCTGTTCTTGCACCCTAAAGAGACAAACAAGACAAAAAAGCATAAGCGCAAAAA-GCCGTGTAAGCGGCAAAGTCT-ATGTCCTTAAAATTCAATCTATATAATATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAAATTCTGCAAATAAAAAA--CCCACAAAAAAG----AAATTATAGATGAAGAATAAAAGTACATACAAGATGTGCTTTGAAAGAAAACGAATATTAACCTTGAGATCGCTACCGGAATAGCCTTCGGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCAAGATTCTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGCAAGTCCACATATATACTATAGAGTTCACAGATAATAA-CATTCTTTTAATGCGACGAGGAGTCAC------TGTTAAAAACAGACAAAACAATCTTCACA-AGGCAGATGAGAA

Here, ccar and cdan appear to have multiple duplicates, but if you look closely, some of these rows are contiguous. Take, for instance, ccar.contig_11: for its third row, the start position plus the size gives the start position of the row above it (49710 + 297 = 50007). The same applies to that row: 50007 + 88 = 50095. They are also on the same strand. And finally, if you look at the alignment block, they follow one another without overlapping. In short, they are contiguous.
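The contiguity test described above can be sketched in Python. This is a minimal illustration, not the script used for the numbers in this issue: the `SLine` type and the `parse_s_line`, `column_span`, and `is_contiguous` helpers are hypothetical names, and the aligned text in the example is placeholder sequence that only reproduces the coordinates and sizes of the three ccar.contig_11 rows.

```python
from typing import NamedTuple, Tuple


class SLine(NamedTuple):
    """Fields of one MAF `s` line (hypothetical helper type)."""
    src: str
    start: int     # start on the given strand
    size: int      # number of non-gap bases in `text`
    strand: str    # "+" or "-"
    src_size: int
    text: str      # aligned text, gaps written as "-"


def parse_s_line(line: str) -> SLine:
    """Split a whitespace-delimited MAF `s` line into its seven fields."""
    _, src, start, size, strand, src_size, text = line.split()
    return SLine(src, int(start), int(size), strand, int(src_size), text)


def column_span(text: str) -> Tuple[int, int]:
    """First and last alignment column holding a base (assumes at least one)."""
    cols = [i for i, ch in enumerate(text) if ch != "-"]
    return cols[0], cols[-1]


def is_contiguous(a: SLine, b: SLine) -> bool:
    """True if `b` continues `a`: same sequence, same strand, b starts
    exactly where a ends, and a's columns all come before b's columns."""
    if (a.src, a.strand) != (b.src, b.strand):
        return False
    if a.start + a.size != b.start:
        return False
    return column_span(a.text)[1] < column_span(b.text)[0]


# Placeholder text: only the coordinates and sizes are taken from the block above.
a = SLine("ccar.contig_11", 49710, 297, "-", 59783, "A" * 297 + "-" * 233)
b = SLine("ccar.contig_11", 50007, 88, "-", 59783, "-" * 297 + "C" * 88 + "-" * 145)
c = SLine("ccar.contig_11", 50095, 145, "-", 59783, "-" * 385 + "G" * 145)

print(is_contiguous(a, b), is_contiguous(b, c), is_contiguous(a, c))
```

With these inputs the first two pairs pass the test and the last one does not, because the first and third rows are only connected through the middle row.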

Correcting these "broken" sequences is what I referred to as intra-merging in the previous issue. After running intra-merging, this alignment block looks like this:

a score=0.00
s ref.Cexcelsa_scaf_6:24819913-24982272                201   533 +          162359 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTGAATGAAAA
s caes.contig_2                                       5513   533 +           37599 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cang.contig_2                                      11207   531 +          155896 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGTGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTG--TTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s ccar.contig_11                                     15097   526 +           59783 ACCTTAGCTTTAGATTCAATAAAGTCATCAAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCATAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTCAATCTATATATTATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTATGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAAAAAGAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTTGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATA-------------CGATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTT-GCAGAGGTAAATGAAAA
s ccar.contig_11                                     49710   530 -           59783 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACCAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAA-CGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCG---TGAGGTATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACACATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cdan.contig_2                                       8401   534 -          120241 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATGAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAA---GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAA-CGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cdan.contig_3                                       5258   525 +           10084 ACCTTAGCTTTAGATTCAATAAAGTCATCAAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCATAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTCAATCTATATATTATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTATGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAAAAA-GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GAAAAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATA-------------CGATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTT-GCAGAGGTAAATGAAAA
s cexc.contig_9                                       6541   533 -           42475 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGTGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cgla.contig_3                                       7673   531 +           84094 ACCTTAGCTTTAGATTCAATAAAGTCATCCAAACTCAGCGGCCGCAGGCCACAGGACGCATCCGCTCTTGTACCCTATAGAGACGAACAAGAAAAAA--GCATAAGGGCAAAAAAGCCGTGTAAGCGACAAAGTCT-ATGTCCT-A-AATTCAATTTATTTACT---GTACCTTTTGTTCTTCCTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAA-TACAAAA--CCCACAAAAAAA----GAATTAAAGATGAAGAA--AA-GTACATACAAGATGGGATC---G-GAAAACGAATACTAACCTTGAGATCGCTACCGGAATAACCTTCGGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCTGTTTCCAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCGACATATATCCTA--GAGTTCACAGGTAATAAACATCCTTTCAGTAAAGCGAGGAGTCAC------TGCTAAAA-CAGACAA-TCAATCTTCACAGACGCGAATGAAAA
s cgro6.contig_1                                      9851   533 -           68811 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACTGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cgro7.contig_1                                     10812   533 -           35331 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACTGGATTCGAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s coff.contig_8                                       7688   532 -           81550 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATAAAAAAACCCCGCAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGT-TTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cpol.contig_17                                       876   533 -           42026 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTA---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s cpyr.contig_1                                       8691   533 -           69638 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCTCGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATGTCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCGGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCTAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTGAATGAAAA
s ctat.contig_1                                      11288   533 +           66057 ACCTTAGCTTTAGATTGAATAAAGTCATCCAAACTCAGCGGCCGAAGATCACGGGACACATCTGTTCTTGTACCCTAAAGAGACAAACAAGACAAAA--GCTTAAGGACATAAA-GCCGTGTAAGCGACAAAGTCTTATATCTT---GATTTAATCTATATACTAT-GTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAGATTCTGCAAATACAAAAACCCCACAAAAAAA----GAATTAAAGATGAAGAA--AAAGTACATACAAGATGTGCTC---GA-AAAACGAATACTAACCTTGAGATCGCTACCAGAATAGCCTTCTGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCGAGATTTTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGTAAGTCCACATATATCCTATTGAGTTCACATATAATTA-CAATTTTTCAATGAGATAAGGAGTCAAAGACTGTGTTGGAA-T-----------TTTTTGCAGAGGTAAATGAAAA
s iabu.contig_2                                       4456   547 -           14315 ACCTTAGCTTTAGATTCAATGAAGTCATCCAAACTCAGCGGCCGCAGGCCACAAGACGCATCTGTTCTTGCACCCTAAAGAGACAAACAAGACAAAAAAGCATAAGCGCAAAAA-GCCGTGTAAGCGGCAAAGTCT-ATGTCCTTAAAATTCAATCTATATAATATTGTACCTTTTGTTCTTCTTGTAACAGTTCTTGAACAGGTCTGTAAGCAGCAGCAATGCATAAATTCTGCAAATAAAAAA--CCCACAAAAAAG----AAATTATAGATGAAGAATAAAAGTACATACAAGATGTGCTTTGAAAGAAAACGAATATTAACCTTGAGATCGCTACCGGAATAGCCTTCGGTTTCCTTTGCAAGTTTCTCGAATTGGAAACCGGATTCAAGATTCTCCGGTGTGAGGAATATTTTCAGTATCTTTAACCGGTTTTCAGCATCCGGCAAGTCCACATATATACTATAGAGTTCACAGATAATAA-CATTCTTTTAATGCGACGAGGAGTCAC------TGTTAAAAACAGACAAAACAATCTTCACA-AGGCAGATGAGAA

After doing this, we went from what seemed to be 4 paralogs for ccar and 3 for cdan to the real 2 for each in this alignment block. Also, the alignment block is now more coherent and homogeneous. This is what I meant when I said that cactus / taffy should not be doing this. What exactly is going wrong, I am not sure; I can only imagine some heuristics are failing.
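For reference, the merge step itself can be sketched as follows. This is an assumption-laden illustration rather than the actual intra-merging script: `merge_fragments` is a hypothetical helper that takes `(start, size, text)` tuples already known to share a sequence and strand, and the example text is placeholder sequence carrying only the ccar.contig_11 coordinates.

```python
from typing import List, Tuple

Fragment = Tuple[int, int, str]  # (start, size, aligned text)


def merge_fragments(fragments: List[Fragment]) -> Fragment:
    """Merge rows of one sequence/strand into a single row.

    Raises ValueError if consecutive fragments are not exactly adjacent
    in sequence coordinates or if they place bases in the same column.
    """
    ordered = sorted(fragments)  # tuples sort by start first
    start, size, text = ordered[0]
    for nxt_start, nxt_size, nxt_text in ordered[1:]:
        if start + size != nxt_start:
            raise ValueError("fragments are not contiguous")
        if len(text) != len(nxt_text):
            raise ValueError("rows have different alignment lengths")
        if any(x != "-" and y != "-" for x, y in zip(text, nxt_text)):
            raise ValueError("fragments overlap in the alignment")
        # Column by column, keep whichever row has a base there.
        text = "".join(x if x != "-" else y for x, y in zip(text, nxt_text))
        size += nxt_size
    return start, size, text


rows = [
    (50095, 145, "-" * 385 + "G" * 145),
    (50007, 88, "-" * 297 + "C" * 88 + "-" * 145),
    (49710, 297, "A" * 297 + "-" * 233),
]
start, size, text = merge_fragments(rows)
print(start, size)
```

The merged row keeps the smallest start and sums the sizes (297 + 88 + 145 = 530), which matches the merged ccar.contig_11 row shown above.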

Finally, this issue is quite pervasive in my data. To give you an idea, for the same loci I discussed in the previous issue, my raw alignments go from:

Good alignments: 10907 (0.6222)
Dups alignments: 6622 (0.3778)
Total alignments: 17529

before intra-merging, to:

Good alignments: 13915 (0.7938)
Dups alignments: 3614 (0.2062)
Total alignments: 17529

So I reduce the alignments with duplicates from more than 1/3 to ~1/5 just by doing intra-merging. To be fair, I am running a slightly more sophisticated version, where I also introduce gaps if the sequence positions do not match exactly but are close enough (say, within 10 bp), so these numbers are rather a ceiling. Still, I can assure you that in most cases I do not have to introduce gaps because the matches are exact, so the numbers are not far off either.
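The relaxed criterion can be sketched like this. Only the acceptance test is shown, since how the skipped bases are represented in the merged row is not spelled out here; `is_mergeable` and its `max_gap` parameter are hypothetical names, with 10 bp taken from the figure mentioned above.

```python
def is_mergeable(a_start: int, a_size: int, b_start: int, max_gap: int = 10) -> bool:
    """True if row b starts at, or at most `max_gap` bases after, the end of row a.

    Both rows are assumed to be on the same sequence and strand.
    A negative distance means the rows overlap and is rejected.
    """
    distance = b_start - (a_start + a_size)
    return 0 <= distance <= max_gap


print(is_mergeable(49710, 297, 50007))  # exact match, distance 0
print(is_mergeable(49710, 297, 50012))  # 5 bases skipped
print(is_mergeable(49710, 297, 50095))  # 88 bases skipped
```

Setting `max_gap=0` gives the strict version, where only exact matches are merged.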

Anyway, I hope this gives a better idea of what I meant. Cheers

Originally posted by @sivico26 in https://github.com/ComparativeGenomicsToolkit/cactus/issues/1201#issuecomment-1779156228

glennhickey commented 10 months ago

Cactus is aligning multiple tandem repeats together as paralogs. This touches on two issues:

1) If multiple copies are in every genome, then Cactus should leave them apart. It tries to do this already using outgroup information, but it is not perfect. This is something we are aware of and are working on (but it's tricky in the general case).

2) Even if tandem repeats are collapsed, MAF export should have a way to sensibly chain them if possible. The present logic is lacking in this department. I fixed a very similar issue here for the HAL browser. Something similar in hal2maf or taffy would be useful.

Things you can try now, as I unfortunately can't give a timeline for fixing the above issues:

sivico26 commented 10 months ago

Thank you for the suggestions Glenn, I have a couple of questions.

Which one should I modify: the one in lib/ or the one under cactus_env? My money is on the latter, but I just want to make sure. Do I have to reinstall cactus for the change to take effect, or does rewriting suffice?

Thanks again for your time.

glennhickey commented 10 months ago
sivico26 commented 10 months ago

Super! Thanks for the recommendations!