ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Contigs dropped silently from output #1372

Closed MarionPerrier closed 2 months ago

MarionPerrier commented 5 months ago

Dear developers,

My issue is very similar to https://github.com/ComparativeGenomicsToolkit/cactus/issues/1212. I decided to open a new one, considering I work with the latest release of Minigraph-Cactus.

I am working with the same yeast genomes used in the tutorial (S288C, Y12, YPS128, UWOPS034614, SK1, and DBVPG6044) plus a very fragmented assembly (BY4741). I am trying to build a reference pangenome combining these 7 assemblies, to use later as the reference for an RNA-seq experiment. However, some contigs from the fragmented assembly are silently dropped from my final .gfa graph, and I can't figure out how to keep them in.

I modified the configFile to allow MAPQ=1 contigs in the graph, and added -d 500 to the minigraphConstructOptions in the graphmap options. I also applied --permissiveContigFilter 0.1. However, some contigs are still missing from the final GFA. Here are some of them:

Can you help me figure out what caused these contigs to be filtered out? Could it be due to their sizes? In this yeast example, the biggest dropped contig is 2,979bp. But I am also working with another fragmented fungal species, and the biggest "unexplained" contig dropped is 51,169bp.
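For anyone trying to list which contigs went missing, here is a minimal Python sketch that compares the contig names in an input FASTA against the names found in the W-lines and P-lines of the output GFA. The function names are hypothetical, and the path naming (`sample#hap#contig`, optionally followed by a `[start-end]` subrange) is an assumption that should be checked against your own GFA before relying on the result.

```python
import re

def fasta_contigs(fasta_lines):
    """Return the set of contig names (first word after '>') in a FASTA."""
    return {line[1:].split()[0] for line in fasta_lines if line.startswith(">")}

def gfa_contigs(gfa_lines):
    """Return the contig names referenced by W-lines and P-lines of a GFA."""
    names = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "W" and len(fields) > 3:
            # GFA 1.1 walk: W <sample> <haplotype> <seq_id> <start> <end> <walk>
            names.add(fields[3])
        elif fields[0] == "P" and len(fields) > 1:
            # Assumption: path names look like sample#hap#contig, possibly
            # with a trailing [start-end] subrange on the contig part.
            parts = fields[1].split("#")
            contig = parts[2] if len(parts) >= 3 else parts[-1]
            names.add(re.sub(r"\[\d+-\d+\]$", "", contig))
    return names

def dropped_contigs(fasta_lines, gfa_lines):
    """Contigs present in the FASTA but absent from every GFA walk/path."""
    return sorted(fasta_contigs(fasta_lines) - gfa_contigs(gfa_lines))
```

With real files, both arguments can simply be open file handles, for example `dropped_contigs(open("BY4741.fa"), open("out.gfa"))`, where both file names are placeholders.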

glennhickey commented 5 months ago

Yeah, there isn't a log for every contig dropped and why (though it would be nice). If I look at scaffold_191, it simply doesn't map anywhere: minigraph (and minimap and lastz) return no output from it. So it effectively disappears during mapping. (I see you claim to get a mapq 60 record in your PAF for it, but I can't reproduce that at all).
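One way to double-check whether a contig has any mapping record is to scan the PAF for its query name and print the target and MAPQ of each hit. This is only a sketch: `paf_hits` is a hypothetical helper, it relies on the standard PAF layout (query name in column 1, target name in column 6, mapping quality in column 12), and it assumes the query name in the file matches the string you pass in exactly, which may not hold if the pipeline prefixes names.

```python
def paf_hits(paf_lines, query_name):
    """Return (target_name, mapq) for every PAF record with this query name."""
    hits = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # A valid PAF record has at least 12 mandatory columns.
        if len(fields) >= 12 and fields[0] == query_name:
            hits.append((fields[5], int(fields[11])))
    return hits
```

An empty list means the contig never appears as a query in that file, which is consistent with it disappearing during mapping.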

And looking at the contig I don't think it's terribly surprising that it can't be aligned anywhere.

>BY4741#0#scaffold_191
GAGTGAGGGACCCCCCCCTTACGGGGGGGAACCGAACCCCTTTTTAAGAAGGAGAATATT
TTTTATATCTTTCCTTTATTAATAACAATGATTAACTAAATTGACAATCACAAAGTTATA
ATATTATTATTATATAAATTAATAATATTATATAATTTATAAATTTATACATCTTTTTTA
TTAAATACTTTTTTATAAATATTAATATAATAAAAGATTTTTAATATATTAATAATTATT
ATATTAATCTTTAATAATAAAATAAAAATAATAATAATAAAAATAGAATTTTATAAATAA
ATAATTATAAATAATAAATTTAAATAATATTATTAAATATTATTTAATTATTAATTATGT
AATTAATATTTATATTATATAAAGTATTCAATACTTATTAAAATTAATATTTTTATAAAT
AAAATAAAAATGTAATAATTATAAAAATACCCTTTTTAATATTATATAATTATAAATATA
ATTATTATATAACCCCTATAAAAATTAATATTTAATATTTAATATTTAATATTTAATATT
TAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAAT
ATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTT
AATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATA
TTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTA
ATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATAT
TTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAA
TATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATTTAATATT
TAATATTTAATATTTAATATTTTATATTTTATATTAATTGTATAATAAAAATAAAATATA
TAAATTATATATTATAAATATAAATTATTCTTTTTATAAATATTTATTAATATTTATTAA
AAAATTATATATATATATATATAGATTATAAATTATATATGTTTACTCCCACCCCCTTTT
CGAAATTACAATTATAATTAGTTTAATAAAAAAAAAATAAATAAATAAATAAATTATATA
AAAAAGTATATTTATTAATATTTGATAATATTAATATATTGTAATTATATTAATTATTAT
TATATTTAATATATTAATAATAAAGAAATACAAATTATTATTAAAGTATTAATTATTATT
TAAAATTATATTAGTCCTTCCACCTTTTTATTTTTTAAGAAGGAGTGAGAGACCCCCTCC
CGTGTACTAACGGGAGGGGGACCGAACCCCTTTTTATTCTTAAGAAGGAGTGAGGGACCC
GTGGGGACCGAACCCCGAAGGAGTTATTTATATTATTATTATTTAATAATAATATATAAA
AATATAAATTTATTATTATTATTTTATATAAAATATATATAATATATAATAAAATATATA
AATGATATTATTATTTTTATTAAATATATATATATAATGGAATATATAAATGGTATTATT
ATTTTATTAATTTAATAAAAAAAATAAAAACCTTTAAGACTATAACTTGCCATTAGTAAA
TTATTATTATTTACCCCTCCAATGAATTTTATTGAATTATATACTAAAATATTATAATAT
TATTTTTTTTTAATATTTATTAATATTTATTAAAAGATTATATGGAAATGGATACTTATA
TATATATATATATTTATTTTATAATTAAGAGTTATAATTATTATATAATTTATTTAATCC
TCACTACCTTTTATTATTAATATATATATAACATATTTTTATTATATAAAAATATAAAAA
AAAGATATAATTTTCATAATAAAATTATAATTTAATTTAATATATAAATATATATTTATA
TAAAATTATTATATTATTATTATATTATATAATATTATTAATAATATATTTTATTTAGTT
TCGGGCCCCGGCTACGGGAGCCGGAACCCCGTAAGGAGAAATATATTTTTATAAAATATT
AATTAATTAATTAATTAAAGAAAAAAAAGAAATAATAATTATTTGATAATATATATTATT
ATTATATATAATTATAAAAAAGGTTATAATTATTTTTATTTATAAAATATAAAGTATATT
ATTTAAATTATTAATTAAGTTAGATTTATATTTATAAATATATAAGTCCCGGTTTCTTAC
GAAACCGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNAGGGGGAGGGGGTGGGTGATAATAACCAGAATA
TTAAATATATATAGAGCACACATAGAATAAATTTTATAACATAATCAATAAATATATTAT
AAGAATATAATATATTATATAATAAAATATAAAGTCCCCGCCCCGGCGGGGACCCCGAAG
GAGTATAAACGATATAATTAATTATATAATATAAATATAAATTAAAAATAATAATAAATT
TAATAAAATAATAAATGATAAACAAGAAGATATCCGGGT
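
The contig above is visibly AT-rich, with a long tandem repeat and a run of N bases. As a rough way to flag similar contigs before building the graph, here is a minimal Python sketch that reports the N fraction, the GC fraction, and the longest run of back-to-back copies of a motif. The function names are hypothetical, the motif has to be picked by eye (for example `TTTAATA` here), and any cutoff for calling a contig "unalignable" is left to the reader; this is not how Cactus or minigraph decide anything.

```python
from collections import Counter

def read_sequence(fasta_lines):
    """Concatenate the sequence lines of a single-record FASTA, uppercased."""
    return "".join(l.strip() for l in fasta_lines if not l.startswith(">")).upper()

def composition(seq):
    """Return the length, fraction of N bases, and GC fraction among A/C/G/T."""
    counts = Counter(seq)
    acgt = sum(counts[b] for b in "ACGT")
    return {
        "length": len(seq),
        "n_fraction": counts["N"] / len(seq) if seq else 0.0,
        "gc_fraction": (counts["G"] + counts["C"]) / acgt if acgt else 0.0,
    }

def longest_tandem_run(seq, motif):
    """Length in bases of the longest stretch of consecutive copies of motif."""
    if not motif:
        raise ValueError("motif must be non-empty")
    k = len(motif)
    best = 0
    # Try every start position so the result does not depend on repeat phase.
    for start in range(len(seq) - k + 1):
        end = start
        while seq.startswith(motif, end):
            end += k
        best = max(best, end - start)
    return best
```

Comparing `longest_tandem_run(seq, motif)` and `n_fraction` against the contig length gives a quick sense of how much unique, alignable sequence is actually left.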

MarionPerrier commented 2 months ago

Hi again,

Sorry to have left this issue open for so long. In the end, I filtered out these "nonsense" contigs. Even so, some contigs were still dropped, but they carried few or no coding genes, which is OK given my downstream analysis.

Thank you! Have a good day, Marion