ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

How does Minigraph-Cactus treat exactly same contigs from different samples? (and a question on "--otherContig") #1108

Closed jyj5558 closed 1 year ago

jyj5558 commented 1 year ago

Hi, thanks so much for maintaining this great tool! I have read the 2023 Minigraph-Cactus paper and have some questions about the pipeline.

Suppose Sample A and Sample B consist of exactly the same two contigs, Contig X and Contig Y (i.e., Sample A = Sample B = Contig X + Contig Y), and Sample A is added first to a Minigraph-Cactus pipeline built on a reference genome. Does Sample B then contribute nothing to the resulting pangenome graph? I assume so, because Contig X of Sample A and Contig X of Sample B would both align to the same location on the reference genome, so Sample B would add nothing new to the graph, but I want to confirm that I am right.

My other question is about the "--otherContig chrOther" option: how are the contigs assigned to "chrOther" (non-reference contigs, or "other" contigs) treated? (1) Are they no longer treated as reference contigs? If so, in the single generated "chrOther" graph file, are those "other" contigs from the reference genome integrated/augmented into one graph? (2) Or do they remain acyclic, like the "reference" contigs from the reference genome, and get augmented by other samples whose contigs align to them?

I hope these questions are clear. I look forward to your answers.

I appreciate your reply. Thanks.

glennhickey commented 1 year ago

All samples contribute to the pangenome, unless they cannot be aligned to any reference contig. So in your first case, your pangenome would have two contigs, X and Y, and each would contain the appropriate contig from both samples.

The only time a contig does not contribute to the pangenome is if it doesn't align to a reference contig. So if Sample B had Contig Z that did not map to either X or Y in Sample A, it would get dropped out of the pangenome.
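
To make the rule above concrete, here is a minimal Python sketch. It is a toy model, not Cactus code: the `assignments` table and the function name `contributing_contigs` are hypothetical, and in the real pipeline the assignment comes from alignment rather than a hand-written lookup.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input: (sample, contig) -> reference contig it aligns to,
# or None if it aligns to no reference contig.
Assignments = Dict[Tuple[str, str], Optional[str]]


def contributing_contigs(assignments: Assignments) -> Dict[str, List[Tuple[str, str]]]:
    """Group sample contigs by reference contig, dropping unaligned ones."""
    by_ref: Dict[str, List[Tuple[str, str]]] = {}
    for sample_contig, ref_contig in assignments.items():
        if ref_contig is None:
            continue  # aligns to no reference contig: left out of the pangenome
        by_ref.setdefault(ref_contig, []).append(sample_contig)
    return by_ref


assignments: Assignments = {
    ("SampleA", "X"): "X",
    ("SampleA", "Y"): "Y",
    ("SampleB", "X"): "X",
    ("SampleB", "Y"): "Y",
    ("SampleB", "Z"): None,  # Z maps to neither X nor Y
}
result = contributing_contigs(assignments)
```

In this sketch both samples show up under X and under Y, while Sample B's Contig Z appears nowhere in `result`.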

An alignment job and a graph file are created for each reference contig. This gets annoying in cases where the reference has thousands of little contigs. The --otherContig option lets these small contigs be folded into one job / file. Each reference contig remains a disconnected component that is aligned independently, and the final pangenome graph is unaffected.

So if you have 3 contigs, A, B, C, and ran with --refContigs A B C, you would get A.vg, B.vg, C.vg in your chromosome output. If you ran with --refContigs A --otherContig chrOther, you would get A.vg and chrOther.vg. But chrOther.vg here is exactly the same as the result of vg combine B.vg C.vg, and all the whole-genome graphs and indexes (.gbz, .gfa, etc.) will be identical between the two approaches.
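
The A, B, C example can be sketched as a small Python model of which reference contigs end up in which per-chromosome output. This only illustrates the grouping described above: `chromosome_outputs` and `whole_genome` are hypothetical helpers, not part of Cactus or vg.

```python
from typing import Dict, List, Optional


def chromosome_outputs(
    all_ref_contigs: List[str],
    ref_contigs: List[str],
    other_contig: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Map each per-chromosome output name to the reference contigs it holds."""
    # Every contig named in ref_contigs gets its own output.
    outputs: Dict[str, List[str]] = {name: [name] for name in ref_contigs}
    leftover = [c for c in all_ref_contigs if c not in ref_contigs]
    # With an "other" name given, the remaining contigs share one output.
    # The case without one is not covered in this thread, so this toy
    # simply leaves those contigs out.
    if other_contig is not None and leftover:
        outputs[other_contig] = leftover
    return outputs


def whole_genome(outputs: Dict[str, List[str]]) -> List[str]:
    """Contigs (disconnected components) present across all outputs."""
    return sorted(c for contigs in outputs.values() for c in contigs)


separate = chromosome_outputs(["A", "B", "C"], ["A", "B", "C"])
folded = chromosome_outputs(["A", "B", "C"], ["A"], "chrOther")
```

Here `separate` has one entry per contig, `folded` has `A` plus a `chrOther` entry holding B and C, and `whole_genome` returns the same contig list for both, mirroring the point that only the per-chromosome packaging differs.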

jyj5558 commented 1 year ago

I do appreciate your clear answers!

I understand that each sample contributes to the pangenome graph, but what about from each contig's viewpoint? In this case, both Sample A and Sample B have a Contig X that will be aligned/augmented onto the same reference contig. Will Contig X then be represented twice on the reference contig in the final pangenome graph? Or will Contig X be represented just once, but be traversed by two paths, Sample A's and Sample B's?

Thanks for your answers!

glennhickey commented 1 year ago

There will be one instance of the contig, and two paths through it.
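
As a rough picture of "one instance, two paths", here is a toy Python data structure. The node IDs, the placeholder sequences, and the path names are all made up for illustration; real graphs use their own node and path naming.

```python
from typing import Dict, List

# One shared copy of the contig's sequence, split into two nodes
# (placeholder sequences, not real data).
nodes: Dict[int, str] = {1: "ACGT", 2: "TTGA"}

# One path per sample (hypothetical names); both walk the same nodes.
paths: Dict[str, List[int]] = {
    "SampleA.ContigX": [1, 2],
    "SampleB.ContigX": [1, 2],
}


def path_sequence(name: str) -> str:
    """Spell out the sequence a path traverses."""
    return "".join(nodes[n] for n in paths[name])


def node_coverage() -> Dict[int, int]:
    """Count how many path steps visit each node."""
    coverage = {n: 0 for n in nodes}
    for walk in paths.values():
        for n in walk:
            coverage[n] += 1
    return coverage
```

Both paths spell the same sequence, the sequence is stored only once in `nodes`, and `node_coverage()` reports two visits per node.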

jyj5558 commented 1 year ago

Thanks for quick and super-clear responses! That addresses my remaining questions so I will close this. I really appreciate that.