Building a pangenome from pooled contigs

jyj5558 commented 1 year ago

Hi, thanks as always for your efforts.

Following up on my previous question (#1108): the reason I asked is that my contigs are pooled from one population, so they carry no information about which sample (or samples) a particular contig was derived from. The contigs are stored in FASTA format with non-redundant contig names. Since a pangenome refers to a collection of genes or variants from a gene pool, I think pooled contigs would be fine conceptually, but I wonder whether this matters technically in the Minigraph-Cactus pipeline. The FASTA file will likely contain contigs that do not match exactly but overlap each other on a region of the reference genome, like alternative contigs.

If I pass this single FASTA file, along with a reference genome, to the M-C pipeline using a seqFile like:

pooled-contigs /path/to/pooled/contig.fasta
reference /path/to/reference.fasta

will this approach somehow disrupt the pipeline? I guess it might be okay, since the pangenome pipeline represents alt contigs efficiently, but I would still appreciate your insight.

Thanks as always for your fast and insightful responses, which are very helpful to me.

glennhickey commented 1 year ago

You may be able to get an alignment out of that input, but I can't recommend it: the tools were neither designed for nor tested on this type of use case. They expect input assemblies to be assigned to samples.

The GRCh38 alts graph was made by putting each alt contig into its own sample. If you have a limited number of contigs (under about 500), you can try that approach and it should work out okay.

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#grch38-alts-graph
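For illustration, the one-contig-per-sample idea could be sketched roughly as follows. This is not taken from the Cactus code base: the function name `build_seqfile_lines`, the output layout, the name-sanitizing rule, and the default reference name `reference` are all hypothetical choices made for this example. It assumes a plain (uncompressed) FASTA and a seqFile made of whitespace-separated `name path` lines, and it uses the "about 500" figure above as a default cap.

```python
import re
from pathlib import Path


def build_seqfile_lines(pooled_fasta, out_dir, reference_fasta,
                        reference_name="reference", max_contigs=500):
    """Split a pooled FASTA into one file per contig and return seqFile lines.

    Hypothetical helper: each contig becomes its own "sample", named after
    the contig with non-alphanumeric characters replaced by "_" (an
    assumption made here so names stay simple and usable as file names).
    """
    pooled_fasta = Path(pooled_fasta)
    out_dir = Path(out_dir)

    # First pass: collect sample names so limits are checked before writing.
    names = []
    with pooled_fasta.open() as fin:
        for line in fin:
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("FASTA header without a contig name")
                safe = re.sub(r"[^A-Za-z0-9_]", "_", fields[0])
                if safe in names or safe == reference_name:
                    raise ValueError(f"duplicate sample name: {safe}")
                names.append(safe)
    if len(names) > max_contigs:
        raise ValueError(f"{len(names)} contigs exceeds limit {max_contigs}")

    # Second pass: write each record to its own FASTA file.
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = [f"{reference_name}\t{reference_fasta}"]
    handle = None
    index = -1
    try:
        with pooled_fasta.open() as fin:
            for line in fin:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    index += 1
                    path = out_dir / f"{names[index]}.fa"
                    handle = path.open("w")
                    lines.append(f"{names[index]}\t{path}")
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return lines
```

Writing the returned lines to a text file, one per line, would give a seqFile with the reference plus one entry per contig. Whether the resulting graph is useful for pooled population contigs remains untested, as noted above.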

jyj5558 commented 1 year ago

Thanks for your reply. I need to think more about this approach, and I should also visualize the resulting pangenome graph.

ComparativeGenomicsToolkit / cactus

Building a pangenome from pooled contigs #1114