Building a better pangenome

GeorgeBGM commented 1 year ago

Hi, I have the following questions to ask:

How can new samples and contigs be added to an existing pan-genome? Can this be done directly with the minigraph tool?
How can pan-genomes of the same species that were built on different platforms be merged? Is this possible with vg combine and odgi squeeze?

glennhickey commented 1 year ago

We don't yet have a way to add new samples to (or replace existing samples in) the pangenome.

The only type of merging that I'm aware of being possible is merging different chromosomes from the same species, which vg combine and odgi squeeze can do. Merging this way is already done as part of the minigraph-cactus pipeline, which aligns chromosomes separately and then merges them at the end into whole-genome indexes.

GeorgeBGM commented 1 year ago

Thank you for your reply.

Is there any proposal for directly merging already-constructed pan-genomes of the same chromosome from the same species, or do I need to start from the beginning? Also, how long did it take to construct the human HPRC Phase I pan-genome?

glennhickey commented 1 year ago

There are some running times in the paper. They used an AWS cluster for some steps. On a single machine, you're probably looking at about 2 weeks of running time.

GeorgeBGM commented 1 year ago

Can I use minigraph and vg map to add new genomes to the pan-genome published by HPRC? In our tests, the GFA file output by the minigraph alignment is smaller than the file published by the HPRC project. Can vg map align whole genomes? Do you have any suggestions on this issue?

glennhickey commented 1 year ago

Yes, you can add genomes with minigraph (to minigraph pangenomes) with -cxggs.

For minigraph-cactus, all following steps would need to be rerun.

vg map and vg giraffe will not be able to map genome assemblies. GraphAligner may work.
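
For the minigraph route mentioned above, here is a minimal sketch of how the incremental call could be assembled. It only builds and prints the command line; it does not run anything. The helper name minigraph_augment_command and the file names existing.gfa, new_sample.fa.gz and augmented.gfa are hypothetical. It also assumes the existing graph was built by minigraph and is passed as the first input, with the new assemblies after it, and that the augmented graph is written to standard output.

```python
import shlex


def minigraph_augment_command(base_graph, assemblies, threads=16):
    """Build the argument list for an incremental minigraph graph build.

    base_graph: an existing minigraph graph (assumption: given as first input).
    assemblies: FASTA files of the genomes to add, in the order to add them.
    """
    if not assemblies:
        raise ValueError("need at least one assembly to add")
    # -cxggs is the option string quoted in this thread; -t sets the thread count.
    return ["minigraph", "-cxggs", "-t", str(threads), base_graph, *assemblies]


# Hypothetical file names, for illustration only.
cmd = minigraph_augment_command("existing.gfa", ["new_sample.fa.gz"])

# Assumption: the augmented graph goes to stdout, so redirect it to a file.
print(shlex.join(cmd) + " > augmented.gfa")
```

As noted above, this only applies to minigraph pangenomes; for a minigraph-cactus graph, all following steps would still need to be rerun.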

GeorgeBGM commented 1 year ago

Hi, I tried to use GraphAligner (GraphAligner -g CPC.HPRC.Phase1.CHM13v2_Non-W.gfa -f /HJ.stLFRCCS.maternal.fasta.gz -a aln.gam -x vg), but I got several error messages. Can you give me some suggestions?

glennhickey commented 1 year ago

Please report your GraphAligner issues here: https://github.com/maickrau/GraphAligner/issues/new

GeorgeBGM commented 1 year ago

Hi, my job was killed because it exceeded the node time limit (3 days). Is there any risk to the subsequent steps if I restart it with the --restart option (cactus-pangenome command)?

glennhickey commented 1 year ago

There should be no risk to trying --restart.

GeorgeBGM commented 1 year ago

Thank you for your reply.

GeorgeBGM commented 1 year ago

Hi, I would like to reproduce the construction process of the HPRC project using the step-by-step cactus process. After downloading the data of the HPRC project, I am going to use the following steps to complete the construction of the pan-genome.

Step1: cactus-minigraph
Step2: cactus-preprocess (brnn)
Step3: cactus-graphmap
Step4: cactus-graphmap-split
Step5: cactus-align
Step6: cactus-graphmap-join

I want to confirm that the above steps are in the right order. Also, I got the following error while running Step2. Do I need to replace the # character in the FASTA headers? Do you have any other suggestions?

The specific error reported is as follows:

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'HG005#1#JAHEPO010000001.1' in 'HG005.1'

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'HG01071#1#JAHBCF010000001.1' in 'HG01071.1'

glennhickey commented 1 year ago

You can resolve that error by running cactus-preprocess --pangenome on the input data as a first step to remove the # characters.
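
To see in advance which headers would trip this check, a small stand-alone scan can help. This is only a sketch, not the check Cactus itself runs. The allowed character set is an assumption taken from the error message (alphanumerics plus '_', '-', ':' and '.', where the underscore is a reading of a character that appears to have been lost in the pasted message), and the function names first_word and offending_headers are hypothetical.

```python
import re

# Assumption: allowed characters as listed in the error message
# (alphanumerics plus '_', '-', ':' and '.').
ALLOWED_FIRST_WORD = re.compile(r"[A-Za-z0-9_\-:.]+")


def first_word(header_line):
    """Return the first whitespace-delimited word of a FASTA header line."""
    parts = header_line.lstrip(">").split()
    return parts[0] if parts else ""


def offending_headers(lines):
    """Collect first words of headers that contain a disallowed character."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            name = first_word(line)
            if not ALLOWED_FIRST_WORD.fullmatch(name):
                bad.append(name)
    return bad


# One header from the error above, plus an ordinary one for contrast.
example = [
    ">HG005#1#JAHEPO010000001.1",
    "ACGT",
    ">chr1 some description",
    "ACGT",
]
print(offending_headers(example))  # ['HG005#1#JAHEPO010000001.1']
```

For a real assembly, the same function can be fed the lines of an opened FASTA file instead of the example list.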

Note: The year-2 HPRC graph will be made with a single invocation of cactus-pangenome, and that's how I recommend building graphs with the current version of cactus. If you really want to exactly reproduce the released graph, please look carefully in the papers and use the cactus commits and commands there (but again, you will get better results using the latest release and interface).

GeorgeBGM commented 1 year ago

Thank you for your reply.

Does cactus-pangenome include a step that removes complex regions (as cactus-preprocess (brnn) does)?

Recently, I read the article "Construction and representation of human pangenome graphs", which evaluated different pan-genome construction tools (Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb). Has your team tested them internally, and which software would you recommend trying in addition to Minigraph-Cactus and pggb?

Also, do you have any suggestions about this error (https://github.com/maickrau/GraphAligner/issues/83)?

glennhickey commented 1 year ago

dna-brnn was removed from the default pipeline in Cactus v2.1.0. Since then, alignment gaps are used to remove complex sequence. The difference between the two approaches is touched on in the Minigraph-Cactus paper.

That article is still on my to-read list, so I can't comment on it yet.

I don't have any suggestions for your GraphAligner error, sorry.

GeorgeBGM commented 1 year ago

Thank you for your reply.

I will continue to follow the progress related to pan-genome and hope you will share more.

GeorgeBGM commented 11 months ago

Dear @glennhickey, Is there a feasible process to add 1000 Genomes (SNPs; Indels; SVs) to the GFA pan-genome file generated by The Minigraph-Cactus Pangenome Pipeline? Can vg autoindex be used to do this?

Best, Du

glennhickey commented 11 months ago

Nope, can't be done with vg.

ComparativeGenomicsToolkit / cactus

Building a better pangenome #1037