ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
499 stars 109 forks source link

Calling novel variants using giraffe #1058

Open iamdeaardappel opened 1 year ago

iamdeaardappel commented 1 year ago

Hi!

So far I’ve been mostly playing around and testing the tools from which raised two questions. Using cactus-pangenome I have built a graph of 15 reference potato genomes but with just a single chromosome. So far, I’ve mapped 10 WGS samples against it with vg giraffe.

I have two main questions.

First, I’m struggling to find the most suitable method for calling (novel) variants that do not exist in the graph yet. The vg manual suggests augment followed by pack and call is the way to go for novel variant calling. Because the .gbz file is a read only file, I performed the following steps (I didn’t validate the variants in calls.vcf yet):

vg convert -t 12 potato-pg/potato-pg.d2.gbz -p > potato_chr1.vg
vg augment potato_chr1.pg run.gam -A new.gam > chr1_augment.vg
vg index -p -t 12 -b /tmp/ chr1_augment.vg -x chr1_augment.xg
vg pack -t 12 -x chr1_augment.xg -g new.gam -Q 5 -s 5 -o aln_aug.pack
vg call -t 12 chr1_augment.xg -k aln_aug.pack > calls.vcf

However in #943 I see the recommendation of using subject followed by a linear variant calling tool. Could you help me on the steps I should take to call novel variants?

Second, in my head it makes sense to update the graph once in a while, using high confidence variants coming from the earlier analysis. I thought of combining the cactus-mini graph with the VCF file with vg autoindex. However, it has now been running for over two days without.

vg autoindex -g potato-pg.gfa -v calls.vcf -p GFA_with_VCF

Will updating the graph be of any value? If yes, how would you approach this?

Your help will be much appreciated!

glennhickey commented 1 year ago

I think the best way to call novel variants with with vg surject to BAM, and then a linear variant caller on the BAM. You can try vg augment / call on smaller graphs, but in most cases, I think you'll get much better results with surject. This is what we did in the HPRC paper https://doi.org/10.1038/s41586-023-05896-x and Minigraph-Cactus paper https://doi.org/10.1038/s41587-023-01793-w and is described in detail in their Methods sections.

There are a number of people asking about updating existing graphs, but I do not know of any way to do this -- we do not have the tools for it yet. I'm not sure what autoindex is trying to do in your example, but do not see how it could work. The only exception I know of is graphs from minigraph -- it is fairly straightforward to add new genomes to them due to minigraph's incremental construction algorithm.