ekg / seqwish

alignment to variation graph inducer
MIT License

interpretation of bacteria pangenome graphs #123

Open jianshu93 opened 2 months ago

jianshu93 commented 2 months ago

Dear seqwish team,

I built a bacterial pangenome graph with wfmash + seqwish and visualized it with Bandage. The input is about 50 Chlamydia trachomatis genomes/strains with ANI > 99%; the graph output is attached. Does it indicate that these genomes have almost no rearrangements, but that some of them (~40%) may carry a genomic island (the hollow hole), and that only a few genomes have an insertion at the bottom? The plasmid shows some structural differences and some insertions, but overall it is very similar across all genomes. I am trying to extract biological information from the pangenome graph that cannot easily be detected without it.

Thanks,

Jianshu

[Attached image: Chlamydia_trachomatis_strain_SQ19]

ZoeYang2020 commented 1 month ago

Hi Jianshu, the graph you presented here looks really interesting. I think your interpretation makes sense. Have you tried deconstructing the graph into a VCF to check those variants?

jianshu93 commented 1 month ago

Hi @ZoeYang2020, how do I deconstruct the graph into a VCF? I am new to this, but I am very interested in exploring the VCF files to study those variants.

Best,

Jianshu

ekg commented 1 month ago

I think your graph is very "under-aligned". You need to increase the number of alignments per sequence. What's happening is that each sequence region is aligned against only a few (maybe one) other sequences, namely those that give the best alignment for it. This leads to the braided pattern you see. It will be very difficult to get a meaningful VCF file out of this.
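
To get a feel for how sparse a best-hit-only alignment set is, here is a minimal sketch. It simplifies the situation to whole genomes rather than sequence regions, and the function name `pair_coverage` and its parameters are hypothetical, not part of wfmash or seqwish. It computes an upper bound on the fraction of genome pairs that are ever aligned when each genome keeps only its k best targets.

```python
def pair_coverage(n_genomes: int, mappings_per_genome: int) -> float:
    """Upper bound on the fraction of unordered genome pairs that get aligned.

    Simplifying assumption: each genome is aligned to at most
    `mappings_per_genome` other genomes, and in the best case every
    alignment lands on a different unordered pair.
    """
    if n_genomes < 2:
        raise ValueError("need at least two genomes")
    total_pairs = n_genomes * (n_genomes - 1) // 2
    # A genome cannot be aligned to more than the n - 1 other genomes.
    k = min(max(mappings_per_genome, 0), n_genomes - 1)
    # At most n * k alignments exist, and never more than the number of pairs.
    aligned_pairs = min(n_genomes * k, total_pairs)
    return aligned_pairs / total_pairs


if __name__ == "__main__":
    # 50 genomes, as in the question above.
    for k in (1, 5, 25, 49):
        print(f"k={k:2d}: at most {pair_coverage(50, k):.1%} of pairs aligned")
```

With few mappings per genome, most pairs are never compared directly, which is consistent with the braided pattern described above.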

jianshu93 commented 1 month ago

Hi @ekg, since these are bacterial genomes, which are not as highly similar as human genomes, would you be willing to provide some detailed guidance on how to increase the number of alignments per sequence in wfmash? Additionally, if the ANI among them is only around 95%, what parameters would you suggest?

Thank you so much,

Best, Jianshu

ekg commented 1 month ago

Yes, the paper on pggb, "Building pangenome graphs", includes a bacterial pangenome made from ~500 E. coli genomes. The gist of it is that you would use default parameters but specify a sparsification factor for wfmash to randomly subsample the mappings to align, which can reduce the cost of the quadratic all-to-all alignment. pggb -x auto will do this. This is discussed in the paper; look for the mention of random graph theory. Notes here: https://github.com/pangenome/pggb-paper/blob/main/workflows/1.PangenomeBuilding.md
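
The random-graph argument mentioned above can be illustrated with a small sketch. It relies on the classical Erdős-Rényi result that a random graph on n nodes is connected with high probability once each edge is kept with probability above ln(n)/n. The function names and the safety multiplier of 10 are hypothetical choices for illustration; this is not the actual implementation in pggb or wfmash.

```python
import math
import random


def connectivity_threshold(n_genomes: int) -> float:
    """Erdős-Rényi threshold ln(n)/n for a random graph to be connected."""
    if n_genomes < 2:
        raise ValueError("need at least two genomes")
    return math.log(n_genomes) / n_genomes


def sample_pairs(n_genomes, keep_prob, rng):
    """Keep each unordered genome pair independently with probability keep_prob."""
    return [
        (i, j)
        for i in range(n_genomes)
        for j in range(i + 1, n_genomes)
        if rng.random() < keep_prob
    ]


def is_connected(n_genomes, pairs):
    """Union-find check that the kept pairs link all genomes into one component."""
    parent = list(range(n_genomes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n_genomes)}) == 1


if __name__ == "__main__":
    n = 500  # roughly the size of the E. coli set mentioned above
    safety = 10  # hypothetical multiplier above the bare threshold
    p = min(1.0, safety * connectivity_threshold(n))
    pairs = sample_pairs(n, p, random.Random(0))
    print(f"keep probability: {p:.4f}")
    print(f"pairs kept: {len(pairs)} of {n * (n - 1) // 2}")
    print(f"all genomes linked: {is_connected(n, pairs)}")
```

The point of the sketch is that only a small fraction of all pairs needs to be aligned for every genome to remain transitively linked to every other, which is what makes subsampling the mappings viable for large collections.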

jianshu93 commented 3 weeks ago

Thanks! This is very helpful. Another question: is it possible to derive a variation graph from a de Bruijn graph or a colored de Bruijn graph, or is that transformation simply impossible? I see some advantages there for building scalable pangenome graphs (no need to rebuild from scratch when new genomes are added).

Thanks, Jianshu