Add short sequenced data to our pan-genome

ld9866 commented 1 year ago

Hello developers! I built a 15-genome pan-genome with minigraph-cactus and obtained the VCF file without a problem. I now want to add short-read sequencing data from 500 individuals to our VCF file to build a more complete pan-genome for downstream analysis. What should I do? Best regards.

glennhickey commented 1 year ago

Right now your choices are:

- genotype variants in your 500 samples that are already in the graph (PanGenie or vg call)
- create a new VCF for your 500 samples by "surjecting" their read mappings to a reference path and using a linear variant caller (we use Giraffe -> surject -> DeepVariant)
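As a rough per-sample sketch of the two routes above (not a tested pipeline): it assumes Giraffe indexes for the graph already exist, and the file names (graph.gbz, graph.min, graph.dist, the FASTQ names), the thread count, and the `run`/`run_to` helpers are all placeholders introduced for illustration. Exact flags can differ between vg versions, so check each subcommand's `--help`. With `DRY_RUN=1` (the default here) the script only prints the commands it would run.

```shell
#!/usr/bin/env sh
# Sketch only: graph.gbz, graph.min, graph.dist, the FASTQ names and the
# run/run_to helpers are hypothetical placeholders, not part of vg.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

# run CMD...: execute CMD, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# run_to OUTFILE CMD...: like run, but with stdout redirected to OUTFILE.
run_to() {
    out="$1"; shift
    if [ "$DRY_RUN" = 1 ]; then echo "$* > $out"; else "$@" > "$out"; fi
}

# Map one sample's paired-end reads to the graph with Giraffe.
map_sample() {
    run_to "$1.gam" vg giraffe -Z graph.gbz -m graph.min -d graph.dist \
        -f "${1}_1.fq.gz" -f "${1}_2.fq.gz" -t 8
}

# Option 1: genotype variants that are already in the graph.
genotype_sample() {
    # Compute read support, ignoring alignments with mapping quality < 5.
    run vg pack -x graph.gbz -g "$1.gam" -Q 5 -o "$1.pack"
    # -a also reports reference calls, which helps when merging samples later.
    run_to "$1.vcf" vg call graph.gbz -k "$1.pack" -a -s "$1" -t 8
}

# Option 2: surject the paired alignments onto the reference paths as BAM.
# The BAM still needs sorting before a linear caller such as DeepVariant.
surject_sample() {
    run_to "$1.bam" vg surject -x graph.gbz -i -b -t 8 "$1.gam"
}

# Hypothetical sample name; loop over all 500 samples in practice.
for s in sample001; do
    map_sample "$s"
    genotype_sample "$s"
    surject_sample "$s"
done
```

Setting `DRY_RUN=0` executes the commands instead of printing them. The linear variant-calling step itself is left out because its invocation depends on how DeepVariant is installed.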

To make a combined VCF, you would need to carefully merge the two VCFs from above.
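One possible way to do such a merge is with bcftools, sketched below under several assumptions: both VCFs are bgzipped, use the same reference coordinates and contig names, and contain distinct sample names. The file names and the `run` helper are placeholders. Normalizing both files first (left-aligning indels, splitting multiallelic records) is one reading of "carefully"; it does not guarantee that equivalent variants from the graph and from the linear caller end up represented identically.

```shell
#!/usr/bin/env sh
# Sketch only: file names and the run helper are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

# run CMD...: execute CMD, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# merge_vcfs GRAPH_VCF COHORT_VCF REF_FASTA OUT_VCF
merge_vcfs() {
    graph_norm="norm.$(basename "$1")"
    cohort_norm="norm.$(basename "$2")"

    # Left-align indels against the reference and split multiallelic records
    # so that both files describe variants in the same way.
    run bcftools norm -f "$3" -m -any -Oz -o "$graph_norm" "$1"
    run bcftools index -t "$graph_norm"
    run bcftools norm -f "$3" -m -any -Oz -o "$cohort_norm" "$2"
    run bcftools index -t "$cohort_norm"

    # Combine the samples of both files; -m none avoids creating new
    # multiallelic records during the merge.
    run bcftools merge -m none -Oz -o "$4" "$graph_norm" "$cohort_norm"
    run bcftools index -t "$4"
}

# Hypothetical inputs: the minigraph-cactus VCF and the 500-sample VCF.
merge_vcfs graph.vcf.gz cohort.vcf.gz ref.fa combined.vcf.gz
```

Sites present in only one input come out with missing genotypes for the other file's samples, so it is worth inspecting the merged result before using it downstream.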

You can, in theory, map your reads to the graph and add the variants back with vg augment -m, but this won't give you a VCF and may introduce a lot of noise, so I don't really recommend it.

ld9866 commented 1 year ago

Thank you for your patient reply. I have one more question. We obtained the VCF file from the minigraph-cactus build, but vg autoindex is taking a very long time: it has been running for a few days and is still not done. We have 15 genome files of similar size. The command:

vg autoindex --workflow mpmap -t 4 --prefix vg_rna --ref-fasta example_data/x.fa --vcf example_data/x.vcf.gz --tx-gff example_data/x.gtf

Best wishes!

glennhickey commented 1 year ago

You definitely don't want to go GRAPH->VCF->GRAPH. If you want to re-index the results of minigraph-cactus, you should start with the GFA or GBZ, not the VCF.

ld9866 commented 1 year ago

Thank you for your reply! We now want to use rpvg for a pan-transcriptome study, so we need to build the index files before starting the downstream analysis. According to the rpvg documentation, the first step is to build the indexes; after that, we want to map the transcriptome data to the pan-genome. How do we do that? The example from the documentation:

Construct and index spliced pangenome graph

The easiest way to start this pipeline is to use the vg autoindex subcommand to make indexes for vg mpmap. vg autoindex creates indexes for mapping from common interchange formats like FASTA, VCF, and GTF. It effectively combines the vg rna step and the indexing for vg mpmap.

More information is available in the wiki page on transcriptomics.

Working from this directory, the following example shows how to create a spliced pangenome graph and indexes using vg autoindex with 4 threads:

# Create spliced pangenome graph and indexes for vg mpmap
vg autoindex --workflow mpmap -t 4 --prefix vg_rna --ref-fasta example_data/x.fa --vcf example_data/x.vcf.gz --tx-gff example_data/x.gtf

This will create several files with the prefix vg_rna, which can be used in rpvg and vg mpmap.

Map reads to the spliced pangenome graph

RNA-seq reads can be mapped to the spliced pangenome graph using vg mpmap with 4 threads:

# Map simulated RNA-seq reads using vg mpmap
vg mpmap -t 4 -x vg_rna.spliced.xg -g vg_rna.spliced.gcsa -d vg_rna.spliced.dist -f example_data/x_rna_1.fq -f example_data/x_rna_2.fq > mpmap.gamp

This will create a multipath alignment file called mpmap.gamp.

glennhickey commented 1 year ago

vg autoindex can accept GFA.
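For example, a minimal sketch that swaps the FASTA and VCF inputs of the command quoted above for the graph itself. It assumes that `--gfa` can be combined with `--tx-gff` under the mpmap workflow in your vg version (check `vg autoindex --help`), that the minigraph-cactus GFA has been decompressed, and that the GTF uses contig names matching the reference paths in the graph. The file names and the `run` helper are placeholders.

```shell
#!/usr/bin/env sh
# Sketch only: file names and the run helper are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command, 0 = execute it

# run CMD...: execute CMD, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# autoindex_from_gfa GFA GTF PREFIX
# Build the spliced mpmap indexes directly from the graph, skipping the
# GRAPH -> VCF -> GRAPH round trip.
autoindex_from_gfa() {
    run vg autoindex --workflow mpmap -t 4 --prefix "$3" \
        --gfa "$1" --tx-gff "$2"
}

# Hypothetical inputs: the decompressed minigraph-cactus GFA and an annotation.
autoindex_from_gfa pangenome.gfa genes.gtf vg_rna
```

If the run succeeds, the resulting files with the chosen prefix would take the place of the vg_rna.* files in the vg mpmap command from the rpvg documentation.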

ld9866 commented 1 year ago

Thank you for your reply. Best wishes!

ComparativeGenomicsToolkit / cactus

Add short sequenced data to our pan-genome #943
