Closed: Han-Cao closed this issue 1 day ago
The error occurs because there are overlapping variants on the same haplotype, which should not be possible. As far as I understand, for the examples you shared, the problem is that vcfwave creates VCF records that do not agree with the original VCF line. The variant at position 791616 in the vcfbub VCF has genotype 2|1, but after applying vcfwave, it is converted into two records that both have genotype 0|1. This is wrong; the genotypes should be as shown below:
chr1 791616 >32685013>32685003_1 AATGGA A [...] GT 0|1
chr1 791621 >32685013>32685003_2 A AATGGA [...] GT 1|0
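For illustration, the expected conversion can be sketched in a few lines of Python (a minimal sketch of the splitting rule, not vcfwave's actual implementation): when a multiallelic record is split into one biallelic record per ALT allele, record k gets a 1 on exactly those haplotypes where the original genotype carries allele k.

```python
def split_multiallelic_gt(gt, n_alts):
    """Split a phased multiallelic genotype string (e.g. "2|1") into one
    biallelic genotype per ALT allele: a haplotype carrying allele k
    becomes 1 in record k and 0 everywhere else."""
    alleles = [int(a) for a in gt.split("|")]
    return ["|".join("1" if a == k else "0" for a in alleles)
            for k in range(1, n_alts + 1)]

# GT 2|1 at a site with two ALT alleles yields 0|1 and 1|0,
# matching the corrected records above.
print(split_multiallelic_gt("2|1", 2))  # ['0|1', '1|0']
```

Both resulting records together carry the same haplotype information as the original multiallelic line; emitting 0|1 twice, as vcfwave does here, loses the allele on the first haplotype.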
I guess this is a bug in vcfwave, so my suggestion would be to report this to the authors of vcfwave.
We typically do not use vcfwave for variant decomposition; instead, we use our own decomposition approach, which was designed only for graphs created by the Minigraph-Cactus pipeline. If your graphs were created by Minigraph-Cactus, the recommended way of running PanGenie is described in the PanGenie Wiki: https://github.com/eblerjana/pangenie/wiki/D:-Running-PanGenie-on-HPRC-data. So basically, the steps you'd need to perform are:
1. Run vcfbub, but skip the vcfwave step:
   vcfbub -l 0 -a 100000 -i HG002_no_chm13.vcf.gz | bgzip > HG002_no_chm13_vcfbub.vcf.gz
2. Construct a PanGenie-ready VCF from the output of step 1 using the pipeline https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC. As a result, two VCFs are generated (see the README):
   {results}/vcf/{callsetname}/{callsetname}_filtered_ids.vcf
   {results}/vcf/{callsetname}/{callsetname}_filtered_ids_biallelic.vcf
3. Run PanGenie WITHOUT the run-from-callset pipeline, simply using the following commands and the two VCFs produced in step 2:
PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index
PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24
# decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
Many thanks! I understand why there is an error. I have reported this issue to the authors of vcfwave.
I tried the prepare-vcf-MC pipeline, but I have one question regarding the GFA to be used when genotyping a GRCh38-based VCF. If I understand correctly, the GFA is used by annotate_vcf.py to get the sequence of each segment and its position on the reference genome (the SO tag). Therefore, if I want to process a GRCh38-based VCF, I think I need to input a GRCh38-based rGFA with tags like:
S 1799 A SN:Z:GRCh38#0#chr10 SO:i:287817 SR:i:0
However, if the Minigraph-Cactus graph was created with --reference CHM13 GRCh38, I cannot extract the rGFA tags for GRCh38 segments without modifying the graph (see details at the vg issue).
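For reference, the SN/SO/SR tags on such an rGFA S-line can be pulled out with a short Python helper (an illustrative sketch, not the actual parsing code of annotate_vcf.py): SN names the reference contig the segment lies on, and SO gives its 0-based offset on that contig.

```python
def parse_rgfa_s_line(line):
    """Extract segment name, sequence, and optional SN/SO/SR tags
    from one tab-separated rGFA S-line."""
    fields = line.rstrip("\n").split("\t")
    name, seq = fields[1], fields[2]
    tags = {}
    for tag in fields[3:]:
        key, typ, val = tag.split(":", 2)   # e.g. "SO", "i", "287817"
        tags[key] = int(val) if typ == "i" else val
    return name, seq, tags

line = "S\t1799\tA\tSN:Z:GRCh38#0#chr10\tSO:i:287817\tSR:i:0"
print(parse_rgfa_s_line(line))
# ('1799', 'A', {'SN': 'GRCh38#0#chr10', 'SO': 287817, 'SR': 0})
```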
I am wondering:
1. Is the SO tag necessary to run the pipeline?
2. If the SO tag is necessary, can I skip the annotate_vcf.py step to preprocess the input VCF? I understand that if I skip this, I cannot convert the genotyping results to a biallelic VCF, but would it also affect the genotyping accuracy?
3. If the SO tag is necessary and I cannot skip the annotate step, maybe my only choice is to switch all the analysis to CHM13?
OK, I understand the problem.
Regarding question 1: the SO tag is indeed necessary, because my script uses it to determine the coordinates of nested alleles.
Regarding question 2: yes, in principle you can skip this step. But, as you said, the conversion to biallelic would then not work. Briefly, what would be missing is the decomposition of bubbles into the actual alleles they contain. So you end up with genotypes for all bubbles (the same as without skipping the annotate step), but these bubble genotypes cannot be converted into genotypes for the nested variants (= the step converting to biallelic). In other words, you have the bubble genotypes, but you don't know which variants the bubbles actually contain. This becomes a problem whenever you want to compare your genotyped set to other variant callsets.
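As a toy illustration of that distinction (the allele-to-variant map and variant IDs below are hypothetical; producing such a map is exactly what the annotation step enables):

```python
# Hypothetical map from bubble allele index to the IDs of the nested
# variants that allele contains. Without the annotation step, only the
# bubble genotype is known and this map is unavailable.
bubble_to_variants = {
    1: {"snv-A", "ins-B"},  # allele 1 carries a SNV and an insertion
    2: {"snv-A"},           # allele 2 carries only the SNV
}

def nested_genotype(bubble_gt, variant_id):
    """1 on a haplotype if the bubble allele it carries contains the
    nested variant, else 0."""
    return "|".join(
        "1" if variant_id in bubble_to_variants.get(int(a), set()) else "0"
        for a in bubble_gt.split("|"))

print(nested_genotype("2|1", "snv-A"))  # 1|1
print(nested_genotype("2|1", "ins-B"))  # 0|1
```

The bubble genotype 2|1 alone says nothing about snv-A or ins-B; only with the decomposition can their individual genotypes be recovered.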
I guess it depends a bit on the application, i.e. what you want to do with the genotypes later. If it is sufficient to have a single genotype for each bubble (instead of multiple genotypes for all nested variants), then you can use the pipeline without the annotation step. But especially in case you actually want to study the individual variants nested in the bubble structures, I think the best option would be to switch to CHM13 (or make the vcfwave pipeline work, in case the developers are quick to provide a fix).
Thank you for the explanation! Comparing with other callsets is very important for my analysis, so I will use CHM13 instead. It seems the best way to do analysis on GRCh38 is to start from a GRCh38-based graph.
Hi,
I am trying to use PanGenie to genotype personalized Minigraph-Cactus graphs generated by vg haplotypes. In the personalized pangenome paper, the method below was used to create a decomposed_and_filtered.vcf as input for the run-from-callset pipeline.
I have tried this method and found that many variants are filtered out when creating the pangenome.vcf with merge_vcfs.py merge:
I manually checked some of the filtered variants, and they are realigned and split into biallelic variants by vcfwave:
vcfbub output:
vcfwave output:
As a personalized pangenome graph is a subset of the original MC graph, can I run the prepare-vcf-MC pipeline on the output VCF of vg deconstruct and vcfbub without using vcfwave? Do you have any suggestions on genotyping personalized pangenome graphs? Thanks a lot.