eblerjana / pangenie

Pangenome-based genome inference
MIT License

Genotype personalized pangenome #80

Closed: Han-Cao closed this issue 1 day ago

Han-Cao commented 1 week ago

Hi,

I am trying to use PanGenie to genotype personalized Minigraph-Cactus graphs generated by vg haplotypes. In the personalized pangenome paper, the commands below were used to create a decomposed_and_filtered.vcf as input for the run-from-callset pipeline.

vg deconstruct -P GRCh38 -a <graph>.gbz -t <n_of_threads> > HG002_graph.vcf
bcftools view -s ^CHM13 -Oz -o HG002_no_chm13.vcf.gz HG002_graph.vcf
vcfbub -l 0 -a 100000 -i HG002_no_chm13.vcf.gz | vcfwave -I 10000 -t <n_of_threads> -n > decomposed_and_filtered.vcf

I have tried this method and found that many variants are filtered out when creating the pangenome.vcf with merge_vcfs.py merge:

Two overlapping variants at same haplotype at chr1:791616, set allele to missing.
Two overlapping variants at same haplotype at chr1:1368204, set allele to missing.
...
Total number of input alleles: 6417361
Total number of written alleles: 5213381

I manually checked some of the filtered variants; they were realigned and split into biallelic records by vcfwave:

vcfbub output:

chr1    791616  >32685013>32685003      AATGGA  A,AATGGAATGGA   60.0    .       AC=1,1;AF=0.333333,0.333333;AN=2;AT=>32685013>32685012>32685003,>32685013>32685003,>32685013>32685012<32685011>32685003;NS=2;LV=0       GT      2|1
chr1    1368204 >32721851>32721861      TTTTTTTT        TTTTTTTTTTTTTTT,T       60.0    .       AC=0,2;AF=0.333333,0.666667;AN=2;AT=>32721851>32721852>32721853>32721854>32721855>32721856>32721861,>32721851>32721852>32721853>32721854>32721855>32721856>32721857>32721858>32721859>32721860>32721861,>32721851>32721861;NS=2;LV=0    GT      2|2

vcfwave output:

chr1    791616  >32685013>32685003_1    AATGGA  A       60      .       AC=1;AF=0.333333;AN=2;AT=>32685013>32685012>32685003;NS=2;LV=0;ORIGIN=chr1:791616;LEN=5;INV=0;TYPE=del  GT      0|1
chr1    791621  >32685013>32685003_2    A       AATGGA  60      .       AC=1;AF=0.000000;AN=-1;AT=;NS=2;LV=0;ORIGIN=chr1:791616;LEN=5;INV=0;TYPE=ins    GT      0|1
chr1    1368204 >32721851>32721861_1    TTTTTTTT        T       60      .       AC=2;AF=0.666667;AN=2;AT=>32721851>32721852>32721853>32721854>32721855>32721856>32721857>32721858>32721859>32721860>32721861;NS=2;LV=0;ORIGIN=chr1:1368204;LEN=7;INV=0;TYPE=del GT      1|1
chr1    1368211 >32721851>32721861_2    T       TTTTTTTT        60      .       AC=2;AF=0.000000;AN=-1;AT=;NS=2;LV=0;ORIGIN=chr1:1368204;LEN=7;INV=0;TYPE=ins   GT      1|1

As a personalized pangenome graph is a subset of the original MC graph, can I run the prepare-vcf-MC pipeline on the output VCF of vg deconstruct and vcfbub without using vcfwave? Do you have any suggestions on genotyping personalized pangenome graphs?

Thanks a lot.

eblerjana commented 2 days ago

The error occurs because there are overlapping variants on the same haplotype, which is not possible. As far as I understand, for the examples you shared the problem is that vcfwave creates VCF records that do not agree with the original VCF line. The variant at position 791616 in the vcfbub VCF has genotype 2|1, but after applying vcfwave it is converted into two records that both have genotype 0|1. This is wrong; the genotypes should be as shown below:

chr1    791616  >32685013>32685003_1    AATGGA  A       [...]  GT      0|1
chr1    791621  >32685013>32685003_2    A       AATGGA  [...]    GT      1|0

I guess this is a bug in vcfwave, so my suggestion would be to report this to the authors of vcfwave.
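
If it helps, such inconsistent records can be spotted before merging by checking whether two ALT-carrying records overlap on the same haplotype, which is exactly what merge_vcfs.py warns about. Here is a minimal sketch of such a check (plain-text parsing of a single-sample, phased VCF; the default file name is just a placeholder, and this script is not part of the PanGenie tooling):

#!/usr/bin/env python3
# Sketch: flag records that place overlapping ALT alleles on the same haplotype.
import sys
from collections import defaultdict

intervals = defaultdict(list)  # (chrom, haplotype index) -> [(start, end, record id)]

with open(sys.argv[1] if len(sys.argv) > 1 else "decomposed.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref = fields[0], int(fields[1]), fields[2], fields[3]
        gt = fields[9].split(":")[0]            # phased genotype of the single sample, e.g. "0|1"
        for hap, allele in enumerate(gt.split("|")):
            if allele not in ("0", "."):        # this haplotype carries an ALT allele here
                intervals[(chrom, hap)].append((pos, pos + len(ref) - 1, vid))

for (chrom, hap), ivals in intervals.items():
    ivals.sort()
    for (s1, e1, id1), (s2, e2, id2) in zip(ivals, ivals[1:]):
        if s2 <= e1:                            # next record starts inside the previous REF span
            print(f"overlap on {chrom}, haplotype {hap}: {id1} ({s1}-{e1}) and {id2} ({s2}-{e2})")

For the records above, the vcfwave output (0|1 and 0|1) would be flagged, while the corrected genotypes (0|1 and 1|0) would not.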

We typically do not use vcfwave for variant decomposition, but rather our own decomposition approach, which was designed for graphs created by the Minigraph-Cactus pipeline only. If your graphs were created by Minigraph-Cactus, the recommended way of running PanGenie is described in the PanGenie Wiki: https://github.com/eblerjana/pangenie/wiki/D:-Running-PanGenie-on-HPRC-data. So the steps you would need to perform are:

  1. run vcfbub, but skip the vcfwave step: vcfbub -l 0 -a 100000 -i HG002_no_chm13.vcf.gz | bgzip > HG002_no_chm13_vcfbub.vcf.gz

  2. construct PanGenie-ready VCF from the output of step 1 using the pipeline: https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC. As a result, two VCFs are generated (see README):

    • graph-vcf (PanGenie input VCF): {results}/vcf/{callsetname}/{callsetname}_filtered_ids.vcf
    • callset-vcf (helper VCF for variant decomposition): {results}/vcf/{callsetname}/{callsetname}_filtered_ids_biallelic.vcf
  3. run PanGenie WITHOUT the run-from-callset pipeline, simply using the commands below and the two VCFs produced in step 2:

    PanGenie-index -v <graph-vcf> -r <reference-genome> -t 24 -o index
    PanGenie -f index -i <input-reads> -o pangenie -j 24 -t 24
    # decompose bubbles and produce a bi-allelic VCF with genotypes for each (nested) allele
    cat pangenie_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > pangenie_genotyping_biallelic.vcf
Han-Cao commented 2 days ago

Many thanks! I understand why there is an error. I have reported this issue to the authors of vcfwave.

I tried the prepare-vcf-MC pipeline, but I have one question regarding the GFA to be used when genotyping a GRCh38-based VCF. If I understand correctly, the GFA is used by annotate_vcf.py to get the sequences of the segments and their positions on the reference genome (SO tag). Therefore, if I want to process a GRCh38-based VCF, I think I need to input a GRCh38-based rGFA with tags like:

S       1799    A       SN:Z:GRCh38#0#chr10      SO:i:287817     SR:i:0

However, if the Minigraph-Cactus graph was created with --reference CHM13 GRCh38, I cannot extract the rGFA tags for GRCh38 segments without modifying the graph (see details in this vg issue).
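
For reference, this is how I checked which assemblies the rGFA tags in my GFA point to (a quick sketch; the default file name is just a placeholder):

#!/usr/bin/env python3
# Sketch: count GFA segments by the assembly named in their SN tag.
import sys
from collections import Counter

ref_counts = Counter()
untagged = 0

with open(sys.argv[1] if len(sys.argv) > 1 else "graph.gfa") as gfa:
    for line in gfa:
        if not line.startswith("S\t"):
            continue
        # optional tags start at column 4; keep only well-formed TAG:TYPE:VALUE entries
        tags = {t.split(":", 1)[0]: t.split(":", 2)[2]
                for t in line.rstrip("\n").split("\t")[3:] if t.count(":") >= 2}
        if "SN" in tags and "SO" in tags:
            # e.g. SN:Z:GRCh38#0#chr10 -> count the segment under the assembly prefix
            ref_counts[tags["SN"].split("#")[0]] += 1
        else:
            untagged += 1

print("segments with SN/SO tags per assembly:", dict(ref_counts))
print("segments without SN/SO tags:", untagged)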

I am wondering:

  1. Can I use a GFA without the SO tag to run the pipeline?
  2. If the SO tag is necessary, can I skip the annotate_vcf.py preprocessing step for the input VCF? I understand that if I skip this, I cannot convert the genotyping results to a biallelic VCF, but would it also affect the genotyping accuracy?

If the SO tag is necessary and I cannot skip the annotation step, maybe my only choice is to switch all analyses to CHM13?

eblerjana commented 1 day ago

OK, I understand the problem.

Regarding question 1: the SO tag is indeed necessary, because my script uses it to determine the coordinates of nested alleles.

Regarding question 2: yes, you can in principle skip this step. But as you said, the conversion to biallelic would then not work. Briefly, what would be missing is the decomposition of bubbles into the actual alleles they contain. So what you end up with are the genotypes for all bubbles (the same as when not skipping the annotation step), but these bubble genotypes cannot be converted into genotypes for the nested variants (= the conversion to biallelic). In other words, you have the bubble genotypes, but you don't know which variants the bubbles actually contain. This becomes a problem whenever you want to compare your genotyped set to other variant callsets.

I guess it depends a bit on the application, i.e. what you want to do with the genotypes later. If it is sufficient to have a single genotype for each bubble (instead of genotypes for all nested variants), then you can use the pipeline without the annotation step. But if you actually want to study the individual variants nested in the bubble structures, I think the best option would be to switch to CHM13 (or to make the vcfwave pipeline work, in case the developers are quick to provide a fix).
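
Just to illustrate the idea with a toy sketch (this is not the actual logic of convert-to-biallelic.py; the allele-to-variant mapping and all names below are made up): the annotation step provides the information which nested variants each bubble allele carries, and only with that mapping can a bubble genotype be translated into per-variant genotypes.

# Toy illustration only, not the actual convert-to-biallelic.py logic.
# "allele_to_variants" stands in for the information the annotation step adds:
# which nested variant IDs each bubble allele carries. All names are made up.
allele_to_variants = {
    0: set(),             # reference allele carries no nested variants
    1: {"varA"},          # ALT allele 1 carries nested variant varA
    2: {"varA", "varB"},  # ALT allele 2 carries nested variants varA and varB
}

bubble_genotype = (2, 1)  # genotype reported for the bubble, e.g. "2|1"

# With the mapping, the bubble genotype can be turned into one genotype per nested variant.
for var in sorted(set().union(*allele_to_variants.values())):
    gt = "|".join("1" if var in allele_to_variants[a] else "0" for a in bubble_genotype)
    print(var, gt)
# prints: varA 1|1
#         varB 1|0

Without the annotation, only bubble_genotype is known, so there is nothing to decompose.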

Han-Cao commented 1 day ago

Thank you for the explanation! Comparing with other callsets is very important for my analysis, so I will use CHM13 instead. It seems the best way to do analyses on GRCh38 is to start from a GRCh38-based graph.