eblerjana / pangenie

Pangenome-based genome inference
MIT License

A lot of multi-allelic SV variations in the genotype result file #61

Closed: ld9866 closed this issue 10 months ago

ld9866 commented 10 months ago

Dear developer: We used PanGenie to genotype variants in a population. The result file contains a large number of multi-allelic structural variants (SVs), and most of the alleles at a given site differ by only a few bases. Such sites account for 80% (about 200,000) of all SVs. If we simply delete every multi-allelic SV site, only about 40,000 SVs remain for downstream analysis, and much of the genetic variation is lost. If we instead keep multiple variants at the same site, the subsequent conversion to PLINK format reports that a site cannot have multiple variants. How should we deal with this problem? Best regards

eblerjana commented 10 months ago

Hi,

PanGenie outputs exactly the same VCF records that were given as input, just with computed genotypes. Which input VCF are you using? How was that VCF generated?

Since PanGenie input VCFs typically encode bubbles in a graph, it is entirely expected that they are highly multi-allelic. These bubbles often contain hundreds of smaller, nested variant alleles, so a bubble in the graph is not the same thing as a variant allele. That is why we typically use decomposition approaches to call variation from these bubble structures (see the Wiki for further explanation: https://github.com/eblerjana/pangenie/wiki/A:--Genotyping-variation-nested-inside-of-bubbles).
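
To check how multi-allelic a given VCF actually is, something like the following rough sketch could be used. It only relies on the standard VCF column layout (ALT is the fifth tab-separated column, with alleles separated by commas). The function name `count_multiallelic` and the toy records are made up for illustration and are not part of PanGenie.

```python
import io


def count_multiallelic(vcf_lines):
    """Return (total_records, multiallelic_records) for an iterable of VCF lines."""
    total = 0
    multi = 0
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." meta lines and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; "." means no alternative allele.
        alts = [a for a in fields[4].split(",") if a != "."]
        total += 1
        if len(alts) > 1:
            multi += 1
    return total, multi


# Toy input (hypothetical records): one bi-allelic site and one
# bubble-like site with three alternative alleles.
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n"
    "chr1\t200\t.\tACGT\tA,ACG,ACGTT\t.\tPASS\t.\n"
)

total, multi = count_multiallelic(io.StringIO(toy_vcf))
print(f"{multi} of {total} records are multi-allelic")
```

For a real file, the same function can be given an open file handle (for example via `gzip.open(path, "rt")` for a compressed VCF).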

For VCFs produced by Minigraph-Cactus, a pipeline is available here (also linked in the README): https://github.com/eblerjana/genotyping-pipelines/tree/main/prepare-vcf-MC. It preprocesses the PanGenie input VCF and adds annotations that can be used to call nested variants from bubbles after genotyping. For such VCFs, genotyping can be run as explained here: https://github.com/eblerjana/pangenie/wiki/A:--Genotyping-variation-nested-inside-of-bubbles (basically, just run convert-to-biallelic.py after genotyping). The result will be a bi-allelic representation containing all variant alleles found inside the bubbles. The PanGenie input VCFs provided in the README are already annotated, so if you are using one of them, you simply need to add the "convert-to-biallelic" step; no further preprocessing of the input VCF is needed.
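
To give an idea of what such a decomposition does conceptually, here is a toy sketch. It is not the actual convert-to-biallelic.py, and it makes an assumption about the annotation: each ALT allele of a bubble carries a colon-separated list of IDs of the nested variants it contains, with the lists for different ALT alleles separated by commas. The function names and the IDs (`snv1`, `del1`, `ins1`) are hypothetical.

```python
def parse_id_annotation(annotation):
    """Split an assumed per-allele ID annotation into one set of IDs per ALT allele.

    Assumed layout: "id_a:id_b,id_a,id_c" means ALT 1 carries {id_a, id_b},
    ALT 2 carries {id_a}, and ALT 3 carries {id_c}.
    """
    return [set(per_allele.split(":")) for per_allele in annotation.split(",")]


def decompose_genotype(allele_ids, genotype):
    """Translate a multi-allelic bubble genotype into per-variant bi-allelic genotypes.

    allele_ids: list of sets; allele_ids[i] holds the nested variant IDs
                carried by ALT allele i + 1.
    genotype:   bubble genotype string such as "1/3" or "0|2".
    Returns a dict mapping each nested variant ID to a bi-allelic genotype.
    """
    sep = "|" if "|" in genotype else "/"
    haplotypes = genotype.split(sep)
    all_ids = sorted(set().union(*allele_ids))
    result = {}
    for variant_id in all_ids:
        calls = []
        for hap in haplotypes:
            if hap == ".":
                calls.append(".")  # missing call stays missing
            elif hap == "0":
                calls.append("0")  # reference path carries no nested variant
            else:
                carried = allele_ids[int(hap) - 1]
                calls.append("1" if variant_id in carried else "0")
        result[variant_id] = sep.join(calls)
    return result


# Toy bubble with three ALT alleles; the sample is genotyped as 1/3.
allele_ids = parse_id_annotation("snv1:del1,snv1,ins1")
for variant_id, gt in decompose_genotype(allele_ids, "1/3").items():
    print(variant_id, gt)
```

In this toy case one bubble record becomes three bi-allelic records, one per nested variant, which is the kind of representation that tools expecting a single variant per site can work with.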

ld9866 commented 10 months ago

Dear developer: Thank you for your patience. We are ready to proceed with the following analysis. Have a good day