eblerjana / pangenie

Pangenome-based genome inference
MIT License

Some variants in the input VCF file failed to be genotyped #57

Open JanMiao opened 1 year ago

JanMiao commented 1 year ago

Hi, I have observed that some variants located at the ends of chromosomes were not genotyped in any sample in my dataset. Is this normal? Could you please explain why it happens?

Thanks !

eblerjana commented 1 year ago

Hi,

What exactly do you mean by "unsuccessful in genotyping"? Were they reported with a "./." genotype?

Without knowing any details on your experiments, it's hard for me to say what the reason is. Could be related to the lack of unique kmers (since these regions often contain a lot of repetitive sequence). But it could also be related to the input panel. What data are you using to run PanGenie? Which command did you use?

JanMiao commented 1 year ago

Hi,

My input VCF file was generated using haploid assemblies, and variant calling on WGS data was performed using pangenie. However, in the VCF file obtained from pangenie, some structural variants were missing entirely. It is not simply a case of missing genotypes "./."; these records are completely absent from the VCF file.

I initially expected that pangenie would genotype all the variants present in the input VCF file, even if some of them ended up with a genotype of "./.". However, I have discovered that some variants are missing entirely. If the missing variants are due to a "lack of unique k-mers", would samples sequenced at higher depth be genotyped more successfully?

eblerjana commented 1 year ago

Which version of PanGenie are you using? Which command line? And what does the log file report?

PanGenie genotypes all variants present in the input file, but sometimes no genotype can be computed (e.g. if the computed genotype likelihoods are the same for several possible genotypes). In this case, the genotype "./." is reported.

PanGenie (latest version) only completely skips variants if they contain Ns in the REF/ALT field, or if they are closer than 2*kmer_size bp to the start or end of a chromosome. These cases would be missing completely from the output VCF (but will be reported in the log).
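To check in advance which input records fall under these two skip conditions, something like the following Python sketch could be used. It is not PanGenie's code: the function name `skipped_reason`, the default `kmer_size=31`, and the exact way the distance to the chromosome ends is measured (from the first and last REF base) are assumptions for illustration, so borderline positions may be treated differently by the tool itself. The log file remains the authoritative source.

```python
def skipped_reason(pos, ref, alts, chrom_len, kmer_size=31):
    """Return why a variant might be skipped, or None if neither rule applies.

    pos       -- 1-based POS from the VCF record
    ref, alts -- REF string and list of sequence-resolved ALT strings
    chrom_len -- length of the chromosome in bp
    kmer_size -- assumed k-mer size (hypothetical default, adjust to your run)
    """
    # Rule 1: any N in the REF or ALT sequences.
    if "N" in ref.upper() or any("N" in alt.upper() for alt in alts):
        return "N in REF/ALT"

    # Rule 2: closer than 2 * kmer_size bp to either chromosome end.
    margin = 2 * kmer_size
    end = pos + len(ref) - 1  # 1-based position of the last REF base
    if pos - 1 < margin:
        return "too close to chromosome start"
    if chrom_len - end < margin:
        return "too close to chromosome end"
    return None


if __name__ == "__main__":
    # Hypothetical records on a 1000 bp chromosome.
    print(skipped_reason(10, "A", ["T"], 1000))
    print(skipped_reason(500, "A", ["T"], 1000))
```

Running this over the input VCF with the chromosome lengths from the reference index would give a list that can be compared against the variants reported as skipped in the log.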

If no unique kmers are found, genotypes are imputed from the panel. In repetitive sequence contexts (like the telomeres), however, it might not be possible to compute a genotype because several genotypes have the same likelihood; in this case "./." is reported. Higher depth would likely not help much here, because the problem lies in the complexity of the genome sequence.
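To tell apart the two situations discussed in this thread (records reported as "./." versus records absent from the output), the input and output VCFs can be compared directly. The sketch below is only an illustration: the helper names `parse_vcf` and `classify` are hypothetical, it assumes records can be matched on (CHROM, POS, REF, ALT), that GT is the first FORMAT field, and that only the first sample column matters.

```python
def parse_vcf(lines):
    """Map (CHROM, POS, REF, ALT) to the first sample's GT string (or None)."""
    records = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        # Column 10 (index 9) holds the first sample; GT is assumed to come first.
        gt = fields[9].split(":")[0] if len(fields) > 9 else None
        records[(fields[0], fields[1], fields[3], fields[4])] = gt
    return records


def classify(input_lines, output_lines):
    """Sort input variants into genotyped, uncalled ("./.") and absent."""
    panel = parse_vcf(input_lines)
    genotyped = parse_vcf(output_lines)
    result = {"genotyped": [], "uncalled": [], "absent": []}
    for key in panel:
        if key not in genotyped:
            result["absent"].append(key)      # skipped entirely
        elif genotyped[key] is None or "." in genotyped[key]:
            result["uncalled"].append(key)    # reported, but no genotype
        else:
            result["genotyped"].append(key)
    return result


if __name__ == "__main__":
    # Hypothetical example; pass the paths of your own input and output VCFs.
    import sys
    with open(sys.argv[1]) as panel_vcf, open(sys.argv[2]) as result_vcf:
        counts = classify(panel_vcf, result_vcf)
    for label, keys in counts.items():
        print(label, len(keys))
```

Variants listed under "absent" are the ones to look up in the log file, while those under "uncalled" correspond to the equal-likelihood case described above.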