eblerjana / pangenie

Pangenome-based genome inference
MIT License

Genome graph construction, variant filtering & phasing #33

Closed: swomics closed this issue 1 year ago

swomics commented 1 year ago

Hi,

Thanks for your work on this program. The method seems really elegant and we've had some really exciting results from it so far.

Here is a brief outline of our preliminary analysis:

I have a couple of questions to make sure I'm getting the most out of the program:

1) I initially had good results with minigraph-cactus for generating a comprehensive VCF file. However, I read on the PanGenie issues page that this is not a recommended program. There was a reference to vcfbub, but it's not clear to me what steps need to be taken to go from the reference + long-read assemblies to an appropriate pangenome graph. I didn't notice any obvious issues with the subsequent short-read genotyping results from minigraph-cactus, but I'm worried that I'm not getting the best results.

2) I was trying to be quite conservative with the genotyping results, I used the GQ flag as mentioned in the paper, and then also applied a minimum KC of 7 per sample. Is this a reasonable filtering approach?

3) I noticed that there is an experimental phasing option. We are quite interested in the haplotype structure of our data, and initially we used Shapeit (using the HiFi individuals as reference haplotypes) to phase the population data. Are there any more details available about PanGenie phasing? Is it worth applying the PanGenie phasing prior to Shapeit, perhaps? Is it worth combining PanGenie genotyping with read-based phasing programs like WhatsHap?

eblerjana commented 1 year ago

Hi,

regarding 1.) We used the output VCF generated by `vg deconstruct` from the minigraph-cactus graph and ran vcfbub with parameters `-l 0 -r 100000` on that VCF to filter out LV>0 records (= nested variants), since these contain redundant information that cannot be handled properly by PanGenie. The resulting VCF can then be used as input VCF to PanGenie, and in our experience that worked fine. Which tool did you use in your case to generate the VCF?
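For reference, the steps described above would look roughly like this in shell. The graph and file names are placeholders, and the exact `vg deconstruct` flags depend on your graph and vg version, so treat this as a sketch rather than an exact recipe (only the vcfbub parameters `-l 0 -r 100000` come from the comment above):

```shell
# Extract a VCF of the graph bubbles relative to the reference path
# (placeholder names; -P selects the reference sample/path prefix)
vg deconstruct -a -e -P reference_sample graph.gbz > graph.vcf

# Compress, then drop nested (LV>0) records that PanGenie cannot handle
bgzip graph.vcf
vcfbub -l 0 -r 100000 --input graph.vcf.gz > graph.filtered.vcf
```

The resulting `graph.filtered.vcf` would then be the input VCF for PanGenie.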

2.) I think the KC value is not really useful for filtering, because it is unrelated to the genotype. It just reflects the average read k-mer coverage in the region of the genome where the variant is located. I would suggest using mainly the GQ field for filtering, and maybe it is worth looking at the UK field (INFO field), since it reflects the number of unique k-mers of the variant bubble that were used to infer the genotype. Basically, the more unique k-mers there are, the better. I mainly use the GQ field for filtering though.
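As a sketch of what such a filter could look like on a single-sample PanGenie VCF: the parsing and the thresholds below are mine, purely illustrative, not an official recommendation.

```python
# Sketch: filter single-sample VCF records by GQ (sample field) and
# UK (INFO field). Thresholds are hypothetical, for illustration only.

MIN_GQ = 200   # hypothetical genotype-quality cutoff
MIN_UK = 3     # hypothetical minimum number of unique k-mers

def passes_filter(vcf_line, min_gq=MIN_GQ, min_uk=MIN_UK):
    """Return True if a single-sample VCF record passes both thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    gq = int(sample.get("GQ", 0))
    uk = int(info.get("UK", 0))
    return gq >= min_gq and uk >= min_uk

line = "chr1\t100\t.\tA\tT\t.\tPASS\tUK=5\tGT:GQ\t1/1:10000"
print(passes_filter(line))  # True with the thresholds above
```

In practice the same thing can be done with a `bcftools view -i` expression; the point is just that GQ and UK, not KC, are the fields worth thresholding.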

3.) I would not recommend using PanGenie for phasing, because the panel it uses is way too small to produce any reliable haplotype predictions. In the best case, the output might be useful to get a rough idea of what the haplotypes look like, and maybe combining it with other tools to refine the haplotypes works, but note that we have unfortunately never really used or benchmarked it, so I cannot say whether it is worth it or not. Regarding WhatsHap: WhatsHap is mainly designed for phasing small variants and I'm not sure it works for phasing SVs. We have never tested this unfortunately.

swomics commented 1 year ago

1) Ah ok, I just used the VCF file that is produced at the end of the cactus pangenome workflow. I will have a look at applying the subsequent deconstruct and vcfbub steps. My main concern is whether I will lose the locus we are interested in, which contains lots of overlapping large SVs.

2) I think I misunderstood KC; I thought it might be the number of k-mers supporting a specific genotype call. I found that the GQ field was saturated at a very high value for a lot of variants. When inspecting the data, I noticed that some of these high-quality calls contradicted the Mendelian expectations from parent-child trios, so I was worried that there may be cases of heterozygotes being called as high-confidence homozygotes at loci with low coverage, where an alternate allele may not be represented just by chance.
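For anyone wanting to flag such trio contradictions systematically, a minimal consistency check for biallelic genotypes might look like this (a sketch of the standard Mendelian-inheritance rule, not anything PanGenie-specific):

```python
# Sketch: Mendelian-consistency check for a biallelic trio genotype call.
# A consistent child genotype must take one allele from each parent.

def mendelian_consistent(child, mother, father):
    """Genotypes are (allele, allele) tuples, e.g. (0, 1) for a het call."""
    return any(
        sorted((m, f)) == sorted(child)
        for m in mother
        for f in father
    )

print(mendelian_consistent((0, 1), (0, 0), (1, 1)))  # True
print(mendelian_consistent((1, 1), (0, 0), (0, 1)))  # False: no ALT allele available from the mother
```

Tools like `bcftools +mendelian` do the same check directly on a VCF with a PED file.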

3) Thanks for the details regarding phasing.

swomics commented 1 year ago

Hi again, just following up on this. I've been comparing the calls to a more typical variant calling approach using bwa mapping followed by octopus. There definitely seem to be cases where PanGenie has called variants as homozygous with genotype quality 10,000, whereas the mapped reads and octopus genotyping suggest heterozygous positions, with 1 or 2 well-mapped reads supporting the existence of an alternate allele. In one case there were 4 vs 5 reads suggesting a heterozygous SNP, which had been called as homozygous by PanGenie with GQ = 10,000.

The PanGenie approach has been really great for quickly genotyping samples and getting some very compelling Fst plots, as well as interesting structural variant genotyping results. However, the issue described above has given me significant doubts about the popgen statistics I'm calculating and about the overall results of phasing with Shapeit. Do you have any ideas about whether this issue is resolvable?

Could it be caused by the available references, or perhaps something incorrect about the graph I've used? The KC value for the region is 4 for some of these problem variants, despite the read coverage being ~12x or so. Is that indicative of an issue?

eblerjana commented 1 year ago

Hi,

Can you send concrete examples of these cases where positions are genotyped wrongly? Just to make sure: for these variants, are there panel haplotypes carrying the respective alternative allele? Since PanGenie cannot discover new alleles, it can only re-type alleles seen previously in the panel haplotypes (meaning if all panel haplotypes carry the "0" allele, the genotype of the sample will be 0/0). This issue is not specific to PanGenie but applies to any re-genotyper. Variant callers, in contrast, do not have this problem.

One possible reason might also be the panel you are using. If I understood it correctly, you only have short-read phased haplotypes. PanGenie was designed to be used with fully assembled haplotypes, i.e. haplotypes fully phased into a single block, which is not achievable with short reads (read-based phasing will always produce multiple blocks of phased variants). The PanGenie model uses the haplotype information to infer genotypes, which means that if the phasing information is inaccurate (or fragmented), this will have an effect on the results.

Low KC values mean that only a small number of unique k-mers is available, which is often the case for SVs located in repetitive sequence contexts. In these cases, genotypes are imputed from the haplotypes, so it is especially important that the phasing is accurate (see my previous point).

swomics commented 1 year ago

Hi, thanks for the quick response!

Yes, no problem. I've put together a subset of the relevant data so you can see. I first saw the problem in the variants between 12,391,213-12,391,643

PangenieTest.zip

eblerjana commented 1 year ago

Thanks! I'll have a closer look.

Which version of PanGenie did you use? And did you run genotyping on the whole genome, or just a certain region?

eblerjana commented 1 year ago

Sorry, and would it be possible to share the reference FASTA as well?

swomics commented 1 year ago

Version 2.1.1. The genotyping was run on the whole genome. I made subsets to make the files smaller. You can find the reference chromosome for the subset here: https://www.ncbi.nlm.nih.gov/nuccore/FR989875.1/

eblerjana commented 1 year ago

Great, thanks a lot! I looked at the VCFs and at the alignments and noticed these things:

  1. the overall read coverage is quite low: is that the case for the whole genome, or just in this region?

  2. Most of the variants (especially those between 12,391,495-12,391,549) have just one or two reads supporting the reference allele (all others support the ALT allele). From the alignments, they do not look like heterozygous variants. If they were really heterozygous, one would expect a higher number of REF alleles (the REF/ALT ratio should be roughly 50/50). Accordingly, these variants are not flagged as "PASS" in the octopus VCF. Instead the FILTER is set to "AFB", which, according to the VCF header, means that "The called allele frequencies are not as expected for the given ploidy", suggesting some issues here (I have never used octopus, but for other callers one would typically filter out variants not flagged as "PASS").

So I think it would be a good idea to first verify that the octopus results are correct and that these variants are really heterozygous (e.g. by using additional tools to genotype them).

swomics commented 1 year ago

Thank you for taking the time to look into this further! I am running the octopus joint calling model, which may help assess the validity of the haplotypes further using the rest of the population samples, but that's probably as far as I can go in terms of additional validation, without collecting additional sequencing coverage.

I guess the main difference between the pipelines here (apart from the huge benefits of PanGenie being much easier/quicker and also genotyping complex SVs) is that octopus calls some variants as low-quality heterozygotes, which can be filtered out, whereas PanGenie calls these same loci as maximum-confidence homozygous positions.

  1. The target was 15x, so yes it's probably about right. Looking at other chromosomes in the bam looks comparable.

2a. It's true that the expected allele balance is 50/50 on average. Do you think deviations from this can be expected by chance, or perhaps through library-preparation biases? The only other thing I can think of that would produce two well-mapped reads like this is some sort of demultiplexing error or contamination.

2b. I did notice the severe allele bias at some positions (such as 9/1 and 8/2, filtered out by octopus as a result). But shouldn't the genotype quality reported by PanGenie also be reduced by the weak evidence for an alternate allele, rather than sitting at the maximum value of 10,000?

2c. There are some nearby examples without any bias, for example 12,391,341 (4/5), but this is still called as homozygous by PanGenie. Is this due to the data at neighbouring sites?

eblerjana commented 1 year ago

re 1) PanGenie calls them with high likelihood because, given the data, the probability is quite high that these are homozygous ALT variants. This rather seems like a problem with octopus; why it calls them 0/1 is not clear given the data. They need to be filtered out from the octopus set because octopus reports issues with its own genotyping, but this does not mean that these are not real variants.

re 2a) I guess it's quite unlikely that all of these positions would show such large deviations from the expected ratio, so the much more plausible explanation is that these are simply homozygous ALT variants. Another reason for some reads to support REF alleles at some positions may be that some reads stem from repeated/duplicated regions with slight sequence deviations, causing them to still align with high quality (but that's just a guess).
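To put a rough number on that intuition: at a true heterozygote the REF read count should look like Binomial(n, 0.5), so a 1/9 split at even a single ~10x site is already improbable, and seeing it at many sites jointly far more so. A back-of-the-envelope sketch (ignoring mapping and library biases, and not the actual PanGenie model):

```python
from math import comb

def prob_at_most_k_ref(n_reads, k_ref):
    """P(seeing <= k_ref REF reads out of n_reads) if the site were truly
    heterozygous, i.e. each read drawn REF or ALT with probability 0.5."""
    return sum(comb(n_reads, k) for k in range(k_ref + 1)) / 2 ** n_reads

p = prob_at_most_k_ref(10, 1)
print(f"{p:.4f}")    # ~0.0107 for a 0/1 or 1/9 split at a single site
print(f"{p**5:.1e}")  # and vanishingly small across five such sites
```

So a run of adjacent sites all showing 9/1-style splits is very hard to reconcile with true heterozygosity under this simple model.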

re 2b) Since these positions are covered almost exclusively by alternative alleles, the probability of a "1/1" genotype is very high, so it makes complete sense that PanGenie calls them homozygous ALT with high confidence. Why octopus calls them heterozygous even though the number of ALT alleles is so high is not clear to me (the data clearly suggests otherwise).

re 2c) Many of these nearby variants are genotyped jointly, since they are closer together than the k-mer size. I guess they are genotyped as homozygous because that is the likeliest genotype configuration given the haplotypes and the k-mers. Looking at the input haplotypes, I think it makes sense: in order to genotype these variants as heterozygous, a recombination would be necessary, and given the stronger signal of ALT k-mers anyway, PanGenie will assign them a homozygous genotype.

swomics commented 1 year ago

Thank you for your patience and the thoughtful discussion!

I suppose it boils down to two different points of view on this type of situation, offered by the two programs. Either these positions ought to be regarded as high-quality homozygous, because the read evidence strongly supports this, albeit with some very weak evidence to the contrary; or as ambiguous heterozygotes, which acknowledges that evidence for both alleles does exist in the read data, with the caveat of a very strong allele bias in one direction.