bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis

G' Distribution Fitting #38

Closed thuet206 closed 1 year ago

thuet206 commented 3 years ago

Hello Ben,

I have another question, this time regarding G' distribution fitting. We have noticed that making relatively small changes to our SNP sets can result in fairly profound differences in G' distribution fits. For example, we have seen two almost identical SNP sets (where individual SNPs are assigned comparable G and G' values between sets) fit to different G' distributions, resulting in different adjusted p-values (q-values) between the SNP sets.

At first glance the respective deltaSNP and G' maps appear nearly identical between SNP sets; the only notable change is where the FDR threshold gets drawn. Sure enough, comparing the two SNP sets' G' distribution fits shows that they are being fit to two distinct null G' distributions. The difference in fits results in 5 significant peaks in one SNP set and zero significant peaks in the other, both at an FDR of 5% with all other filtering and plotting settings held constant.
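For context, this is roughly how we are comparing the two fits (a sketch, not our exact script; `snps1` and `snps2` are placeholder names for the two processed SNP sets, each already run through `runGprimeAnalysis()` with identical settings):

```r
library(QTLseqr)

# snps1 / snps2: placeholders for the two SNP sets after
# runGprimeAnalysis() with identical window and filter settings
plotGprimeDist(SNPset = snps1, outlierFilter = "deltaSNP", filterThreshold = 0.1)
plotGprimeDist(SNPset = snps2, outlierFilter = "deltaSNP", filterThreshold = 0.1)

# significant regions at FDR 5% for each set
qtl1 <- getQTLTable(SNPset = snps1, method = "Gprime", alpha = 0.05, export = FALSE)
qtl2 <- getQTLTable(SNPset = snps2, method = "Gprime", alpha = 0.05, export = FALSE)
```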

I should also add that we have only seen this occur for one of our 6 experiments. In each of the other experiments we made similar edits to our SNP sets, and while there were usually some slight tweaks in distribution fits, the FDR thresholds and interpretation of significant peaks were largely unchanged.

Have you encountered anything similar before? If so, any insight about where to begin would be greatly appreciated!

Best, Tanner

bmansfeld commented 3 years ago

Hey Tanner, Thanks for the comment. Could you please post some of the figures? Specifically the histograms for the filtering parameters (i.e. REF_FRQ, SNP-index...). And the plotGprimeDist() plots?

I'm curious about what could be causing this, but my first thought is that you are filtering out some outlier SNPs that are affecting the G'. One thing to play with might be the window sizes, so that the signal isn't oversmoothed; then compare the plots to see where the G' is changing.
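Something along these lines might help isolate the smoothing effect (a sketch; `df_filt` stands in for your filtered SNP set, and the window sizes are arbitrary examples):

```r
library(QTLseqr)

# Re-run the G' analysis at a couple of window sizes and compare where
# the smoothed statistic (and the FDR threshold) moves.
# df_filt is a placeholder for your filtered SNP set.
g_small <- runGprimeAnalysis(SNPset = df_filt, windowSize = 5e5,
                             outlierFilter = "deltaSNP", filterThreshold = 0.1)
g_large <- runGprimeAnalysis(SNPset = df_filt, windowSize = 2e6,
                             outlierFilter = "deltaSNP", filterThreshold = 0.1)

plotQTLStats(SNPset = g_small, var = "Gprime", plotThreshold = TRUE, q = 0.05)
plotQTLStats(SNPset = g_large, var = "Gprime", plotThreshold = TRUE, q = 0.05)
```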

I see that you are working with different experiments. Are any of the QTL that get dropped shared with the other experiments?

Let me know if you can share the info. Hopefully we can learn something helpful from this. Ben

thuet206 commented 3 years ago

Sure,

here are the post-filtered histograms and the G' distribution fit for "analysis 1": [three attached images]

and here are the same plots for "analysis 2": [three attached images]

Yes, several of the QTL that get dropped between the two analyses are QTL that we have seen in other experiments, so we are somewhat confident that they are real.

In case it is useful for your interpretation: the only thing that changed between the SNP sets used in analysis 1 and analysis 2 was our choice of reference genome for upstream variant calling. The differences between the reference assemblies are very minor, which has made these conflicting results troubling to interpret.

As far as window sizes go, we calculated window sizes based on species-specific recombination rate as described by Paul Magwene here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002255

We also took Magwene's advice and generated segregant populations via multiple rounds of random mating and sporulation to increase recombination events and justify a smaller window size (we hoped this would increase mapping resolution). We did a back-of-the-envelope calculation based on recombination rate and rounds of random mating to arrive at the W values we have been using thus far.

That said, I have noticed that increasing the window size tends to decrease the FDR threshold (or its corresponding G' cutoff value), all else being equal, causing some QTL peaks to reach significance. Is this a sign that our calculated window size is too small? We feel uneasy about arbitrarily adjusting window sizes so that our QTL reach significance, but are also having trouble grappling with some of the conflicting results we are seeing.
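To give a sense of the kind of calculation I mean (all numbers below are made-up placeholders, not our actual values, and the assumption that k rounds of intercrossing expand the effective map roughly k-fold is our own simplification, not Magwene's exact derivation):

```r
# Illustrative back-of-the-envelope window-size calculation.
# All inputs are placeholder values, not our real experiment.
genome_bp <- 12e6   # genome size in bp (roughly yeast-sized)
map_cM    <- 4500   # genetic map length in cM for a single meiosis
rounds    <- 5      # rounds of random mating / sporulation

# Simplifying assumption: k rounds of intercrossing expand the
# effective map roughly k-fold (more recombination breakpoints).
effective_cM <- map_cM * rounds

# Physical distance corresponding to 1 cM of effective map
bp_per_cM <- genome_bp / effective_cM

# Pick a window spanning a fixed stretch of effective map, e.g. 20 cM
window_bp <- 20 * bp_per_cM
window_bp
```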

Thanks again for the help! Tanner

bmansfeld commented 3 years ago

Hey Tanner, Sorry for not responding sooner... my bad.

I can see why you have reason for concern; you'd think nothing would be different if the references are the same genotype. So: (a) if they are not the exact same line, then that's an issue, of course; and (b) even if there are some genotypic differences (i.e. from being re-cultured in the lab, etc.) between the line you used as a parent and one of the references, that could be an issue we might need to address.

Thanks for these figures, which look pretty good as far as I can tell; nothing shockingly aberrant. Can you maybe share the pre-filtered ones? The issue is that something is being filtered out in round 2, and I wonder what and why.

So, things to think about. Still concerning.

The problem is that if the references are different, it's hard to compare the VCF files to find out the positions of SNPs that are lost, etc.

If you run the other experiments on the new reference, do the same QTL drop out? If so, I'm willing to bet there is a genotypic difference at some key SNPs. Or maybe the original reference was bad? Sounds unlikely, but all are options...
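If you can get the two call sets onto comparable coordinates (e.g. by lifting one set over), a quick anti-join on CHROM/POS will list the SNPs that drop out. Toy data below, just to show the idiom:

```r
# Toy SNP tables (made-up positions) standing in for the two call sets
set1 <- data.frame(CHROM = "chr1", POS = c(100, 200, 300, 400))
set2 <- data.frame(CHROM = "chr1", POS = c(100, 300, 400))

# SNPs present in set1 but missing from set2
key1 <- paste(set1$CHROM, set1$POS)
key2 <- paste(set2$CHROM, set2$POS)
lost <- set1[!key1 %in% key2, ]
lost$POS
```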

Sorry I don't have a better answer...

I'm curious though so let me know what you find and hopefully I can help more. Ben

thuet206 commented 3 years ago

Thanks Ben,

Some more background:

The differences in reference genome were pretty minor. For both reference assemblies used in the above analyses we sequenced one parent, aligned its reads to the publicly available reference, and created an updated assembly that included any mutations our parent had picked up relative to the public reference.

For the first analysis we only updated the public reference for SNPs (in our parent strain relative to the public assembly), so that positional information was conserved with respect to the public reference assembly; that way we could take advantage of GFF files and also compare our QTL directly to literature QTL. For the second analysis, we used a reference assembly where both SNPs and INDELs were replaced with respect to the public reference, so that the reference used for bulk variant calling was as close to the parent genome as possible. Since all SNPs were derived from the same parent VCF file, we believe the SNPs should be perfectly conserved between our two assemblies, while there are definitely differences in INDELs between them.

The maps (QTL topography) remain nearly identical between choices of reference genome, while the FDR significance threshold changes quite dramatically. Because these maps looked so similar, we had initially assumed there must be an error in thresholding, but now I can see a scenario where removing/adding INDELs could change which SNPs share a sliding window, which could affect individual G' values, and in turn the distribution fitting. It just seems surprising that the changes would be as dramatic as they are while the map topography remains relatively unchanged.
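A toy example of the scenario I mean (made-up positions): a single upstream indel shifts downstream coordinates, so a SNP on the other side of the indel can fall out of a SNP's sliding window even though nothing else changed:

```r
# Toy illustration (made-up positions): a net 250 bp insertion before
# position 9000 shifts downstream SNP coordinates, changing which SNPs
# share a window and therefore each SNP's smoothed statistic.
pos_ref1 <- c(100, 5000, 9900, 10100, 15000)  # positions on assembly 1
indel_shift <- 250                            # net bp inserted before pos 9000
pos_ref2 <- ifelse(pos_ref1 > 9000, pos_ref1 + indel_shift, pos_ref1)

window <- 10000
# SNPs within +/- window/2 of the focal SNP (the 3rd one) on each assembly
focal1 <- pos_ref1[3]
focal2 <- pos_ref2[3]
in_win1 <- abs(pos_ref1 - focal1) <= window / 2
in_win2 <- abs(pos_ref2 - focal2) <= window / 2
c(assembly1 = sum(in_win1), assembly2 = sum(in_win2))
```

The SNP at 5000 sits 4900 bp from the focal SNP on assembly 1 but 5150 bp away on assembly 2, so it drops out of the window on the second assembly.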

As far as your question about comparing to our other experiments, we see the same trend. The QTL peaks themselves do not rise or fall (at least not to the naked eye) but the horizontal FDR-significance threshold line shifts pretty dramatically.

Here are the unfiltered SNP distributions:

analysis 1 (only parent SNPs replaced in the reference)

[two attached images: unfiltered SNP distribution plots]

analysis 2 (parent SNPs and INDELs replaced in the reference)

[two attached images: unfiltered SNP distribution plots]

Another thing: it looks like we have pretty clear evidence of mapping bias toward the reference parent, regardless of chosen reference assembly. We have thought about aligning bulk reads to both parents' reference genomes and then removing all reads that couldn't unambiguously map to both, hoping to reduce some of the apparent mapping bias that way. Have you ever tried anything similar?

Thanks again for the help! Tanner

bmansfeld commented 3 years ago

Hey Tanner, including the indels shuffles things around. It can be quite substantial after including thousands, if not millions, of INDELs of length 1-50 bp. That being said, if the topography is the same (I understand this as the delta-SNP index doesn't seem to change much), I'm surprised the G' threshold is completely different. I'm not a huge fan of how Magwene suggested identifying outlier regions to define the p-value, but that's just the method...

What do the deltaSNP-index thresholds look like? Are you running that analysis too? Again, I'm sorry I don't have a better answer. I would do a bit of digging to see if you can tell which G' values change drastically. It seems from the G' distribution plot that you are filtering more (?) G' values as outliers in one set, and that is what's causing the null distribution to be more strict? That's my hypothesis right now. In the past I've done some debugging where I ran the filtering scripts separately from the pipeline to see what gets pushed out as an outlier. This is how we developed the deltaSNP approach to outlier filtering.
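A toy illustration of this hypothesis (simulated numbers, and a naive log-moment fit rather than QTLseqr's actual estimator): the set of G' values that survives outlier filtering directly determines the fitted log-normal null, and trimming more of the high tail pulls the resulting threshold down:

```r
# Toy demo (not QTLseqr's exact estimator): how the set of G' values
# treated as "null" changes the fitted log-normal and its threshold.
set.seed(1)
null_g <- rlnorm(5000, meanlog = 0.5, sdlog = 0.3)  # background G'
qtl_g  <- rlnorm(200,  meanlog = 2.0, sdlog = 0.2)  # inflated G' in QTL regions
all_g  <- c(null_g, qtl_g)

fit_thresh <- function(g, q = 0.95) {
  # naive log-normal fit by log-moments, then a quantile as a threshold
  qlnorm(q, meanlog = mean(log(g)), sdlog = sd(log(g)))
}

# Fit with and without filtering out the high tail (crude outlier trim)
thresh_all  <- fit_thresh(all_g)
thresh_trim <- fit_thresh(all_g[all_g < quantile(all_g, 0.95)])
c(all = thresh_all, trimmed = thresh_trim)
```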

Let me know if you find something. If you can figure out a bug, or some reason that things don't make sense, I will do my best to fix it. Ben

thuet206 commented 3 years ago

Hi Ben,

Yes it seems plausible that the indels are throwing things off. There are a total of 151 indels (of variable length) between the assemblies. Not a huge number, but definitely enough to make a difference.

The deltaSNP maps (and thresholds) look much more convergent between reference assemblies. For now we are more confident in those results and will proceed with the peaks derived from the QTL-seq approach for downstream analysis. But we agree: considering how similar the deltaSNP maps are, we would expect the G' maps to agree more closely.

We are currently taking a closer look at how the null distribution is derived and are playing around with different mode estimators. We will look into filtering differences between the assemblies too.

As always, thanks for the help. Will keep you updated as we go. Cheers, Tanner

bmansfeld commented 3 years ago

I'll leave this issue open if you want to discuss further.

Thanks, good luck!