bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis
64 stars 42 forks source link

Considerable difference between "posPeakDeltaSNP" and "posMaxGprime" numbers #16

Closed kalexiou closed 5 years ago

kalexiou commented 5 years ago

Hi Steve,

First of all let me congratulate you for this nice tool, bringing together both statistical methods developed by Takagi et al. and Magwene et al.

After running successfully the QTL analysis, I extracted the QTLs using the getQTLTable() function and, for my major QTL in chr5, I see a considerable difference between the peak positions of deltaSNP and Gprime statistics. This holds true for the minor QTLs as well. I think that, in general, these two values should be quite close, no?

Below I provide the workflow of the R script used and the resulting QTL table:

library("QTLseqr")

file20 <- "SR_filt_perc20_2.table" hb20 <- "BWA_output/R_all_20perc.bam" lb20 <- "BWA_output/S_all_20perc.bam" Chroms <- paste0(rep("chr", 7), 1:7)

df20 <- importFromGATK(file = file20, highBulk = hb20, lowBulk = lb20, chromList = Chroms)

df_filt20 <- filterSNPs(SNPset = df20, minTotalDepth = 10, refAlleleFreq = 0.2, maxTotalDepth = 150, minSampleDepth = 15, minGQ = 30) `

QTLseq analysis (Takagi et al. (2013))

df_filt20 <- runQTLseqAnalysis(df_filt20,
                                     windowSize = 4e6,
                                     popStruc = "F2",
                                     bulkSize = c(15, 15),
                                     replications = 10000,
                                     intervals = c(95, 99))

G' analysis (Magwene et al. (2011))

df_filt20 <- runGprimeAnalysis(SNPset = df_filt20, windowSize = 4e6, outlierFilter = "deltaSNP", filterThreshold = 0.2)

Plotting

plotQTLStats(SNPset = df_filt20, var = "nSNPs") + ggtitle('nSNPs - perc20_samtools') plotQTLStats(SNPset = df_filt20, var = "deltaSNP", plotIntervals = TRUE) + ggtitle('deltaSNP - perc20_samtools') + ylim(-0.8,0.8)

plotQTLStats(SNPset = df_filt20, var = "Gprime", plotThreshold = TRUE, q = 0.001) + ggtitle('Gprime - perc20_samtools')

Extract QTLs

results_20 <- getQTLTable(SNPset = df_filt20, method = "Gprime",alpha = 0.001, export = FALSE) results_20

CHROM qtl start      end   length nSNPs avgSNPs_Mb peakDeltaSNP posPeakDeltaSNP avgDeltaSNP maxGprime posMaxGprime meanGprime    sdGprime     AUCaT
1  chr3   1  3141  7533286  7530145  4234        562    0.1477947            3141   0.1058236  5.367394      4987052    5.13946  0.09072959   6933819
2  chr5   1   691 24469450 24468759 37979       1552   -0.5857466         5981863  -0.2947747 39.274411      8086713   18.43512 11.06228565 430543716
      meanPval     meanQval
1 9.428524e-06 7.545695e-05
2 1.243138e-06 9.720765e-06

Thanks Kostas

bmansfeld commented 5 years ago

That's a good question - yes in general the methods should yield close results. It's hard to guess why there is a big difference with out seeing the plots. One thing to note (and I've seen in other's data) is that Gprime is sensitive to zigzaging patterns of deltaSNP, because it is an absolute value. ie if you have a large dip in deltaSNP and then a large peak that whole region from the start of the dip to the end of of the peak is detected by Gprime as deviating from the expected 0.5 allele frequency. This can affect the center of the peak. You might want to change your smoothing window. Hope this helps. I would recommend going with the method your community is more comfortable using if you are uncomfortable with these discrepancies. Let me know if you have anymore questions. I'll leave this issue open for a while. Ben

kalexiou commented 5 years ago

Hi Ben (sorry for the Steve in my first message!)

Thanks for your quick response. You are right. When I increased the size of the smoothing window from 4.0e6 to 4.7e6, deltaSNP and GprimeSNP got closer (7678525 and 8086713, respectively). The thing is that, from previous studies, we know that the gene responsible is at ~7,2Mbp, with deltaSNP being closer to the gene than Gprime statistics.

I attach below some graphs in order to have your opinion on the quality of the data as well and/or any additional comments that you may have.

Thanks a lot Kostas

Graphs after data filtering:

REF-AF Total-DP SNPindex-HIGH SNPindex-LOW

QTL graphs

1Mbp window - deltaSNP

deltaSNP_1Mbp-window

1Mbp window - Gprime

Gprime_1Mbp-window

4.7Mbp window - deltaSNP

deltaSNP_4 7Mbp_window

4.7Mbp window - Gprime

Gprime_4 7Mbp-window

bmansfeld commented 5 years ago

Hi, Your data look pretty good to me. Keep in mind that both methods rely on recombination in your population and your population size and how strict your selection is. In my eyes having both of those peaks within 1Mb of your casual gene is an amazing result for these methods. The methods are slightly different and so the peak values are slightly different. Again, you need to decide which method you want to work with based on the literature and the norms in your community. Good luck with your work, Ben