bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis

Handling experimental replicates #36

Closed: thuet206 closed this issue 4 years ago

thuet206 commented 4 years ago

Hello again Ben,

I don't have a technical problem this time, but I do have a statistics/theory question for you! Do you have any suggestions on how to handle multiple experimental replicates under a G' statistical framework? For example, say you performed two entire BSA experiments on two distinct segregant populations (derived from the same parents). How would you best leverage the replicate data? My first thought was to simply pool all the reads in the upstream read-processing pipeline in order to generate VCF files that contain both experimental replicates, although this seems problematic for several reasons. My next thought was to calculate a raw G statistic for each SNP in both replicate SNP sets independently and then use the mean G value for each SNP during the G' sliding-window calculation, although this doesn't seem like a perfect approach either.

As always, thank you in advance for your help!

Tanner

PS - I took up learning some ggplot2 (thanks again for the advice) and used your QTL-map-plotting source code to figure out how to plot multiple experimental replicates on a single QTL map. The replicates are a little noisy, though, which is part of what is motivating us to consider both replicates in a single statistical framework.

bmansfeld commented 4 years ago

Hey Tanner, thanks for the questions. I think all your thoughts are on the right track. Unfortunately, I don't have a perfect answer for you, but I can offer advice from my personal experience and end with a suggestion of something I've never tried but that could be promising.

So, as for option No. 1, pooling data from two experiments and rerunning the analysis: I've done this before. The easiest and maybe (?) most accurate way to do this is to change the read groups in the BAM files, using Picard or some other tool, so that they say the different years are the same sample, and then go through the GATK pipeline anew. Perhaps not surprisingly, the results, at least in my experience, yield an approximate mean of the two experiments. In my case, peaks that existed in one experiment and not the other had their delta-SNP values cut in half, etc. That is to say, I doubt you will magically resolve any inconsistencies between the data sets.
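The halving effect described above follows directly from how pooled allele counts combine. Here is a minimal sketch with made-up read counts at a single SNP, assuming the delta-SNP value is the alternate-allele frequency in the high bulk minus that in the low bulk; the function names and numbers are hypothetical and not part of QTLseqr.

```python
def delta_snp_index(high_alt, high_depth, low_alt, low_depth):
    """Alt-allele frequency in the high bulk minus that in the low bulk."""
    return high_alt / high_depth - low_alt / low_depth


def pool(rep1, rep2):
    """Sum the read counts of two replicates, as merging read groups would."""
    return tuple(x + y for x, y in zip(rep1, rep2))


# Hypothetical counts: (high_alt, high_depth, low_alt, low_depth)
rep1 = (70, 100, 30, 100)   # clear signal, delta of roughly 0.4
rep2 = (50, 100, 50, 100)   # no signal, delta of 0.0
pooled = pool(rep1, rep2)   # (120, 200, 80, 200)

for name, counts in [("rep1", rep1), ("rep2", rep2), ("pooled", pooled)]:
    print(name, round(delta_snp_index(*counts), 3))
```

With equal depth in both replicates the pooled value sits halfway between the two (roughly 0.2 here). With unequal depth the pooled value is a depth-weighted mean, so the more deeply sequenced replicate dominates.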

As for option No. 2, this is possible and worth a shot, but I would consult your local statisticians to check whether it invalidates any assumptions. I would imagine that the plots would look as above, though: the mean of the two lines.
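For reference, option No. 2 could be sketched roughly as follows. This is an illustration only: it assumes the standard G statistic on a 2x2 table of allele counts (bulk by allele) and uses a simple tricube-weighted average as a stand-in for the G' smoothing step. It is not QTLseqr's implementation, and the function names and input layout are hypothetical.

```python
import math


def g_statistic(ref_low, alt_low, ref_high, alt_high):
    """G statistic for one SNP from a 2x2 table of allele counts."""
    observed = [ref_low, alt_low, ref_high, alt_high]
    total = sum(observed)
    low, high = ref_low + alt_low, ref_high + alt_high
    ref, alt = ref_low + ref_high, alt_low + alt_high
    expected = [low * ref / total, low * alt / total,
                high * ref / total, high * alt / total]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def mean_g(rep1, rep2):
    """Average G per SNP over positions present in both replicates.

    rep1, rep2: dict mapping position -> (ref_low, alt_low, ref_high, alt_high)
    """
    shared = sorted(rep1.keys() & rep2.keys())
    return [(pos, (g_statistic(*rep1[pos]) + g_statistic(*rep2[pos])) / 2.0)
            for pos in shared]


def tricube_smooth(points, window):
    """Tricube-weighted average of G over SNPs within window/2 of each SNP."""
    half = window / 2.0
    smoothed = []
    for pos, _ in points:
        num = den = 0.0
        for other, g in points:
            d = abs(other - pos) / half
            if d < 1.0:
                w = (1.0 - d ** 3) ** 3
                num += w * g
                den += w
        smoothed.append((pos, num / den))
    return smoothed


# Hypothetical counts at three SNPs in two replicates
rep1 = {100: (50, 50, 50, 50), 200: (30, 70, 70, 30), 300: (45, 55, 55, 45)}
rep2 = {100: (48, 52, 52, 48), 200: (40, 60, 60, 40), 300: (50, 50, 50, 50)}
print(tricube_smooth(mean_g(rep1, rep2), window=400))
```

Only SNPs called in both replicates are averaged here, which is one of several possible choices. Whether the null distribution used to assess G' still applies to an averaged G is the kind of assumption worth checking with a statistician.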

This leads me to another option, which I have not yet tried and which was discussed in issue #3. The CMH test is similar to G' and Chi^2, and it might be worth looking into, but this would be a whole new type of test. I'm not sure if the approach suggested by Magwene et al. (2011) for G' is directly applicable to CMH. Again, I would advise discussing this option with a local statistician. I've been thinking about including something like this in the package, but haven't had the time. Let me know if you go this route; I would be curious to see what you find.
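To make the CMH idea concrete, here is a minimal per-SNP sketch of the Cochran-Mantel-Haenszel statistic, treating each replicate as one stratum with its own 2x2 table of allele counts. It uses only the textbook formula with an optional continuity correction. It does not smooth across SNPs and says nothing about whether the Magwene et al. approach carries over; the function name and table layout are assumptions made for this example.

```python
import math


def cmh_test(tables, correct=True):
    """Cochran-Mantel-Haenszel test across K 2x2 tables (1 degree of freedom).

    tables: list of (a, b, c, d), one per replicate, where
            a = ref reads in low bulk,  b = alt reads in low bulk,
            c = ref reads in high bulk, d = alt reads in high bulk.
    Returns (statistic, p_value).
    """
    diff = 0.0  # sum over strata of observed minus expected count in cell a
    var = 0.0   # sum over strata of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if var == 0.0:
        return 0.0, 1.0
    num = abs(diff)
    if correct:
        num = max(0.0, num - 0.5)  # continuity correction
    stat = num ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts at one SNP in two replicates
consistent = [(30, 70, 70, 30), (35, 65, 65, 35)]  # same direction in both
opposing = [(30, 70, 70, 30), (70, 30, 30, 70)]    # opposite directions
print(cmh_test(consistent))
print(cmh_test(opposing))
```

Because the statistic sums signed deviations across strata, an allele-frequency shift in the same direction in both replicates reinforces the signal, while shifts in opposite directions cancel. That is a different behavior from averaging G, which ignores direction.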

Good luck, Ben

thuet206 commented 4 years ago

Thank you, Ben, for your quick response. My apologies, I hadn't seen the issue #3 thread. CMH looks promising and might be something we look into further. I will be sure to fill you in after talking over the statistics with our team. I agree that both of the other approaches are limited in that they simply provide a mean without any way to capture experiment-to-experiment variation. We have already compiled overlapping QTL regions between replicates, but we were hopeful a more quantitative method might exist somewhere. If we come across any options that look promising, we will be sure to share!

Thanks again for the help. Best, Tanner