caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes

How does treeWAS perform the Bonferroni correction? #11

Closed cizydorczyk closed 6 years ago

cizydorczyk commented 7 years ago

Hello,

I am very interested in applying treeWAS to a data set of P. aeruginosa isolates that I have, but I have a (hopefully) quick question.

I was wondering if you could provide some background on how treeWAS performs the Bonferroni correction for multiple testing?

More specifically, does it use the total number of SNPs (or genes if working with gene presence/absence data) to correct the p-values from the 3 significance tests? Or does it use some reduced form?

By reduced I mean, for example, a case where two SNPs follow the same pattern in the data set and could therefore be counted as a single "unit", so that the correction would treat both as one "test" rather than as two separate but identical tests. There is a toy example of what I mean below.
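To illustrate, here is a made-up example in R (the matrix below is invented, not from my data):

```r
## Toy example: loci 1 and 2 segregate identically across isolates, so
## under the "reduced" scheme they would count as one test, not two.
snps <- matrix(c(0, 1, 1, 0,   # locus1
                 0, 1, 1, 0,   # locus2 (identical to locus1)
                 1, 0, 1, 1),  # locus3
               nrow = 4,
               dimnames = list(paste0("isolate", 1:4), paste0("locus", 1:3)))

ncol(snps)                      # all sites: 3 tests
ncol(unique(snps, MARGIN = 2))  # unique patterns only: 2 tests
```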

Any response is greatly appreciated!!

Thank you,

Conrad Izydorczyk

caitiecollins commented 6 years ago

Hi Conrad,

Good to hear from you. Very sorry for the delay in getting back to you-- I've been travelling around and conference-ing. Thanks for your patience.

But to answer your question, I should start by saying that we don't directly correct a set of previously-calculated p-values from the association tests.

What we do, first, is assign a p-value to each association score value in the null distribution. We do this in a count-based manner (explained in this previous issue).
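In generic terms, a count-based (empirical) p-value looks something like the sketch below. This is just the standard empirical form, for illustration; the exact treeWAS calculation is described in the issue linked above.

```r
## Count-based (empirical) p-value: the proportion of null scores at
## least as extreme as the observed score (with the usual +1 correction
## so that p is never exactly zero).
empirical.p <- function(score, null.scores) {
  (sum(null.scores >= score) + 1) / (length(null.scores) + 1)
}
```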

Once we have this p-value distribution, we apply the Bonferroni correction to get the significance threshold. So, instead of drawing the threshold below the top 0.01 (i.e., 1%) of scores in the null distribution, we draw it below the top 0.01/(n.SNPs * n.Scores), where n.SNPs is the number of loci (columns) tested in your genetic data matrix and n.Scores is 3 by default, for the three association scores run by treeWAS.
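Schematically, the thresholding works like this (illustrative R with stand-in values, not the treeWAS internals):

```r
## Stand-ins for illustration only:
n.snps      <- 1000                # loci/columns tested in the data matrix
n.scores    <- 3                   # the three association scores
null.scores <- rnorm(n.snps * 10)  # stand-in for the simulated null scores

p.value <- 0.01

## Bonferroni-corrected tail probability:
alpha.corrected <- p.value / (n.snps * n.scores)

## The significance threshold is drawn below the top alpha.corrected
## fraction of the null distribution:
thresh <- quantile(null.scores, probs = 1 - alpha.corrected)
```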

Your suggestion that we might want to run the Bonferroni correction only on the number of unique columns rather than the total number is an interesting one. Of course for efficiency reasons we run the tests only on unique columns, but at the moment we are applying the Bonf correction to all columns. I will have to think for a moment about my justification for this particular choice. But for now I will leave you with this response, as I know you have been waiting for a while to hear back while I was away.

I'll be happy to answer any further questions though.

Best wishes, Caitlin.

caitiecollins commented 6 years ago

Regarding our decision to apply the Bonf correction to the total number of sites rather than the total number of unique sites: I don't think the unique-only approach is, technically speaking, an entirely appropriate use of the Bonferroni correction.

Suppose we had 10 loci sharing one identical pattern. If we found one instance of this pattern to be significant, it is not necessarily the case that we would believe, with equal confidence, that all 10 are significant.

Think about how the Bonf correction is applied within treeWAS --- to determine the location of the significance threshold within the upper tail of the null distribution. The null distribution is effectively just a histogram of the association scores from the internally-simulated dataset. So, when determining where the Bonf-corrected threshold should lie, it is not only the values of these "null" association scores that matter, but also the number of loci with association scores at each value. This is why we have chosen to apply the Bonferroni correction to all sites rather than only to unique sites.
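Here is a toy numeric example of that point (made-up scores and an exaggerated tail probability, purely for illustration):

```r
## Eight null scores, three of which are duplicates of one pattern:
null.scores <- c(0.1, 0.2, 0.2, 0.2, 0.3, 0.4, 0.5, 0.9)
alpha <- 0.25  # exaggerated tail probability for illustration

## All sites: the duplicated scores add mass to the histogram...
quantile(null.scores, probs = 1 - alpha)          # 0.425

## ...unique sites only: collapsing the duplicates shifts where the
## same tail probability cuts the distribution.
quantile(unique(null.scores), probs = 1 - alpha)  # 0.475
```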

I can see that there are at least some intuitive arguments that might support the unique-only approach. I'd be more than happy to hear the other side, particularly if you have any statistical/theoretical reasoning that might incline us towards this alternative. It wouldn't be hard to implement, if you think you'd like to have this as an option.

Cheers, CC

cizydorczyk commented 6 years ago

Hi Caitlin,

My only argument for applying the Bonferroni correction to only unique sites would be that identically segregating sites represent a single unit of association, rather than several independent units. As such, rather than identifying a single SNP associated with a phenotype, you would be associating a “block” of SNPs. But I think you are right – I was approaching this from a basic/intuitive perspective, as I unfortunately do not have a very extensive statistical background.

If I may ask one more question – I seem to get drastically differing results depending on the starting seed I use to run treeWAS. Loci that appear as significant by one or more of the statistical tests in one run may not appear in any other runs. I would think this is due to the random component of treeWAS, but I was not expecting that level of variance.

I was wondering if you could provide some advice on how to interpret such results.

Any response/help is much appreciated!

Best regards, Conrad Izydorczyk


caitiecollins commented 6 years ago

Hi Conrad,

Yes, I see the appeal of the perspective you offer on the unique-only approach, though I do think that, theoretically, the all-sites approach is more firmly grounded. I also generally prefer to err on the conservative side, so that users can be as confident in their results as possible. The downside of the more conservative all-sites approach would of course be that you might miss some true positive findings. In practice, at least in looking at the results of a lot of tests on simulated data, I have found that this more conservative approach does not have a very serious or even very noticeable effect on results/performance. Then again, if you'd like to try the more permissive unique-sites approach, I could make it an option.

Regarding the variability you are seeing, you're correct about its source: the seed affects the internal data simulation process, so results will vary with different seeds. That said, you don't want results to vary too wildly. I have occasionally seen one or two significant findings disappear and reappear with different seeds; that level of variability is probably a good indication that you should try increasing the number of loci being simulated.

By default, the "n.snps.sim" argument is set to 10 times the number of columns in your dataset. Particularly with smaller datasets, this may give you a poorly-defined null distribution. To get a less variable null distribution (and thus more stable results), you can try increasing n.snps.sim to 50*ncol(snps) or 100*ncol(snps). This should give you more reliable results. If it doesn't, or if you think it is behaving strangely, just let me know and we'll sort it out.
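For example (a sketch of the call only, with your own snps, phen, and tree objects substituted in; see ?treeWAS for the full argument list):

```r
library(treeWAS)

out <- treeWAS(snps = snps,
               phen = phen,
               tree = tree,
               n.snps.sim = 100 * ncol(snps),  # up from the default 10 * ncol(snps)
               seed = 1)
```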

Cheers, Caitlin.

caitiecollins commented 6 years ago

Hi Conrad,

I'll definitely try and sort this out for you. I just wanted to ask: would you be able to send me your snps, phen, and tree for the relevant analysis/analyses? That way, I could probably get a better sense of the problem and track down its origin more quickly. Would you be comfortable sharing your data with me, either here or at caitiecollins@gmail.com?

Cheers, CC