AdmiralenOla / Scoary

Pan-genome wide association studies
GNU General Public License v3.0

P value calculations #21

Closed · alexjironkin closed this issue 8 years ago

alexjironkin commented 8 years ago

Hi,

Nice idea for utilising Roary output to do GWAS, but I am a little concerned that people are going to be using this and reporting results when it's not appropriate to do so.

For example, you have already pointed out that Fisher's test is simply not appropriate for population structure reasons. To an untrained eye, low P-values look like good evidence for the hypothesis, when in actual fact that is not an appropriate statistic to apply.

It's a neat attempt to deal with this using non-intersecting contrasting pairs, although it would have been nice if you had included a definition of what that actually is. You are still faced with the same problem (perhaps a worse one) of not really dealing with population structure, and of applying a test that hinges on independence of trials. Plus, you have picked p=0.5 for the binomial distribution, but do not justify it! Is it appropriate for all kinds of trees? Is it species-specific? Why not 0.3657849, or 0.6903453?

Obviously I wish you luck and hope you find a good solution for dealing with population structure, but for people who are naive about statistics and just looking for low P-values, this is a dangerous tool. It should be made clear that users should talk to a statistician/bioinformatician if they don't know what they are doing and just want to apply the tool to their Roary analysis.

I hope you will add Python bindings to existing tools for doing this from Roary output:

https://github.com/jessiewu/bacterialGWAS and https://github.com/sgearle/bugwas are from the same paper.

AdmiralenOla commented 8 years ago

Hi,

Thank you for mentioning your concerns. First let me say that the readme was intended by me as a quick start guide, not a scientific article or a textbook in statistics. I might consider expanding it or creating a wiki where the different processes under the hood are justified and explained more thoroughly.

I apply Fisher's test as a sort of filtering step. Genes whose Fisher's test p-values are lower than the cutoff are passed on to the pairwise comparisons algorithm. I think this is reasonable. If there is no apparent association between a gene and a phenotype when isolates are independent (i.e. a star phylogeny), then it's unlikely that you will find an association after population structure correction.
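That filtering logic can be sketched in pure Python (a minimal illustration, not Scoary's actual code; the 2x2 counts and the 0.05 cutoff are hypothetical, and the exact test is computed directly from the hypergeometric distribution):

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]]
    (e.g. gene present/absent vs. trait present/absent)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def pmf(x):  # hypergeometric probability of observing x in cell a
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # sum over all tables at least as unlikely as the observed one
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs + 1e-12)

# Hypothetical counts: 8 isolates have gene+trait, 2 gene only,
# 1 trait only, 5 neither.
naive_p = fisher_exact(8, 2, 1, 5)
if naive_p < 0.05:  # illustrative cutoff
    pass  # gene becomes a candidate for pairwise comparisons
```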

You go on to say that pairwise comparisons do not really deal with population structure, and this is where I have to disagree with you. How so? It is true that the method does not use branch lengths, but I don't see how that is equivalent to saying that it does not deal with the problem. The test does make the statistically unreasonable assumption that the pairs chosen are independent, this is true. But this is not something that I have conjured out of a hat; it is methodology that has been used for over 20 years (possibly longer), and is considered superior to other "population structure correction" techniques by many. (See for example Maddison and FitzJohn, 2014.) The same article explains the shortcomings of methods using phylogeny decomposition and specified evolutionary models.

p=0.5 is not species-specific. The algorithm first selects pairs that contrast in gene and trait status (in other words, both AB-ab and Ab-aB pairs are chosen), but it is blind to whether pairs are AB-ab or Ab-aB. Then, after finding the maximum number of pairs, it counts the number of AB-ab pairs and the number of Ab-aB pairs. A priori, if A were independent of B, there would be no reason to assume that this proportion would be anything other than 0.5 (± random variation).
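To make the counting step concrete, here is a toy sketch (my own illustration, not Scoary's tree-walking code, which picks the maximal set of non-intersecting pairs from the phylogeny): once the contrasting pairs are fixed, each is classified by whether the gene-carrying isolate also carries the trait.

```python
# Each isolate is (gene, trait) with 1 = present, 0 = absent.
# A contrasting pair differs in both gene and trait status, so exactly
# one member of each pair carries the gene.
def classify(pair):
    with_gene = pair[0] if pair[0][0] == 1 else pair[1]
    return "AB-ab" if with_gene[1] == 1 else "Ab-aB"

# Hypothetical pairs already selected from the tree:
pairs = [((1, 1), (0, 0)),   # AB-ab: gene and trait co-occur
         ((1, 0), (0, 1)),   # Ab-aB: gene and trait anti-occur
         ((1, 1), (0, 0)),
         ((1, 1), (0, 0))]

counts = {"AB-ab": 0, "Ab-aB": 0}
for p in pairs:
    counts[classify(p)] += 1
# Under independence of A and B, each pair is AB-ab with probability
# 0.5, so these counts feed a binomial test with p = 0.5.
print(counts)  # {'AB-ab': 3, 'Ab-aB': 1}
```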

I will note that my documentation could be improved, but I remain adamant that my approach is both correct and up-to-date with current understanding of the subject.

alexjironkin commented 8 years ago

Thanks for your reply,

  1. My point about Fisher's test is that because it violates a fundamental assumption required for the test, how can you assign confidence to the P-values it returns?
  2. I think you missed my point, or I didn't communicate it well: whether it "sort of" deals with population structure depends on your data. If you are looking at a family- or genus-level comparison, that probably works quite well. If you are looking at a clonal species like Salmonella, the topology of the tree alone doesn't work anymore. This is why it has been used for 20 years, but in studies where you can reasonably assume your samples might as well be independent, for example isolates from patients and outbreaks that are almost identical. It all really depends on the data and the population at hand. To my mind, there is no universal method that solves this issue.
  3. Again, what I am trying to say is that both the choice of 0.5 and the choice of a binomial test are ad hoc. I guess you can kind of say p=0.5, because why would you assume otherwise. My point is: is this a test that can be generally applied? Probably not. If it is, that needs to be justified by fitting the data to the model; then you can state with a reasonable degree of confidence that that test and that set of parameters are justified. For example, why not Poisson? Or even standard normal? Plus, when looking at the core/accessory genome there are going to be gaps as a function of an inability to call a gene, not because it's really missing. Which means you will start defining more and more non-intersecting contrasting pairs, when actually that's not the case.

alexjironkin commented 8 years ago

I was thinking a little more about the p parameter for the binomial test, and I don't think you can assume 0.5 either. The prior is again linked to the population and the species in question. For example, for core genes it should be something higher, because you expect core genes to be associated with the trait. For accessory genes, like plasmid/virulence/resistance genes, you probably wouldn't expect it to be that high. Of course, each one would be specific to the gene/data at hand. So I don't think 0.5 is justified, really.

I wonder if you could bootstrap the contrasting pairs across the whole tree.

AdmiralenOla commented 8 years ago

Thanks for your replies.

  1. The p-values from Fisher's test are not meant to be interpreted independently of other information. I have changed their name to "Naive_p" to emphasize that now. As I mentioned, the main function of the test is as a filtering procedure to decide which genes are tested with pairwise comparisons. However, in clonal populations it might actually be interesting in itself, if the internal population structure is negligible.
  2. To me, this seems to be an argument in favour of reporting p-values from both Fisher's test and from pairwise comparisons. Fisher's test would be valuable for clonal populations, while pairwise comparisons would be valuable in structured populations. Scoary can't decide which is more appropriate for a given sample, and I really wouldn't have any idea about how to set such a cut-off. This is why both p-values are reported.
  3. I don't see how Poisson or standard normal could be more relevant here, and I wonder if this is because of my admittedly poor explanation of the test logic. The algorithm first finds the maximum number of non-intersecting pairs that contrast in both gene status and trait status. (At this point there is no p-value attached to the test.) We then look at what states could actually make up such a pairing. Since there could be many (hundreds or thousands of) arbitrary ways of selecting pairs, we find the best and worst possible pairing. Imagine that you know you have a contrasting pair, but you don't know the states (AB-ab or Ab-aB). If you think that A -> B, then an AB-ab pair would be a success and an Ab-aB pair would be a non-success. I think this is quite clearly a Bernoulli-type experiment, and if the state of A/a were independent of B/b then you would expect p=0.5 for both possible pairs. Thus, the binomial test with p=0.5 is the logical choice.
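Under that framing, the exact test itself is straightforward. A minimal pure-Python sketch (the pair counts are invented for illustration; this is not Scoary's implementation):

```python
from math import comb

def binom_two_sided(k, n):
    """Exact two-sided binomial test of k successes in n Bernoulli
    trials under the null hypothesis p = 0.5."""
    probs = [comb(n, i) / 2 ** n for i in range(n + 1)]
    # two-sided: sum all outcomes no more likely than the observed one
    return min(1.0, sum(q for q in probs if q <= probs[k] + 1e-12))

# Hypothetical example: the maximum number of non-intersecting
# contrasting pairs is 12.  The best possible pairing makes 11 of
# them supportive (AB-ab); the worst makes only 8 supportive.
best_p = binom_two_sided(11, 12)   # ~0.0063
worst_p = binom_two_sided(8, 12)   # ~0.39
# The p-value for the true (unknown) pairing lies between these two.
```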

You say "when looking at the core/accessory genome there are going to be gaps as a function of inability to call gene, not because it's really missing. Which means, you will start defining more and more non-intersecting contrasting pairs, when actually that's not the case." No, this shouldn't happen. Yes, the gene will be uncalled in some isolates, but this will not systematically inflate the number of possible pairs unless there is a correlated measurement error in the phenotype. Otherwise it will just behave as noise, lowering your sensitivity.

I disagree with you that 0.5 is not justified. The null hypothesis tested in that binomial test is that of independence between A and B, not "A is a little bit related to B, but not too much, and a bit more if it is a core gene." Scoary merely decides whether there is evidence for an association, and (as with all software that I'm aware of) the biologists will have to be the ones who decide whether it is an interesting association or not.