caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes

Huge peak in the distribution of terminal scores #55

Closed pchamely-2 closed 3 years ago

pchamely-2 commented 3 years ago

Hi there, I ran treeWAS on a set of clinical (32) and environmental (59) isolates of Bacillus cereus, using the output of CD-HIT clustering to generate my gene dataset. In the output of the treeWAS run, I noticed that a large proportion of the genes have exactly the same association score (-0.29), linked to the environmental phenotype.

[Image: Screen Shot 2020-11-20 at 8 36 01 AM]

This is also seen in the huge peak when plotting the distribution of terminal scores.

[Image: 000002]

This seemed a bit strange to me, and I was wondering if it could be because the number of isolates we have for each phenotype is a bit unbalanced (i.e. because we have 27 more environmental isolates, will more things associate with that phenotype?).

[Image: null_distribution]
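
As a rough check on the imbalance idea, here is a minimal sketch in Python (not treeWAS itself). It assumes the terminal score behaves like a signed mean agreement between a binary genotype and a binary phenotype across the tips: +1 where the two states match, -1 where they differ, averaged over all isolates. That formula, the helper name `terminal_score`, and the coding of clinical as 1 and environmental as 0 are assumptions made for illustration, not taken from the treeWAS source. Under those assumptions, a gene present in every isolate gets a score fixed entirely by the 32/59 split.

```python
def terminal_score(genotype, phenotype):
    """Signed mean agreement between binary genotype and phenotype at the tips.

    Hypothetical helper: an assumed form of the terminal score,
    not the treeWAS implementation.
    """
    if len(genotype) != len(phenotype):
        raise ValueError("genotype and phenotype must have the same length")
    # +1 where the states match, -1 where they differ
    total = sum(1 if g == p else -1 for g, p in zip(genotype, phenotype))
    return total / len(genotype)


# Assumed coding: 1 = clinical, 0 = environmental
phenotype = [1] * 32 + [0] * 59

# A gene present in all 91 isolates (no variation at the tips)
core_gene = [1] * 91

score = terminal_score(core_gene, phenotype)
print(round(score, 4))  # -0.2967, i.e. (32 - 59) / 91
```

If this reading is right, every invariant or near-invariant gene cluster would land on the same value, close to the -0.29 seen in the output, which might account for the peak. That would still need to be confirmed against how treeWAS actually computes and filters its scores.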