AdmiralenOla / Scoary

Pan-genome wide association studies
GNU General Public License v3.0

Large number of significant genes #99

Closed LeonardosMageiros closed 2 years ago

LeonardosMageiros commented 2 years ago

Hi,

I have a dataset of ~1500 S. aureus strains from different hosts. I ran Scoary on the output of Roary, but I think I am getting too many significant results. Specifically, I have ~1500 genes with a Bonferroni-corrected p-value < 0.01, and this only drops to ~1400 genes when I lower the threshold to 0.001.

Is this normal? Is there an explanation of why I have so many genes?

For reference, I only supply the -g and -t input files and leave everything else at its default value. The phenotype in my -t file is binary, of course.

Is there any chance this number would go down if I supplied a phylogenetic tree?

Thank you very much in advance for your time and help.

Kind regards, Leonardos

AdmiralenOla commented 2 years ago

Hi Leonardos,

I doubt that supplying a phylogenetic tree to Scoary would change much.

The Fisher/Bonferroni p-values assume that your isolates are all equally related to each other, or at least that there is no systematic pattern of relatedness between them that could influence the association between genes and traits. This, of course, is never the case in real life.

It is likely that your data contains many pseudoreplicates: isolates that cluster together on a phylogenetic tree and share much of their genetic make-up as well as their phenotypic characteristics. I don't know much about S. aureus, but I'm assuming you have nearly complete separation by host? In other words, all the human isolates cluster together, all the dog ones (or whatever hosts you have) cluster together, and so on?
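To make the pseudoreplication point concrete, here is a minimal sketch in plain Python. It is not Scoary's code: the helper names `fisher_exact_two_sided` and `bonferroni`, the gene count `N_GENES` and all the isolate counts are made up for illustration. It compares one gene/trait table from 20 unrelated isolates with the same table after each isolate has been sampled ten times as near-identical clones.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: trait positive / trait negative.
    Columns: gene present / gene absent.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)


N_GENES = 5000  # hypothetical number of genes tested

# 20 unrelated isolates: gene present in 8/10 trait-positive
# and 3/10 trait-negative isolates.
p_independent = fisher_exact_two_sided(8, 2, 3, 7)

# The same 20 lineages, each sampled 10 times (clonal pseudoreplicates).
# The proportions are identical, but the test treats all 200 as independent.
p_clonal = fisher_exact_two_sided(80, 20, 30, 70)

print("independent:", p_independent, bonferroni(p_independent, N_GENES))
print("clonal:     ", p_clonal, bonferroni(p_clonal, N_GENES))
```

The proportions are the same in both tables, but the second one clears the Bonferroni threshold easily, because the test counts every clone as a separate observation. With strong clustering by host, a similar thing can happen for every gene that marks a lineage.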

In that case it's not unheard of to find that many genes associated with each of the host types. Just remember that there is not necessarily any causal link between the genes you find and specialization to the host niche.

All the best, Ola