Genomic associations between subspecies

karchern commented 6 years ago

From what I read, Scoary currently cannot work with non-binary traits. I want to use Scoary to determine the pangenomic differences between three apparent subspecies of my bacterium of interest. There appears to be a pretty strong signal, as the genomes cluster distinctly in a PCoA based on gene presence/absence data. Specifically, I would like to find out which genes are differentially prevalent between the three clusters. Can I supply a trait file that has "dummy variables", one column per pair of clusters, something like the table below? My approach should work if Scoary simply removes those samples that have no information for a specific trait. What do you think about this?

Sample_name          Comp_clust_1_2    Comp_clust_1_3    Comp_clust_2_3
member_cluster_1     0                 0                 NA/empty
member_cluster_1     0                 0                 NA/empty
...
...
member_cluster_2     1                 NA/empty          0
member_cluster_2     1                 NA/empty          0
...
...
member_cluster_3     NA/empty          1                 1
member_cluster_3     NA/empty          1                 1
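
For illustration, a trait table like the one above could be generated from a sample-to-cluster mapping along these lines. This is only a sketch: the function names, the genome names, and the choice of "NA" as the missing-value marker are assumptions made for the example, so check the Scoary documentation for the exact trait-file format it expects.

```python
import csv
import io
from itertools import combinations


def pairwise_dummy_traits(cluster_of, missing="NA"):
    """Build one 0/1 trait column per pair of clusters.

    cluster_of maps sample name -> cluster label (hypothetical input).
    For the column comparing clusters a and b, members of a get "0",
    members of b get "1", and everyone else gets the missing marker.
    """
    clusters = sorted(set(cluster_of.values()))
    pairs = list(combinations(clusters, 2))
    header = ["Sample_name"] + [f"Comp_clust_{a}_{b}" for a, b in pairs]
    rows = []
    for sample, cluster in cluster_of.items():
        row = [sample]
        for a, b in pairs:
            if cluster == a:
                row.append("0")
            elif cluster == b:
                row.append("1")
            else:
                row.append(missing)
        rows.append(row)
    return header, rows


def to_csv(header, rows):
    """Serialise the trait table as comma-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Made-up genome names and cluster labels, for demonstration only.
example = {"genome_A": 1, "genome_B": 1, "genome_C": 2, "genome_D": 3}
header, rows = pairwise_dummy_traits(example)
print(to_csv(header, rows))
```

Each column then only contrasts two clusters, and samples from the third cluster carry the missing marker in that column.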

AdmiralenOla commented 6 years ago

It is indeed possible to use Scoary this way.

I would recommend using the --no_pairwise flag if you do this, since pairwise comparisons as implemented in Scoary do not really make sense if you're looking at enrichment in groups rather than at variants with a causal hypothesis (the presence of a certain gene CAUSING the phenotype).

By splitting your genomes into a sort of pseudo-phenotype using PCoA, as you have done, you are (to a certain extent) handling spurious findings from population structure. This is similar to what is done in many (most) GWA studies.

A possible problem would arise if, within one of your clusters (principal components), you had some fairly different genomes and then a bunch of almost identical outbreak genomes. Then your results might show enrichment of genes present only in the outbreak genomes, even if these are lacking in the other genomes within the same cluster. I think one way of handling this could be to add more principal components, which in most cases correspond well to lineages.

I hope this makes sense, and if not please fire away!

karchern commented 6 years ago

Hi Ola, thank you very much for your detailed answer.

Am I right in assuming that running Scoary with the --no_pairwise flag is essentially equal to running a Fisher's exact test for each gene WITHOUT taking into account the structure of the (phylogenetic) tree of samples (as Scoary would do without the --no_pairwise flag)?

Cheers, Nic

AdmiralenOla commented 6 years ago

That's correct!

This would be an adequate way of measuring between-group differences in gene enrichment unless you have a large number of pseudoreplicates within your groups.
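
To make the per-gene test concrete, here is a minimal sketch of a two-sided Fisher's exact test on a gene-by-trait 2x2 table, with samples lacking a trait value skipped. It is not Scoary's own code: the function names, sample names, and the "0"/"1"/"NA" encoding are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def gene_trait_table(samples_with_gene, trait_of):
    """Count (gene+ trait 1, gene+ trait 0, gene- trait 1, gene- trait 0).

    Samples whose trait value is neither "0" nor "1" are ignored.
    """
    a = b = c = d = 0
    for sample, value in trait_of.items():
        if value not in ("0", "1"):
            continue  # no information for this comparison
        has_gene = sample in samples_with_gene
        if has_gene and value == "1":
            a += 1
        elif has_gene:
            b += 1
        elif value == "1":
            c += 1
        else:
            d += 1
    return a, b, c, d


# Made-up data: the gene is present in every "1" genome and no "0" genome.
trait = {"g1": "0", "g2": "0", "g3": "0",
         "g4": "1", "g5": "1", "g6": "1", "g7": "NA"}
gene_present_in = {"g4", "g5", "g6", "g7"}

table = gene_trait_table(gene_present_in, trait)
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

Because every genome counts as an independent observation here, near-identical genomes within a group would inflate the apparent signal, which is the pseudoreplicate caveat mentioned above.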

karchern commented 6 years ago

Thanks a lot, Ola! I'm closing this issue.

AdmiralenOla / Scoary

Genomic associations between subspecies #64