AdmiralenOla / Scoary

Pan-genome wide association studies
GNU General Public License v3.0
Question about --collapse #94

Open Antonia-Chalka opened 3 years ago

Antonia-Chalka commented 3 years ago

I have a very basic question about how the --collapse flag determines grouping. Does it collapse genotypes that have the exact same distribution across all the samples, or is some other type of correlation statistic used to determine that (and if so, what is it and what is the threshold)?

Both the README and the paper note the following:

For each phenotype supplied via columns in the traits file, Scoary does the following: first, correlated genotype variants are collapsed. Plasmid genes, for example, are typically inherited together rather than as individual units and Scoary will collapse these genes into a single unit.

Antonia-Chalka commented 3 years ago

From a quick look at the code in the methods script, it seems the correlation has to be perfect, but there's also a mention of a 'softer' threshold, so I'm not 100% sure 😅

AdmiralenOla commented 2 years ago

Thanks for your question, and sorry about the wait.

As you have already figured out, the genotypes need to be 100% correlated to be collapsed. You may also have seen from the code that I thought about using a softer threshold, but I have never gotten around to implementing that.
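To make the "100% correlated" rule concrete, here is a minimal sketch of exact-match collapsing. It is not Scoary's actual code: the function name `collapse_identical`, the dict-of-tuples layout, and the `--` separator for joined names are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse_identical(presence):
    """Group genes whose presence/absence vectors are exactly equal.

    presence: dict mapping gene name -> tuple of 0/1 values, one per isolate
              (a hypothetical layout, not Scoary's internal representation).
    Returns a dict mapping a joined group name -> the shared pattern.
    """
    groups = defaultdict(list)
    for gene, pattern in presence.items():
        # Genes with the same tuple end up in the same bucket.
        groups[tuple(pattern)].append(gene)
    return {"--".join(sorted(genes)): pattern for pattern, genes in groups.items()}


presence = {
    "geneA": (1, 1, 0, 0),
    "geneB": (1, 1, 0, 0),  # identical to geneA, so the two collapse
    "geneC": (1, 0, 0, 0),  # differs in one isolate, so it stays separate
}
collapsed = collapse_identical(presence)
```

In this toy input, `geneA` and `geneB` become a single unit, while `geneC` is left alone even though it differs in only one isolate.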

I'm also a bit uncertain how the distribution of the collapsed variant should be counted, i.e. should it be present in all isolates with either of the original variants? I'm uncertain how that would impact other assumptions that are made.
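The counting question can be illustrated with a small sketch that compares two possible rules for a collapsed variant. This is purely hypothetical: `merge_patterns` and its `rule` argument are names invented here, and nothing like this exists in Scoary.

```python
def merge_patterns(a, b, rule="union"):
    """Combine two 0/1 presence vectors into one collapsed pattern.

    rule="union":        present where either original variant is present.
    rule="intersection": present only where both original variants are present.
    Both rules are illustrative assumptions, not implemented behaviour.
    """
    if len(a) != len(b):
        raise ValueError("patterns must cover the same isolates")
    if rule == "union":
        return tuple(x | y for x, y in zip(a, b))
    if rule == "intersection":
        return tuple(x & y for x, y in zip(a, b))
    raise ValueError(f"unknown rule: {rule}")


gene_a = (1, 1, 1, 0, 0)
gene_b = (1, 1, 0, 0, 0)  # differs from gene_a in the third isolate only

as_union = merge_patterns(gene_a, gene_b, "union")
as_intersection = merge_patterns(gene_a, gene_b, "intersection")
```

The two rules disagree on the third isolate, so the choice changes the counts that any downstream test would see for the collapsed unit.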

Another thing I'm not sure about is whether the collapsed genes should then go through subsequent rounds of correlation -> collapse. That is, when we collapse two genes into one, this will have a new distribution pattern, and there is a chance that this new pattern will fall within the correlation threshold of being collapsed with yet another gene.
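The cascading effect described above can be shown with a toy sketch. Everything in it is an assumption for illustration: similarity is measured as the fraction of isolates where two patterns agree, collapsed units take the union of their members, and merging is greedy and repeated until nothing changes. None of this is implemented in Scoary.

```python
from itertools import combinations


def agreement(a, b):
    """Fraction of isolates where two 0/1 patterns agree (assumed metric)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def collapse_until_stable(presence, threshold):
    """Greedily merge any pair at or above the threshold, then start over."""
    units = dict(presence)
    merged = True
    while merged:
        merged = False
        for g1, g2 in combinations(sorted(units), 2):
            if agreement(units[g1], units[g2]) >= threshold:
                # Assumed rule: the collapsed unit is the union of both patterns.
                union = tuple(x | y for x, y in zip(units[g1], units[g2]))
                del units[g1], units[g2]
                units[g1 + "--" + g2] = union
                merged = True
                break  # patterns changed, so rescan all pairs
    return units


presence = {
    "geneA": (1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
    "geneB": (1, 0, 1, 0, 0, 0, 0, 0, 0, 0),
    "geneC": (1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
}
result = collapse_until_stable(presence, threshold=0.8)
```

Here `geneC` agrees with `geneA` and with `geneB` in only 7 of 10 isolates, so it would not be collapsed with either on its own. Once `geneA` and `geneB` are merged, the union pattern agrees with `geneC` in 8 of 10 isolates, and all three end up in one unit.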