Closed: rsbrennan closed this issue 5 years ago
For genotype data, the point of scaling is to avoid giving too much weight to common variants, which have the largest variance. By scaling, you give the same importance to each variant. We use the standard Patterson scaling.
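To make the effect of scaling concrete, here is a minimal Python sketch, not pcadapt's actual code. It assumes genotypes coded as 0/1/2 copies of an allele and divides by the binomial standard deviation sqrt(2p(1-p)), which is one common convention for Patterson-style scaling (the constant differs between implementations). The helper names `patterson_scale` and `variance` and the toy genotype vectors are made up for illustration.

```python
import math

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def patterson_scale(genotypes):
    """Center and scale one variant's 0/1/2 genotypes (hypothetical helper).

    Assumption: the scaling factor is sqrt(2 * p * (1 - p)), where p is the
    allele frequency estimated from the genotypes themselves.
    """
    p = sum(genotypes) / (2 * len(genotypes))  # estimated allele frequency
    if p == 0.0 or p == 1.0:
        raise ValueError("monomorphic variant cannot be scaled")
    sd = math.sqrt(2 * p * (1 - p))
    return [(g - 2 * p) / sd for g in genotypes]

# Toy data: one common variant and one rare variant in 8 individuals.
common = [0, 1, 1, 2, 1, 0, 2, 1]  # allele frequency 0.5
rare = [0, 0, 0, 1, 0, 0, 0, 0]    # allele frequency 0.0625

raw_common, raw_rare = variance(common), variance(rare)
scaled_common = variance(patterson_scale(common))
scaled_rare = variance(patterson_scale(rare))

print("raw variances:   ", raw_common, raw_rare)
print("scaled variances:", scaled_common, scaled_rare)
```

Before scaling, the common variant has several times the variance of the rare one, so it would dominate the PCA. After scaling, both variances are close to 1.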
For pooled data, we evaluated scaled and non-scaled versions, and the non-scaled version performed better. Assume you have 3 pops. If you scale, a variant with frequencies 0.95, 0.5 and 0.05 will have a test statistic (and p-value) similar to that of a variant with frequencies 0.51, 0.5 and 0.49. To avoid that, we do not scale in the pooled version.
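The 3-pop example can be checked numerically. The sketch below is only an illustration of the argument, assuming "scaling" means per-variant standardization (subtract the mean across pools, divide by the sample standard deviation, as `prcomp(..., scale. = TRUE)` does per column). The helper names `center` and `standardize` are hypothetical.

```python
import math

def center(freqs):
    """Subtract the across-pool mean from one variant's frequencies."""
    mean = sum(freqs) / len(freqs)
    return [f - mean for f in freqs]

def standardize(freqs):
    """Center, then divide by the sample standard deviation (n - 1)."""
    centered = center(freqs)
    sd = math.sqrt(sum(c ** 2 for c in centered) / (len(freqs) - 1))
    return [c / sd for c in centered]

strong = [0.95, 0.50, 0.05]  # large differentiation between pools
weak = [0.51, 0.50, 0.49]    # almost no differentiation

# Centered only: the two variants stay clearly distinguishable.
print("centered strong:", center(strong))
print("centered weak:  ", center(weak))

# Standardized: both variants collapse to (almost) the same vector.
z_strong = standardize(strong)
z_weak = standardize(weak)
print("scaled strong:  ", z_strong)
print("scaled weak:    ", z_weak)
```

After standardization both variants are approximately (1, 0, -1), so any statistic computed from the scaled values cannot tell a strongly differentiated variant from a nearly uniform one. Without scaling, the 0.45 versus 0.01 difference in magnitude is preserved.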
makes sense. appreciate the response.
I'm wondering why the PCA for individual genotypes uses centered and scaled data while allele frequencies of pooled data are not scaled or centered (for example, here: https://github.com/bcm-uga/pcadapt/issues/11).
I'm asking because when I use scaled vs. unscaled frequencies, the relative relationships between samples change. See attached plots. For example, on PC3 for the unscaled allele frequencies, the square samples are most divergent. When using scaled frequencies, the circle samples are most divergent along PC3. When running PCAdapt and restricting SNPs to PC3, I pull out only variants that are highly divergent in allele frequency between the squares (which does make sense). I should note that I'm most interested in PC3 due to the experimental design.
If I run a PCA on these data using prcomp and pull out the loadings for the scaled and unscaled analysis, I get very different sets of variants that are driven by the shape that is farther apart in PC space. The unscaled PCA from prcomp is identical to the PCAdapt PCA (as far as I can tell).
I understand that scaling is typically used when the measures going in have different variances or are of different magnitude. Naively, I thought allele frequencies would not need to be scaled.
My question: why does PCAdapt consider using unscaled frequencies the correct approach in pooled data but scaled frequencies when using individual genotypes? Generally, I am surprised that the set of variants identified seems to differ so much between the approaches.