cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

PCA with missing genotypes #143

Open alimanfoo opened 7 years ago

alimanfoo commented 7 years ago

The allel.stats.decomposition.pca function does not allow for missing data in the genotype array. If I build the array with geno = genotypes.to_n_alt(), it works, but with geno = genotypes.to_n_alt(fill=-1) it does not. The problem with the first approach is that the default fill=0 makes missing calls look homozygous for the reference allele, which will greatly bias the results.
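To illustrate the fill behaviour described above, here is a minimal NumPy-only sketch. It does not call scikit-allel: the helper `n_alt` is a hypothetical stand-in that assumes `to_n_alt` counts non-reference alleles per call and writes `fill` wherever a call is missing, and the toy genotypes are made up for illustration.

```python
import numpy as np

# Toy diploid genotypes: 3 variants x 4 samples x 2 alleles.
# -1 marks a missing allele (invented data, for illustration only).
g = np.array([
    [[0, 0], [0, 1], [1, 1], [-1, -1]],
    [[0, 1], [-1, -1], [0, 0], [1, 1]],
    [[1, 1], [0, 1], [0, 0], [0, 0]],
], dtype="i1")


def n_alt(g, fill=0):
    """Hypothetical stand-in for to_n_alt (an assumption, not the library code).

    Counts alleles > 0 per call and writes `fill` where the call is missing.
    """
    out = (g > 0).sum(axis=-1).astype("i1")
    missing = (g < 0).any(axis=-1)
    out[missing] = fill
    return out


default_fill = n_alt(g)           # missing calls become 0
flagged = n_alt(g, fill=-1)       # missing calls stay distinguishable

# With fill=0 a missing call cannot be told apart from a 0/0 call.
print(default_fill[0, 0], default_fill[0, 3])
print(flagged[0, 0], flagged[0, 3])
```

With the default fill, the first variant's missing call and its genuine 0/0 call both come out as 0, which is the bias described above; with fill=-1 they differ, but the -1 values are then not valid input for a standard PCA.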

ghost commented 5 years ago

Has there been any progress regarding missing data?

alimanfoo commented 5 years ago

Hi @jregalad-o, I'm not sure there is an easy fix, unless you are aware of a general PCA implementation that can handle missing data. The standard workaround is to subset your variants and/or samples so that missingness is negligible. As an extra check, you can then run a PCA on the missingness itself, to see whether it is structured or random.
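A rough sketch of that workaround, using NumPy only and simulated data. The assumptions here are: missing calls are coded as -1 in an alt-allele count matrix, "negligible" is taken to mean no missing calls at all, and `pca_coords` is a hypothetical plain SVD-based PCA with per-variant centring only, not the scaling that allel.stats.decomposition.pca applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated alt-allele counts (variants x samples); -1 marks a missing call.
n_variants, n_samples = 200, 12
gn = rng.integers(0, 3, size=(n_variants, n_samples)).astype("i1")
gn[rng.random(gn.shape) < 0.02] = -1


def pca_coords(x, n_components=2):
    """Hypothetical helper: SVD-based PCA on a variants x samples matrix.

    Returns an array of shape (n_samples, n_components).
    """
    x = x.astype("f8")
    x = x - x.mean(axis=1, keepdims=True)  # centre each variant
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:n_components].T * s[:n_components]


missing = gn < 0

# Step 1: keep only variants with no missing calls, then run the PCA.
complete = ~missing.any(axis=1)
coords = pca_coords(gn[complete])

# Step 2: PCA on missingness itself (1 = missing, 0 = called),
# using only variants that have at least one missing call.
miss_coords = pca_coords(missing[missing.any(axis=1)])

print(complete.sum(), "of", n_variants, "variants retained")
print(coords.shape, miss_coords.shape)
```

If samples cluster in the missingness PCA (for example by batch or sequencing platform), the missingness is probably not random, and the genotype PCA should be interpreted with that in mind. A looser threshold than zero missing calls, or dropping high-missingness samples as well, may suit real data better.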

Did you have any other thoughts/ideas?