Open alimanfoo opened 7 years ago
Has there been any progress regarding missing data?
Hi @jregalad-o, I'm not sure there is an easy fix, unless you are aware of a general PCA implementation that can handle missing data. The standard workaround is to subset your variants and/or samples so that only negligible levels of missingness remain. An extra check you can then run is a PCA on the missingness itself, to see whether it has any structure or is random.
Did you have any other thoughts/ideas?
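For illustration, the workaround described above could be sketched roughly as follows. This is a minimal NumPy-only sketch on simulated data, not scikit-allel's own implementation: the toy array, the 5% missingness rate and the helper `pca_coords` are all assumptions made up for the example, and no variance scaling is applied. With scikit-allel you would derive the missingness mask and the alternate allele counts from your GenotypeArray instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diploid calls, shape (n_variants, n_samples, ploidy); -1 marks a missing allele.
# Both the data and the 5% missingness rate are made up for this sketch.
g = rng.integers(0, 2, size=(200, 12, 2))
g[rng.random((200, 12)) < 0.05] = -1

# A call counts as missing if any of its alleles is missing.
is_missing = (g < 0).any(axis=2)              # shape (n_variants, n_samples)


def pca_coords(x, n_components=2):
    """Hypothetical helper: plain SVD-based PCA without scaling.

    x has shape (n_variants, n_samples); returns (n_samples, n_components).
    """
    x = x - x.mean(axis=1, keepdims=True)     # centre each variant across samples
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:n_components].T * s[:n_components]


# 1. Keep only variants with no missing calls, then count alternate alleles.
keep = ~is_missing.any(axis=1)
gn = (g[keep] > 0).sum(axis=2).astype(float)
gn = gn[gn.std(axis=1) > 0]                   # drop variants with no variation
coords = pca_coords(gn)

# 2. PCA on the missingness matrix itself, to look for structure.
miss = is_missing.astype(float)
miss = miss[miss.std(axis=1) > 0]             # drop rows that are all-missing or all-called
miss_coords = pca_coords(miss)
```

If the samples separate clearly in `miss_coords`, the missingness is probably not random (for example, it may track batch or population), and filtering alone may not remove the bias.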
The allel.stats.decomposition.pca function does not allow for missing data in the genotype array. If I build the array using geno = genotypes.to_n_alt() it works, but with geno = genotypes.to_n_alt(fill=-1) it does not. The problem with the first approach, though, is that the default fill=0 makes missing calls appear as homozygous for the reference allele, which will greatly bias the results.
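To make the problem concrete, here is a small NumPy-only sketch of what the two fill values do. The helper `n_alt` is a hypothetical stand-in written for this example to mimic the behaviour described above (count non-reference alleles per call, substitute `fill` where the call is missing); it is not the library's code.

```python
import numpy as np

# Toy calls, shape (n_variants, n_samples, ploidy); -1 marks a missing allele.
g = np.array([
    [[0, 0], [0, 1], [-1, -1]],
    [[1, 1], [-1, -1], [0, 0]],
])


def n_alt(g, fill=0):
    """Hypothetical helper: alternate allele count per call, `fill` where missing."""
    missing = (g < 0).any(axis=2)
    out = (g > 0).sum(axis=2)
    out[missing] = fill
    return out


gn_default = n_alt(g)            # missing calls become 0, same as hom-ref
gn_flagged = n_alt(g, fill=-1)   # missing calls stay distinguishable as -1

print(gn_default)
print(gn_flagged)
```

In `gn_default` the missing calls are indistinguishable from homozygous reference calls, which is the source of the bias; in `gn_flagged` they remain visible but are no longer valid input for a standard PCA.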