genetics-statistics / GEMMA

Genome-wide Efficient Mixed Model Association
https://github.com/genetics-statistics/GEMMA
GNU General Public License v3.0

Allow and describe options for disabling filters #234

Open pjotrp opened 4 years ago

pjotrp commented 4 years ago

GEMMA has a number of filters, including maf, miss and r2, as well as conversions such as centering. All of these need an option for disabling them.

pjotrp commented 4 years ago

These are the existing switches:

GEMMA 0.98.3 (2020-09-29) by Xiang Zhou and team (C) 2012-2020
 SNP QC OPTIONS
 -miss     [num]           specify missingness threshold (default 0.05)
 -maf      [num]           specify minor allele frequency threshold (default 0.01)
 -hwe      [num]           specify HWE test p value threshold (default 0; no test)
 -r2       [num]           specify r-squared threshold (default 0.9999)
 -notsnp                   minor allele frequency cutoff is not used

pjotrp commented 4 years ago

GEMMA has a simple poly filter that removes genotype rows whose values are all identical, i.e. rows that carry a single genotype. There is no reason to make that optional.
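
To make the discussion concrete, here is a minimal Python sketch of how per-SNP missingness, MAF and poly filters of this kind could behave. It is not GEMMA's code: the function name `passes_qc`, the comparison rules (drop when the missing fraction exceeds the threshold, drop when the allele frequency falls outside the cutoff) and the use of `maf_level=None` to stand in for switching the MAF cutoff off are all assumptions made for illustration.

```python
import math


def passes_qc(row, miss_level=0.05, maf_level=0.01):
    """Return True if one SNP row passes the sketched filters.

    row: allele dosages in [0, 2], one per individual, float('nan') = missing.
    maf_level=None is this sketch's stand-in for disabling the MAF cutoff.
    """
    observed = [g for g in row if not math.isnan(g)]

    # Missingness filter (assumed rule): drop if the missing fraction
    # is strictly greater than the threshold.
    if 1.0 - len(observed) / len(row) > miss_level:
        return False

    # Poly filter: drop rows that carry a single genotype (or no calls at all).
    if len(set(observed)) < 2:
        return False

    # MAF filter (assumed rule): drop if the allele frequency is too close
    # to 0 or to 1.
    if maf_level is not None:
        freq = sum(observed) / (2.0 * len(observed))
        if freq < maf_level or freq > 1.0 - maf_level:
            return False

    return True


NA = float("nan")
snps = [
    [0, 1, 2, 0],   # polymorphic, no missing calls
    [1, 1, 1, 1],   # monomorphic
    [0, NA, 2, 1],  # one missing call
]

defaults = [passes_qc(r) for r in snps]
relaxed = [passes_qc(r, miss_level=1.0, maf_level=None) for r in snps]
strict_miss = [passes_qc(r, miss_level=0.0, maf_level=None) for r in snps]
print(defaults, relaxed, strict_miss)
```

Under these assumed rules, a missingness threshold of 1.0 keeps every SNP that has at least two distinct genotypes, while a threshold of 0 drops any SNP with a missing call. The monomorphic row is dropped in every case, because the poly filter is not optional.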

pjotrp commented 4 years ago

I added some documentation in the commit above. Essentially, disable the filters above with:

gemma -r2 1.0 -hwe 0 -miss 1.0 -notsnp ...

pjotrp commented 4 years ago

Before computing the GRM, GEMMA modifies the genotypes (after applying the filters above):

  1. Plug in the mean for missing genotypes
  2. Subtract the mean from all genotypes
  3. Scale the genotypes (with -gk 2; skipped with -gk 1)

Since you can generate your own GRM and load it into GEMMA, there is probably no point in disabling these.
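
For anyone who wants to build their own GRM, the three steps above can be sketched in plain Python as follows. This is an illustration and not GEMMA's implementation: the function name `grm`, the `standardize` flag (standing in for the -gk 2 behaviour), the variance estimate used for scaling and the final division by the number of SNPs are assumptions.

```python
import math


def grm(geno, standardize=False):
    """Sketch of a relatedness matrix from SNP rows.

    geno: list of SNP rows, each a list of dosages per individual,
          float('nan') = missing. Rows are assumed to be already filtered,
          so every row has at least one observed call.
    standardize: if True, also scale each SNP (assumed to mirror -gk 2).
    """
    n = len(geno[0])
    k = [[0.0] * n for _ in range(n)]
    used = 0

    for row in geno:
        observed = [g for g in row if not math.isnan(g)]
        mean = sum(observed) / len(observed)

        # Steps 1 and 2: plugging in the mean for a missing call and then
        # subtracting the mean leaves 0.0 at that position.
        centered = [0.0 if math.isnan(g) else g - mean for g in row]

        # Step 3 (optional): scale by the SNP's standard deviation.
        if standardize:
            var = sum(c * c for c in centered) / n
            if var == 0.0:
                continue  # a monomorphic row cannot be scaled; skip it
            sd = math.sqrt(var)
            centered = [c / sd for c in centered]

        # Accumulate the outer product of the processed row with itself.
        for i in range(n):
            for j in range(n):
                k[i][j] += centered[i] * centered[j]
        used += 1

    # Average over the SNPs that contributed (an assumption of this sketch).
    return [[v / used for v in r] for r in k]


NA = float("nan")
geno = [
    [0, 1, 2],
    [2, NA, 0],
]
k_centered = grm(geno)
k_standardized = grm(geno, standardize=True)
print(k_centered)
print(k_standardized)
```

A matrix built this way would still need to be written out in whatever format GEMMA expects for the file passed to -k, which this sketch does not cover.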

mpala80 commented 3 years ago

Hi, after specifying the filtering flags I am still losing SNPs:

number of total SNPs/var = 53211
number of analyzed SNPs/var = 47054

This is the command line:

Command Line Input = gemma -g imputed_genotypes.mgf.gz -lmm 1 -k kinship.cXX.txt.gz -maf 0 -hwe 0 -miss 0 -r2 1 -notsnp -p phenotype.phe -outdir gemma_test -o test

The log is below.

Thank you, Mauro

GEMMA Version = 0.98.1 (2018-12-10)
Build profile =
GCC version = 8.2.0
GSL Version = 2.5
Eigen Version = 3.3.5
OpenBlas = OpenBLAS 0.3.2 - DYNAMIC_ARCH NO_AFFINITY Sandybridge MAX_THREADS=6
arch = Sandybridge
threads = 6
parallel type = threaded

Command Line Input = gemma -g imputed_genotypes.mgf.gz -lmm 1 -k kinship.cXX.txt.gz -maf 0 -hwe 0 -miss 0 -r2 1 -notsnp -p phenotype.phe -outdir gemma_test -o test

Date = Thu Apr 22 11:50:18 2021

Summary Statistics:
number of total individuals = 7898
number of analyzed individuals = 5537
number of covariates = 1
number of phenotypes = 1
number of total SNPs/var = 53211
number of analyzed SNPs/var = 47054
REMLE log-likelihood in the null model = -7658.62
MLE log-likelihood in the null model = -7659.27
pve estimate in the null model = 0.412219
se(pve) in the null model = 0.0256184
vg estimate in the null model = 1.32772
ve estimate in the null model = 0.587324
beta estimate in the null model = -0.000792455
se(beta) = 0.0102992

Computation Time:
total computation time = 22.0958 min
computation time break down:
time on eigen-decomposition = 3.49347 min
time on calculating UtX = 5.42422 min
time on optimization = 6.84549 min

pjotrp commented 3 years ago

I don't have time to look into it now, but do note that GEMMA is pretty logical: when it drops genotypes, it has its reasons.