1. Original column names
List the original column names and their corresponding names in the QC’d file. The purpose is to identify column-labelling errors, e.g. for the effect allele and effect allele frequency.
2. Number of monomorphic SNPs
Count the number of monomorphic SNPs, i.e. SNPs with an allele frequency of 0 or 1 (MAF = 0).
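The count can be sketched in R as follows. The vector `eaf` (effect allele frequency) is a hypothetical name with toy values, not something taken from the QC pipeline itself.

```r
# Toy effect allele frequencies; the name `eaf` is an assumption.
eaf <- c(0, 0.25, 1, 0.5, NA)

# A SNP is monomorphic when its allele frequency is exactly 0 or 1.
# Missing frequencies are ignored here rather than counted.
n_monomorphic <- sum(eaf == 0 | eaf == 1, na.rm = TRUE)
```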
3. Number of SNPs missing information
Count the number of SNPs missing a P-value, beta, SE, EAF, effect allele or other allele.
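A minimal R sketch of this count, assuming the QC'd file has been read into a data frame with the hypothetical column names `pval`, `beta`, `se`, `eaf`, `effect_allele` and `other_allele`. Treating blank strings as missing is an additional assumption.

```r
# Toy data frame standing in for the QC'd summary statistics (column names are assumptions).
gwas <- data.frame(
  pval          = c(0.01, NA, 0.5, 0.2),
  beta          = c(0.10, 0.20, NA, -0.05),
  se            = c(0.05, 0.04, 0.03, NA),
  eaf           = c(0.30, 0.10, 0.25, 0.40),
  effect_allele = c("A", "C", "", "G"),
  other_allele  = c("G", "T", "A", NA),
  stringsAsFactors = FALSE
)

# Count missing entries per column; blank strings are counted as missing as well.
count_missing <- function(dat, cols) {
  sapply(cols, function(col) {
    x <- dat[[col]]
    sum(is.na(x) | (is.character(x) & trimws(as.character(x)) == ""))
  })
}

cols <- c("pval", "beta", "se", "eaf", "effect_allele", "other_allele")
n_missing <- count_missing(gwas, cols)
```

`n_missing` is a named vector with one count per column, so it can be pasted straight into the QC report.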
4. How many SNPs had nonsense values:
alleles other than ‘A’, ‘C’, ‘G’ or ‘T’
P-values <0 or >1
non-positive or infinite standard errors (<=0 or =Infinity)
infinite beta estimates, or allele frequencies <0 or >1
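The checks in this list could be sketched in R as below. The data frame and its column names are hypothetical, and the toy values are chosen so that every check flags at least one row.

```r
# Toy summary statistics; column names are assumptions, not taken from the source.
gwas <- data.frame(
  effect_allele = c("A", "C", "I", "G", "T"),
  other_allele  = c("G", "T", "D", "A", "C"),
  pval          = c(0.01, 1.2, 0.5, -0.1, 0.3),
  se            = c(0.05, 0.00, Inf, 0.02, 0.04),
  beta          = c(0.10, Inf, 0.20, -0.30, 0.05),
  eaf           = c(0.30, 0.10, 1.50, 0.40, -0.20),
  stringsAsFactors = FALSE
)

valid <- c("A", "C", "G", "T")

# One count per check. A SNP is flagged for alleles if either allele is invalid;
# note that a missing allele (NA) would also be flagged by %in%.
nonsense <- c(
  alleles = sum(!(gwas$effect_allele %in% valid) | !(gwas$other_allele %in% valid)),
  pval    = sum(gwas$pval < 0 | gwas$pval > 1, na.rm = TRUE),
  se      = sum(gwas$se <= 0 | is.infinite(gwas$se), na.rm = TRUE),
  beta    = sum(is.infinite(gwas$beta)),
  eaf     = sum(gwas$eaf < 0 | gwas$eaf > 1, na.rm = TRUE)
)
```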
5. MAC<6
How many SNPs had MAC<6? MAC = 2*N*MAF, where N is sample size and MAF is minor allele frequency.
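As an illustration, the MAC calculation might look like this in R. It assumes a single study-wide sample size `n` and derives MAF from a hypothetical effect allele frequency vector `eaf`; both names and all values are toy assumptions.

```r
n   <- 1000                                 # sample size (toy value)
eaf <- c(0.001, 0.002, 0.50, 0.999, 0.01)   # toy effect allele frequencies

maf <- pmin(eaf, 1 - eaf)   # minor allele frequency
mac <- 2 * n * maf          # minor allele count, MAC = 2*N*MAF

n_low_mac <- sum(mac < 6, na.rm = TRUE)
```

If per-SNP sample sizes are available, `n` can be replaced by a vector of the same length as `eaf`.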
6. Duplicate SNPs
Count the number of duplicate SNPs.
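A short R sketch, assuming that duplicates are defined by a repeated SNP identifier (the text does not say whether position or alleles should also be compared). The vector `snp` is a hypothetical name.

```r
snp <- c("rs1", "rs2", "rs2", "rs3", "rs3", "rs3")   # toy SNP identifiers

# Rows that repeat an identifier already seen earlier in the file.
n_dup_rows <- sum(duplicated(snp))

# Distinct identifiers that occur more than once.
n_dup_snps <- length(unique(snp[duplicated(snp)]))
```

Reporting both numbers avoids ambiguity between "extra rows" and "affected SNPs".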
7. SE_N metrics
In theory the inverse of the median standard error of the beta estimates across all SNPs should be proportional to the square root of the sample size:
SEj^2=var(βj)=var(Y)/(Nj*var(Xj))
Where SEj^2 is the squared standard error of the beta estimate for SNP j, βj is the beta for SNP j, var(Y) is the variance of the phenotype Y, Nj is the sample size for SNP j and var(Xj) is the genotype variance for SNP j.
Assuming that the sample size for each SNP is close to the maximum sample size N across all SNPs:
median(SEj) = (sd(Y)/sqrt(N))*median(1/sqrt(var(Xj)))
sqrt(N) = (c*sd(Y))/median(SEj)
Where c = median(1/sqrt(var(Xj)))
7a. Calculate the following metrics:
N = reported sample size for study or max N across all SNPs
N_rep_sqrt = sqrt(N)
med_se = median(se)
sd_Y = 1 # we assume the phenotype variance is 1
MAF = minor allele frequency
var_X = 2*MAF*(1-MAF)
c = median(1/sqrt(var_X))
N_est_sqrt = (c*sd_Y)/med_se
N_est = N_est_sqrt^2
sd_Y_est1 = (N_rep_sqrt*med_se)/c
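The metrics in 7a can be sketched in R as follows. The toy standard errors are generated directly from SE^2 = var(Y)/(N*var(X)), so the estimated and reported sample sizes should agree. Variable names are illustrative. The definition of `ratio_se_n` as N_est_sqrt divided by N_rep_sqrt is an assumption, since the text does not define ratio_se_N, and sd_Y_est2 is left out for the same reason.

```r
n_rep <- 10000                               # reported sample size (toy value)
maf   <- seq(0.05, 0.5, length.out = 1000)   # toy minor allele frequencies
sd_y  <- 1                                   # assumed phenotype SD (standardised phenotype)

# Toy standard errors that follow SE^2 = var(Y) / (N * var(X)) exactly.
var_x <- 2 * maf * (1 - maf)
se    <- sd_y / sqrt(n_rep * var_x)

n_rep_sqrt <- sqrt(n_rep)
med_se     <- median(se)
c_const    <- median(1 / sqrt(var_x))
n_est_sqrt <- (c_const * sd_y) / med_se
n_est      <- n_est_sqrt^2
sd_y_est1  <- (n_rep_sqrt * med_se) / c_const

# Assumption: ratio_se_N is taken as N_est_sqrt / N_rep_sqrt (not defined in the text).
ratio_se_n <- n_est_sqrt / n_rep_sqrt
```

With real data, `se` and `maf` would come from the summary statistics file, and a ratio far from 1 would point to one of the problems listed for this metric.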
7b. Report the following metrics:
N
N_est
N_est_sqrt
N_rep_sqrt
ratio_se_N
sd_Y_est1
sd_Y_est2
sd_Y_rep # this is the SD in the MR-Base study table
We expect ratio_se_N to be 1. When it is not 1, one or more of the following problems could apply:
the study phenotype was not standardised, so the variance of the phenotype is not 1;
the study’s phenotypic variance differs from other studies, which might be explained by a different study design or special study population;
the study’s MAFs differ from other studies, which might be explained by a diverging genotyping platform, reference panel for the imputation, or a different ethnicity;
the study’s SNP imputation qualities differ from those of other studies, which might reflect errors in the imputation or a different reference panel;
the study’s effective sample size differs from the stated sample size, which might be due to unaccounted relatedness between study participants or mis-coded sample size;
the study analyst has used a different statistical test; or
the study analyst has mis-specified the phenotype transformation or the regression model, which results in a different phenotype variance or residual variance.
P-Z plot
For each SNP, compare the reported P-value with the P-value computed from the Z-statistic based on the reported beta estimate and standard error (Zj = βj/SEj).
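A hedged R sketch of this comparison, assuming the reported P-values come from a two-sided test on a normally distributed Z-statistic. The vectors and the discordance tolerance of 0.1 on the -log10 scale are illustrative choices, not part of the protocol.

```r
beta <- c(0.10, -0.20, 0.05, 0.30)   # toy effect estimates
se   <- c(0.05, 0.04, 0.05, 0.10)    # toy standard errors

z        <- beta / se
p_from_z <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided P-value from Z

# Reported P-values: the first three agree with beta/se, the last is deliberately wrong.
p_reported <- c(p_from_z[1:3], 0.5)

# Flag SNPs whose reported and computed P-values differ on the log10 scale.
discordant   <- abs(log10(p_reported) - log10(p_from_z)) > 0.1
n_discordant <- sum(discordant)

# The plot itself: points should fall on the identity line.
# plot(-log10(p_from_z), -log10(p_reported)); abline(0, 1)
```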
AF plot
Plot the allele frequency of each SNP against a reference allele frequency.
Count the number of SNPs in the GWAS whose AF deviates by more than 20% from the reference:
abs(af_gwas-af_ref)/af_ref > 0.2
How many SNPs have mismatching alleles, e.g. A/C in the GWAS but A/T in the reference?
Plotting reported allele frequencies against a reference set, such as from the HapMap or 1000 Genomes projects, can help to visualize patterns that pinpoint strand issues, allele miscoding, or the inclusion of individuals whose self-reported ancestry does not match their genetic ancestry.
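The two counts for this section might be computed as in the R sketch below. It assumes the GWAS and reference data have already been merged by SNP and that both frequencies refer to the same allele. Column names are hypothetical, and strand flips are not handled.

```r
# Toy merged table of GWAS and reference data; column names are assumptions.
m <- data.frame(
  af_gwas = c(0.30, 0.10, 0.80, 0.50),
  af_ref  = c(0.31, 0.20, 0.20, 0.45),
  ea_gwas = c("A", "C", "A", "G"),
  oa_gwas = c("C", "T", "G", "T"),
  ea_ref  = c("A", "C", "A", "G"),
  oa_ref  = c("C", "T", "T", "T"),
  stringsAsFactors = FALSE
)

# Relative deviation from the reference frequency: abs(af_gwas - af_ref) / af_ref.
rel_dev      <- abs(m$af_gwas - m$af_ref) / m$af_ref
n_af_deviant <- sum(rel_dev > 0.2, na.rm = TRUE)

# Alleles mismatch when the unordered allele pair differs (e.g. A/G vs A/T).
pair <- function(a, b) paste(pmin(a, b), pmax(a, b))
n_allele_mismatch <- sum(pair(m$ea_gwas, m$oa_gwas) != pair(m$ea_ref, m$oa_ref))

# The AF plot itself:
# plot(m$af_ref, m$af_gwas); abline(0, 1)
```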
Genomic control / lambda
GC-lambda = median(qchisq(pvals, df=1, lower.tail=FALSE)) / qchisq(0.5, df=1, lower.tail=FALSE)
GC-lambda values greater than 1.1 require further investigation.
Population stratification can either inflate or deflate association P-values, and its extent can be summarised by the genomic control (GC) inflation factor (λGC). However, λGC also increases with sample size in the case of polygenic phenotypes. The interpretation of GC-lambda is unclear when the study used a targeted array.
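The GC-lambda formula can be run as written; the R sketch below applies it to toy P-values that are evenly spread over (0, 1), which mimics the null distribution and should give a value close to 1. The variable names are illustrative.

```r
# Toy P-values evenly spread over (0, 1), i.e. what is expected under the null.
pvals <- ppoints(1000)

# Convert P-values to 1-df chi-square statistics and compare medians.
chisq  <- qchisq(pvals, df = 1, lower.tail = FALSE)
lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1, lower.tail = FALSE)

# Threshold from the text: values above 1.1 require further investigation.
needs_review <- lambda > 1.1
```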
Any studies with r2_sum > 0.5 should be investigated very closely.
Studies with r2_sum > 0.1 may also require investigation.
How many SNPs had a P-value < 5e-8?
N_p = length(which(pval < 5e-8))
Where this number is “high”, further investigation is required. Interpretation of this value depends on other metrics in the report, such as sample size, r2_sum and GC-lambda.
Compare effect alleles with GWAS catalog
Merge the file with the GWAS catalog on pmid.
Where EAmrbase and EAgwascat differ, flip the sign of the beta (multiply by -1). Then compare BETAgwascat with BETAmrbase. Produce two tables: one for rows that match and one for rows that don’t. Matching means the directions of effect are the same (the betas won’t necessarily be identical).
How many match and how many don’t match?
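The comparison could be sketched in R as below, starting from an already merged table. The column names are hypothetical. Flipping the MR-Base beta (rather than the GWAS catalog beta) is an assumption, as is treating any allele difference as a swap of effect and other allele.

```r
# Toy merged table; column names are assumptions.
m <- data.frame(
  ea_mrbase    = c("A", "C", "G", "T"),
  ea_gwascat   = c("A", "T", "G", "C"),
  beta_mrbase  = c(0.10, -0.20, 0.05, 0.30),
  beta_gwascat = c(0.12, 0.18, -0.04, 0.25),
  stringsAsFactors = FALSE
)

# Flip the MR-Base beta where the effect alleles differ.
flip <- m$ea_mrbase != m$ea_gwascat
m$beta_mrbase_aligned <- ifelse(flip, -m$beta_mrbase, m$beta_mrbase)

# Rows match when the aligned betas point in the same direction.
m$match    <- sign(m$beta_mrbase_aligned) == sign(m$beta_gwascat)
matched    <- m[m$match, ]
mismatched <- m[!m$match, ]

n_match    <- nrow(matched)
n_mismatch <- nrow(mismatched)
```

In practice the merge would be done first, for example with `merge()` on the shared study and SNP identifiers.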
From Phil's document