QC metrics - Githubissues

From Phil's document

MR-Base QC metrics Jan 2019

1.Original column names List the original column names and their corresponding names in the QC’d file. The purpose is to be able to identify column labelling errors, e.g. for effect allele and effect allele frequency.

2.Number of monomorphic SNPs Count the number of monomorphic SNPs, i.e. with MAF= 0 or 1

Number of SNPs missing information Count the number of SNPs missing a pvalue, beta, se eaf, effect allele and other allele

How many SNPs had nonsense values: alleles other than ‘A’,’C’,’G’ or ‘T’ P-values <0 or >1 negative or infinite standard errors (<=0 or =Infinity) infinite beta estimates or allele frequencies <0 or >1

MAC<6 How many SNPs had MAC<6? MAC = 2NMAF, where N is sample size and MAF is minor allele frequency.

Duplicate SNPs Count number of duplicate SNPs

SE_N metrics In theory the inverse of the median standard error of the beta estimates across all SNPs should be proportional to the square root of the sample size:

SEj^2=var(βj)=var(Y)/(Nj*var(Xj))

Where SEj^2 is the squared standard error the beta estimate for SNP j, βj is beta for SNP j, var(Y) is variance for the phenotype Y, Nj is sample size for SNP j and var(Xj) is genotype variance for SNP j.

Assuming that a given SNP sample size is close to max sample size for all SNPS:

median(SEj) = (sd(Y)/sqrt(N))median(1/var(Xj)) sqrt(N) = (csd(Y))/median(SEj) Where c = median(1/sqrt(var(Xj)))

7a. Calculate the following metrics:

N = reported sample size for study or max N across all SNPs N_rep_sqrt= sqrt(N) med_se = median(se) sd_Y = 1 # we assume variance is 1 MAF = minor allele frequency var_X = 2MAFj(1−MAFj) C = median(1/sqrt(var_X)) N_est_sqrt = (csd_Y)/med_se N_est= N_sqrt^2 sd_Y_est1 = (sqrt_Nmed_se)/c

z=b/se standardised_bhat = sqrt((z^2/(z^2+n-2)) / (2 maf (1-maf))) * sign(z) estimated_sd = b / standardised_bhat sd_Y_est2 = median(estimated_sd) ratio_se_N=N_est_sqrt/N_rep_sqrt

7b. Report the following metrics: N N_est N_est_sqrt N_rep_sqrt ratio_se_N sd_Y_est1 sd_Y_est2 sd_Y_rep #this is SD in Mr-Base study table

We expect ratio_se_N to be 1. When it is not 1 the following problems could apply: Study phenotype was not standardised. Variance of phenotype is not 1. the study’s phenotypic variance differs from other studies, which might be explained by a different study design or special study population; the study’s MAFs differ from other studies, which might be explained by a diverging genotyping platform, reference panel for the imputation, or a different ethnicity the study’s SNP imputation qualities differ from those of other studies, which might reflect errors in the imputation or a different reference panel; the study’s effective sample size differs from the stated sample size, which might be due to unaccounted relatedness between study participants or mis-coded sample size; the study analyst has used a different statistical test; or the study analyst has mis-specified the phenotype transformation or the regression model, which results in a different phenotype variance or residual variance

P-Z plot For each SNP, compare the reported P-values with the P-values computed from the Z-statistics based on reported beta-estimate and standard error (Z statistics = βj/SE (β)j)

AF plot Plot the allele frequency of each SNP against a reference allele frequency. Count number of SNPs in the GWAS where their AF deviates >20% from reference: abs(af_gwas-af_ref)/af_ref > 0.2 How many SNPs have mismatching alleles? E.g. AC in GWAS but AT in reference?

Plotting reported allele frequencies against a reference set, such as from the HapMap or 1000 Genomes projects, can help to visualize patterns that pinpoint strand issues, allele miscoding, or the inclusion of individuals whose self-reported ancestry does not match their genetic ancestry.

Genomic control / lambda GC-lambda = median(qchisq(pvals, df=1, low=FALSE)) / qchisq(0.5, 1, low=FALSE) GC-lambda greater than 1.1 require further investigation Population stratification can either inflate or deflate association P-values and can be grasped by the genomic control (GC) inflation factor (λGC). As λGC increases with sample size in the case of polygenic phenotypes. Interpretation of GC-lambda unclear when study used a targeted array.

Calculate r2 statistics var1=1 var2 = sd_Y_est r2_1 = 2b^2p(1-p)/var_1 r2_2 = 2b^2p(1-p)/var_2 r2_sum1=sum(r2_1) r2_sum2=sum(r2_2)

F = b^2/se^2

r2 = F / (F + N – 2)

Any studies with r2_sum >0.5 should be investigated very closely Studies with r2_sum>0.1 may also require investigation

How many SNPs had a Pvalues < 5e-8? N_p = length(which(pval<5e-8 ))

Where this number is “high” further investigation is required. Interpretation of this value depends on other metrics in the report, such as sample size, sum_r2 and GC-lambda.

Compare effect alleles with GWAS catalog Merge file with GWAS catalog on pmid. Where EAmrbase and EAgwascat different, flip the beta by *-1. Then compare BETAgwacat versus BETAmrbase. Produce two tables for rows that match, and for when they don’t match. Matching means the directions of effect are the same (the betas won’t necessarily be identical). How many match and how many don’t match?

QQ plot & manhattan plot

MRCIEU / opengwas-reports

QC metrics #2

F = b^2/se^2

r2 = F / (F + N – 2)