Quality control of Base and Target Data

jackosullivanoxford commented 5 years ago

This issue describes the quality control measures needed before calculating polygenic risk scores (PRS). I have followed this bioRxiv guide.

The issues to consider are as follows:

1. Standard quality control: Apply standard GWAS quality control measures (e.g. removing SNPs with a low genotyping rate, low minor allele frequency or low imputation ‘info score’, and removing individuals with a low genotyping rate) to both the GWAS summary statistics (GWAS SS) and the target data (in my case UKBB).
2. File transfer: Ensure that files have not been corrupted during transfer. Use md5sum to do this.
3. Genome build: Ensure that the base and target data SNPs have genomic positions assigned on the same genome build [32]. LiftOver (PMID: 20959295) is an excellent tool for standardizing genome build across different data sets.
4. Effect allele: Determine which allele in the GWAS SS is the effect allele.
5. Ambiguous SNPs: If the base and target data were generated using different genotyping chips and the chromosome strand (+/-) for either is unknown, then it is not possible to match ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T) across the data sets, because it will be unknown whether the base and target data are referring to the same allele or not. While allele frequencies can be used to infer which alleles match [34], we recommend removing all ambiguous SNPs.
6. Duplicate SNPs: Ensure that there are no duplicated SNPs in either the base or target data.
7. Sex-check: Do not include sex chromosomes.
8. Sample overlap: Do any of the individuals in the GWAS SS overlap with individuals in the UKBB?
9. Relatedness: As per the LDpred wiki (https://github.com/bvilhjal/ldpred/wiki/Q-and-A): "Relatedness in the validation/target sample is not a concern, however it is a concern for the LD reference panel." Related individuals have been removed from the LDpred reference panel.
10. Heritability check: A critical factor in the accuracy and predictive power of PRS is the power of the base GWAS data [4], and so to avoid reaching misleading conclusions from the application of PRS we recommend first performing a heritability check of the base GWAS data. We suggest using software such as LD Score regression [8] or LDAK [37] to estimate chip heritability from the GWAS summary statistics, and recommend caution in interpreting PRS analyses that are performed on GWAS with a low chip-heritability estimate (e.g. h²_SNP < 0.05).

jackosullivanoxford commented 5 years ago

Quality control issue 1: Standard quality control

-The GWAS SS from the study by Malik et al (PMID: 29531354) underwent standard quality control as outlined by Winkler et al (PMID: 24762786). Note that this is meta-analysis level quality control. Malik et al also did individual study-level quality filtering: "Individual study-level filters were set to remove extreme effect values (β > 5 or β <−5), rare SNPs (MAF <0.01) and variants with low imputation accuracy (oevar_imp or info score <0.5). The effective allele count was defined as twice the product of the MAF, imputation accuracy (r2, info score or oevar_imp), and number of cases. Variants with an effective allele count <10 were excluded."

-Our target data (UKBB) were prepared with PLINK, which, as per the bioRxiv guide, is a standard and appropriate tool for quality control.
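To make the study-level filters quoted above concrete, here is a minimal Python sketch of them. The function and argument names are my own (hypothetical) and this is not the code Malik et al used; it only restates the thresholds from the quote.

```python
def effective_allele_count(maf, imputation_quality, n_cases):
    """Twice the product of MAF, imputation accuracy and number of cases."""
    return 2 * maf * imputation_quality * n_cases


def passes_study_level_filters(beta, maf, imputation_quality, n_cases):
    """Return True if a variant survives the quoted study-level filters.

    Argument names are illustrative; imputation_quality stands in for
    whichever of r2, info score or oevar_imp a study reports.
    """
    if beta > 5 or beta < -5:        # extreme effect values
        return False
    if maf < 0.01:                   # rare SNPs
        return False
    if imputation_quality < 0.5:     # low imputation accuracy
        return False
    if effective_allele_count(maf, imputation_quality, n_cases) < 10:
        return False
    return True
```

For example, a variant with MAF 0.01, imputation accuracy 0.6 and 500 cases has an effective allele count of about 6, so it would be excluded even though it passes the MAF and imputation filters.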

jackosullivanoxford commented 5 years ago

Quality control issue 2: File Transfer

-I have done this using md5sum and the file was not corrupted during transfer.
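For anyone who wants to repeat this check from Python rather than the md5sum command line, a rough equivalent is sketched below. The function names are hypothetical, and it assumes you already have the expected checksum (e.g. computed at the source before transfer).

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_md5):
    """Compare the local checksum with the one recorded before transfer."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for summary statistics files of several gigabytes.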

jackosullivanoxford commented 5 years ago

Quality control issue 3: Genome build

-To do

jackosullivanoxford commented 5 years ago

Quality control issue 4: Effect allele

-Make sure that the effect allele in the GWAS SS is clearly identified. Done: we have rearranged the GWAS SS into the columns required by LDpred. Below is the format required for the GWAS SS in step 1 of LDpred (left side) and the corresponding column in the MEGASTROKE GWAS SS (right side):

| Required format for LDpred | MEGASTROKE |
| --- | --- |
| chr | (Not present) |
| pos | (Not present) |
| ref | Allele2 |
| alt | Allele1 (this is the effect allele) |
| reffrq (frequency of the ref allele) | 1 - Freq1 (Freq1 is the frequency of the effect allele) |
| info | (Not present), but info is a dummy variable that can be set to 1 |
| rs | Markername |
| pval | P-value |
| effalt | Effect (effect of Allele1) |
| (Not present) | StdErr |

*The table above was originally posted here.
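The column mapping above could be applied row by row along the following lines. This is only a sketch under assumptions: the function name and the `positions` lookup (rsID to chromosome and position, e.g. taken from the target .bim file) are hypothetical, the MEGASTROKE column names are spelled as in the table, and upper-casing the alleles is my own choice.

```python
LDPRED_COLUMNS = ["chr", "pos", "ref", "alt", "reffrq", "info", "rs", "pval", "effalt"]


def to_ldpred_row(row, positions):
    """Map one MEGASTROKE row (a dict of strings) to the LDpred columns.

    positions: hypothetical dict of rsID -> (chromosome, base-pair position),
    needed because chr and pos are not present in the MEGASTROKE file.
    Returns None when the rsID has no known position.
    """
    rsid = row["Markername"]
    if rsid not in positions:
        return None
    chrom, pos = positions[rsid]
    return {
        "chr": chrom,
        "pos": pos,
        "ref": row["Allele2"].upper(),
        "alt": row["Allele1"].upper(),        # Allele1 is the effect allele
        "reffrq": 1.0 - float(row["Freq1"]),  # Freq1 is the effect allele frequency
        "info": 1,                            # dummy value, info is not present
        "rs": rsid,
        "pval": float(row["P-value"]),
        "effalt": float(row["Effect"]),       # effect of Allele1
    }
```

StdErr is deliberately dropped because the LDpred format in the table has no matching column.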

jackosullivanoxford commented 5 years ago

Quality control issue 5: Ambiguous SNPs

-We removed all SNPs whose rsIDs did not match between the GWAS SS and the target data (see /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/Merging_bim_GWAS_SS.py).
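Note that matching on rsID alone does not by itself drop strand-ambiguous SNPs (A/T and C/G pairs), so an explicit check may still be worth running. A minimal sketch of such a check follows; it is not taken from Merging_bim_GWAS_SS.py, and the function names and the (rsID, allele1, allele2) tuple layout are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(allele1, allele2):
    """True when the two alleles are complementary bases (A/T or C/G)."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def drop_ambiguous(snps):
    """Keep only the non-ambiguous SNPs from (rsid, allele1, allele2) tuples."""
    return [snp for snp in snps if not is_ambiguous(snp[1], snp[2])]
```

Multi-base alleles (indels) are never flagged here, since they have no entry in the complement table.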

jackosullivanoxford commented 5 years ago

Quality control issue 6: Duplicate SNPs

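-I have checked and there are no duplicate SNPs. Code to do this:

```r
# Flag rsIDs that appear more than once in the LDpred-formatted GWAS SS
dup <- duplicated(ldpred_ss$rs)
table(dup)["TRUE"]   # Gives NA, i.e. no duplicated rsIDs
table(dup)["FALSE"]  # Gives the total number of SNPs: 7,640,175
```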

jackosullivanoxford commented 5 years ago

Quality control issue 7: Sex chromosomes

-I have checked our PLINK files and we have not included chromosome 23. The relevant PLINK files are located here: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/step1_inputs
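One way to repeat this check is to scan the first column of each .bim file for non-autosomal chromosome codes. The sketch below assumes the standard whitespace-separated PLINK .bim layout (chromosome code first, variant ID second); the function name is hypothetical, and it flags Y, XY and MT codes in addition to chromosome 23.

```python
# PLINK numeric and letter codes for X, Y, XY (pseudo-autosomal) and MT
NON_AUTOSOMAL_CODES = {"23", "24", "25", "26", "X", "Y", "XY", "MT"}


def non_autosomal_variants(bim_lines):
    """Return the variant IDs in a .bim file that are not on chromosomes 1-22."""
    flagged = []
    for line in bim_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom = fields[0].upper()
        if chrom.startswith("CHR"):
            chrom = chrom[3:]  # tolerate a "chr" prefix
        if chrom in NON_AUTOSOMAL_CODES:
            flagged.append(fields[1])
    return flagged
```

Passing an open file handle for a .bim file works as `bim_lines`; an empty result means only autosomal variants are present.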

jackosullivanoxford commented 5 years ago

Quality control issue 8: Sample-overlap: I have done this and there is no overlap.

jackosullivanoxford commented 5 years ago

Quality control issue 9: Relatedness

-As per the LDpred wiki (https://github.com/bvilhjal/ldpred/wiki/Q-and-A): "Relatedness in the validation/target sample is not a concern, however it is a concern for the LD reference panel." Related individuals have been removed from the LDpred reference panel.

jackosullivanoxford commented 5 years ago

Quality control issue 10: Heritability check

-To do

AshleyLab / risk_scores

Quality control of Base and Target Data #4