Open jackosullivanoxford opened 5 years ago
Quality control issue 1: Quality control
-The GWAS SS from Malik et al (PMID: 29531354) underwent standard quality control as outlined by Winkler et al (PMID: 24762786). Note that this is meta-analysis-level quality control.
-Malik et al also applied individual study-level quality filters: "Individual study-level filters were set to remove extreme effect values (β > 5 or β <−5), rare SNPs (MAF <0.01) and variants with low imputation accuracy (oevar_imp or info score <0.5). The effective allele count was defined as twice the product of the MAF, imputation accuracy (r2, info score or oevar_imp), and number of cases. Variants with an effective allele count <10 were excluded."
-Our target data (UKBB) were prepared in PLINK, which the bioRxiv guide describes as an appropriate, standard tool for quality control.
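The study-level filters quoted from Malik et al can be written out as a small check, mainly to make the effective allele count arithmetic explicit. This is a minimal Python sketch based only on the thresholds in the quote; the function and argument names are hypothetical and it is not the code the MEGASTROKE authors used.

```python
def effective_allele_count(maf, imputation_accuracy, n_cases):
    """Twice the product of MAF, imputation accuracy and number of cases."""
    return 2.0 * maf * imputation_accuracy * n_cases


def passes_study_level_filters(beta, maf, imputation_accuracy, n_cases):
    """Return True if a variant survives the filters quoted from Malik et al.

    Argument names are illustrative; imputation_accuracy stands for whichever
    of r2, info score or oevar_imp a study reports.
    """
    if beta > 5 or beta < -5:        # extreme effect values
        return False
    if maf < 0.01:                   # rare SNPs
        return False
    if imputation_accuracy < 0.5:    # low imputation accuracy
        return False
    if effective_allele_count(maf, imputation_accuracy, n_cases) < 10:
        return False
    return True
```

For example, a variant with MAF 0.02, info score 0.6 and 300 cases has an effective allele count of 2 x 0.02 x 0.6 x 300 = 7.2, so it would be excluded even though it passes the MAF and info thresholds on their own.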
Quality control issue 2: File Transfer
-I have done this using md5sum and the file was not corrupted during transfer.
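For anyone repeating the transfer check without the md5sum command line tool, the same comparison can be sketched in Python. The function names are hypothetical, and the expected checksum is assumed to come from the source of the file (for example a published .md5 file).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(path, expected_md5):
    """Compare the local file's checksum with the one reported at the source."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for GWAS summary statistics files that can be several hundred megabytes.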
Quality control issue 3: Genome build
-To do
Quality control issue 4: Effect allele
-Make sure that the effect allele in the GWAS SS is clear. Done: we have arranged the GWAS SS into the columns that LDpred requires. In the list below, the left side is the format required for step 1 of LDpred and the right side is the corresponding column in the MEGASTROKE GWAS SS:
Required format for LDpred - MEGASTROKE
chr - (Not present)
pos - (Not present)
ref - Allele2
alt - Allele1 (this is the effect allele)
reffrq (frequency of the ref allele) - (1 - Freq1) (*Freq1 is the frequency of the effect allele)
info - (Not present), but info is a dummy variable that can be set to 1
rs - Markername
pval - P-value
effalt - Effect (effect of Allele1)
(Not present) - StdErr
*The original location of the table above is here.
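The column mapping above can be sketched as a per-row conversion. This is an illustrative Python sketch, not the code we actually ran: the function name is hypothetical, the MEGASTROKE header spellings are taken from the list above and should be checked against the file, and chr and pos are left empty here because the summary statistics do not contain them.

```python
def megastroke_to_ldpred(row):
    """Map one MEGASTROKE summary-statistics row (a dict) to LDpred columns.

    Assumes the header names listed in the mapping above. chr and pos are not
    in the MEGASTROKE file, so they are left as None for a later step to fill.
    """
    freq1 = float(row["Freq1"])          # frequency of the effect allele
    return {
        "chr": None,
        "pos": None,
        "ref": row["Allele2"],
        "alt": row["Allele1"],           # effect allele
        "reffrq": 1.0 - freq1,           # frequency of the ref allele
        "info": 1,                       # dummy value, not present in source
        "rs": row["Markername"],
        "pval": row["P-value"],
        "effalt": row["Effect"],         # effect of Allele1
    }
```

StdErr has no LDpred counterpart in this mapping, so the sketch drops it.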
Quality control issue 5: Ambiguous SNPs
-We removed all SNPs whose rsIDs were not identical between the GWAS SS and the target .bim file (see /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/Merging_bim_GWAS_SS.py).
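As a rough illustration of this step, the Python sketch below keeps only SNPs whose rsID appears in both files and, separately, flags strand-ambiguous SNPs in the usual sense (complementary allele pairs A/T and C/G). It is not the contents of Merging_bim_GWAS_SS.py, which is not reproduced here; the function names and the "rs" key are assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose strand cannot be inferred from alleles."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def keep_shared_rsids(gwas_rows, bim_rsids):
    """Keep GWAS rows (dicts with an 'rs' key) whose rsID is also in the .bim."""
    shared = set(bim_rsids)
    return [row for row in gwas_rows if row["rs"] in shared]
```

Matching on rsID alone does not remove A/T or C/G SNPs, so the two checks answer different questions and may both be worth running.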
Quality control issue 6: Duplicate SNPs
-I have checked and there are no duplicate SNPs. Code to do this:
dup <- duplicated(ldpred_ss$rs)
table(dup)["TRUE"]   # Gives NA, i.e. no duplicated rsIDs
table(dup)["FALSE"]  # Gives the total number of SNPs: 7,640,175
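The same duplicate check can be sketched in Python for anyone working outside R. The function name is hypothetical, and the input is assumed to be the list of rsIDs from the rs column.

```python
from collections import Counter


def duplicated_rsids(rsids):
    """Return the sorted rsIDs that occur more than once in the input."""
    counts = Counter(rsids)
    return sorted(rsid for rsid, n in counts.items() if n > 1)
```

An empty result corresponds to the NA returned by table(dup)["TRUE"] in the R version. Unlike duplicated(), this also names the offending rsIDs, which helps if any do turn up.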
Quality control issue 7: Sex chromosomes
-I have checked our PLINK files and we have not included chromosome 23. The relevant PLINK files are located here: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/step1_inputs
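One way to repeat this check is to list the chromosome codes in the first column of each .bim file. The Python sketch below assumes the standard whitespace-delimited .bim layout with the chromosome code first; the function names are hypothetical. Besides 23, it also looks for the labels X, Y, XY and the numeric codes 24 and 25, which PLINK uses for the other sex-chromosome categories.

```python
SEX_CHROMOSOME_CODES = {"23", "24", "25", "X", "Y", "XY"}


def chromosomes_in_bim(lines):
    """Collect the chromosome codes (first column) from .bim file lines."""
    return {line.split()[0] for line in lines if line.strip()}


def sex_chromosome_codes_present(lines):
    """Return any sex-chromosome codes found; an empty set means none."""
    return chromosomes_in_bim(lines) & SEX_CHROMOSOME_CODES
```

A typical call would be `sex_chromosome_codes_present(open(bim_path))` for each .bim file in the step1_inputs directory, where bim_path is a placeholder for the file being checked.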
Quality control issue 8: Sample overlap
-I have checked this and there is no overlap.
Quality control issue 9: Relatedness
-As per the LDpred wiki (https://github.com/bvilhjal/ldpred/wiki/Q-and-A): "Relatedness in the validation/target sample is not a concern, however it is a concern for the LD reference panel."
-Related individuals have been removed from the LDpred reference panel.
Quality control issue 10: TO DO
This describes the quality control measures needed before computing polygenic risk scores. I have followed this bioRxiv guide.
The issues to consider are as follows: