Open jackosullivanoxford opened 5 years ago
Then use the complete ICD matrix to determine which participants (by PATID) have the relevant CHADSVASc phenotypes/risk factors, and add these as 1/0 columns to the phenotype.tsv file (../step3_with_covar). The ICD matrix is icd_matrix_complete.txt.gz (location: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/icd); code to read it: anna_ICD=fread(cmd="gunzip -c icd_matrix_complete.txt.gz").
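A rough sketch of how the 1/0 risk-factor columns could be derived, written in Python for illustration. It assumes the ICD matrix has one row per PATID and one 0/1 column per ICD-10 code, which has not been checked against icd_matrix_complete.txt.gz. The prefix mapping is illustrative only and is not a validated CHADSVASc phenotype definition; `RISK_FACTOR_PREFIXES` and `flag_risk_factors` are hypothetical names.

```python
import csv
import io

# Illustrative ICD-10 prefixes per risk factor (an assumption, not a validated definition).
RISK_FACTOR_PREFIXES = {
    "chf": ("I50",),               # heart failure
    "hypertension": ("I10",),      # essential hypertension
    "diabetes": ("E10", "E11"),    # type 1 / type 2 diabetes mellitus
    "stroke_tia": ("I63", "G45"),  # cerebral infarction, transient ischaemic attack
    "vascular": ("I21", "I70"),    # acute myocardial infarction, atherosclerosis
}


def flag_risk_factors(matrix_file, id_col="PATID"):
    """Return {PATID: {risk_factor: 0 or 1}} from a tab-separated 0/1 ICD matrix."""
    reader = csv.DictReader(matrix_file, delimiter="\t")
    flags = {}
    for row in reader:
        # Codes recorded for this participant (cells equal to "1").
        present = [code for code, value in row.items()
                   if code != id_col and value == "1"]
        flags[row[id_col]] = {
            factor: int(any(code.startswith(prefixes) for code in present))
            for factor, prefixes in RISK_FACTOR_PREFIXES.items()
        }
    return flags


# Tiny made-up matrix standing in for the real file.
example = io.StringIO(
    "PATID\tI10\tI500\tE119\tI639\n"
    "1001\t1\t0\t0\t0\n"
    "1002\t0\t1\t1\t0\n"
)
flags = flag_risk_factors(example)
```

For the real file, the same function could be given `gzip.open(path, "rt")` instead of the in-memory example, provided the layout assumption holds.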
This paper (https://www.ncbi.nlm.nih.gov/pubmed/?term=16152135) argues that treatment should not be included as a binary covariate in the regression model, and that treated participants should not be excluded either. Instead, one of the following two methods should be used:
1. Estimate the outcome risk in the treated population as if this population had not been treated. To do this, use estimates of annual stroke risk for different CHADSVASc scores (as per this website: https://www.chadsvasc.org/).
2. Use a censored normal regression model (explained in the paper).
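For method 1, each participant first needs a CHADSVASc score. A minimal Python sketch follows, using the standard CHA2DS2-VASc point weights (1 point each for heart failure, hypertension, diabetes, vascular disease, age 65-74 and female sex; 2 points each for age 75 or over and for prior stroke/TIA/thromboembolism). The function names are hypothetical, and the annual stroke risk per score is deliberately not hard-coded: it has to be supplied by hand from the source above.

```python
def chadsvasc_score(age, female, chf, hypertension, diabetes, stroke_tia, vascular):
    """CHA2DS2-VASc score (0-9) from age in years and 0/1 (or bool) risk-factor flags."""
    score = 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 1 if vascular else 0
    score += 1 if female else 0
    score += 2 if stroke_tia else 0
    # Age contributes 2 points at 75 or over, 1 point at 65-74, otherwise none.
    if age >= 75:
        score += 2
    elif age >= 65:
        score += 1
    return score


def untreated_annual_risk(score, risk_by_score):
    """Look up the annual stroke risk for a score.

    risk_by_score is a user-supplied {score: risk} table (an assumption of this
    sketch); no risk values are provided here.
    """
    return risk_by_score[score]
```

For example, `chadsvasc_score(age=70, female=True, chf=0, hypertension=1, diabetes=0, stroke_tia=0, vascular=0)` gives 3 (one point each for age 65-74, female sex and hypertension).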
Ultimately, I think it is worth doing (to show change in results):
*Also repeat the above analyses with the other thromboembolic outcomes in the phenotype file.
Once I have done steps 1-4, consider emailing the lead author of the Stat Med paper (https://twitter.com/Martin_Tobin) to seek advice.
Make 2 dataframes:
1. Df with column headings of: PATID, Phen, age, sex, PC1 (consider adding others), CHADSVASc variables, warfarin, ICD anticoagulation
2. Df with the above columns and PRS
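A minimal Python sketch of building the two tables, using plain dictionaries keyed by column name. `build_tables` and the example values are hypothetical, and dropping participants without a PRS from the second table is an assumption of this sketch, not something decided above.

```python
def build_tables(base_rows, prs_by_patid):
    """Return (df1, df2).

    df1: the base rows (PATID, Phen, age, sex, PC1, CHADSVASc variables, ...).
    df2: the same rows with a PRS column added; rows whose PATID has no PRS
         are dropped (an assumption of this sketch).
    """
    df1 = [dict(row) for row in base_rows]
    df2 = [
        {**row, "PRS": prs_by_patid[row["PATID"]]}
        for row in df1
        if row["PATID"] in prs_by_patid
    ]
    return df1, df2


# Made-up rows for illustration only.
base_rows = [
    {"PATID": "1001", "Phen": 0, "age": 61, "sex": "F", "PC1": 0.01, "warfarin": 0},
    {"PATID": "1002", "Phen": 1, "age": 74, "sex": "M", "PC1": -0.02, "warfarin": 1},
]
prs_by_patid = {"1001": 0.35}

df1, df2 = build_tables(base_rows, prs_by_patid)
```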