Important covariates to include in logistic regression

jackosullivanoxford commented 5 years ago

Patients on warfarin at enrollment and those who were started on it during follow-up. Note that we only want to count warfarin/NOAC use if the medication was taken before the ischemic stroke event (not medication started after an ischemic stroke).
Look at patients for whom warfarin and/or other anticoagulants were added between assessment center visits.
Patients on NOACs at enrollment (none, I believe) and those who were started on them during follow-up. Same rule as above.
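A minimal sketch of the "only count medication taken before the stroke" rule, assuming each patient has at most one medication start date and one first-ischemic-stroke date. The function name and arguments are hypothetical, and treating a same-day start as "not before" is an assumption that should be checked against how the dates were recorded.

```python
from datetime import date
from typing import Optional


def on_anticoagulant_before_stroke(med_start: Optional[date],
                                   stroke_date: Optional[date]) -> int:
    """Return 1 if the drug counts as an exposure, else 0.

    - never started: 0
    - started, no ischemic stroke recorded: 1
    - started strictly before the stroke: 1
    - started on or after the stroke date: 0 (assumption for same-day starts)
    """
    if med_start is None:
        return 0
    if stroke_date is None:
        return 1
    return 1 if med_start < stroke_date else 0
```

The same function can be applied separately to warfarin and NOAC start dates to give two 1/0 covariate columns.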

jackosullivanoxford commented 5 years ago

All of the codes for CHADSVASc covariates: I have done this and it is here: https://docs.google.com/spreadsheets/d/1Z5Q_2iTpNi3CvP1FFQSnSij670z44UPaxG3KTHGrb68/edit#gid=0

Then use the complete ICD matrix to determine which patients (which PATID) have the relevant CHADSVASc phenotypes/risk factors, and add them as 1/0 columns to the phenotype.tsv file (../step3_with_covar). The ICD matrix is icd_matrix_complete.txt.gz (location: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/icd). Code to read it: anna_ICD=fread("gunzip -c icd_matrix_complete.txt.gz")
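A rough sketch of turning ICD codes into 1/0 CHADSVASc columns. It assumes the ICD matrix has one row per PATID and one 0/1 column per ICD code written without a dot (e.g. I500); that layout has not been checked against the real file. The three code prefixes below are only illustrative, the actual lists are the ones in the spreadsheet linked above.

```python
# Illustrative mapping only; replace with the code lists from the spreadsheet.
CHADSVASC_CODES = {
    "chf": ["I50"],           # heart failure
    "hypertension": ["I10"],  # essential hypertension
    "diabetes": ["E11"],      # type 2 diabetes
}


def chadsvasc_flags(icd_row, code_map):
    """Return {risk_factor: 0/1} for one patient.

    icd_row maps ICD code -> 0/1 (ints or numeric strings), without the
    PATID column. A risk factor is set to 1 if any code starting with one
    of its prefixes is flagged for the patient.
    """
    flags = {}
    for name, prefixes in code_map.items():
        hit = any(
            int(value) == 1 and any(code.startswith(p) for p in prefixes)
            for code, value in icd_row.items()
        )
        flags[name] = int(hit)
    return flags


example_row = {"I500": 1, "I10": 0, "E119": "1"}
print(chadsvasc_flags(example_row, CHADSVASC_CODES))
# {'chf': 1, 'hypertension': 0, 'diabetes': 1}
```

Prefix matching is a design choice here so that subcodes such as I500 count toward I50; exact matching would be safer if the spreadsheet lists full codes.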

jackosullivanoxford commented 5 years ago

This paper (https://www.ncbi.nlm.nih.gov/pubmed/?term=16152135) argues that treatment should not be entered as a binary covariate in the regression model, and that treated participants should not be excluded either. Instead, one of the following two methods should be used:
1. Estimate the outcome risk in the treated population as if this population had not been treated. To do this, use estimates of annual stroke risk for different CHADSVASc scores (as per this website: https://www.chadsvasc.org/).
2. Use a censored normal regression model (explained in the paper).
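A sketch of method 1, under stated assumptions: the score uses the standard CHA2DS2-VASc point weights, the annual risk per score is constant over follow-up, and events are independent across patients. The risk table is passed in as an argument because the real values have to come from the source cited above; the numbers in the usage lines are placeholders, not published estimates.

```python
def chadsvasc_score(chf, hypertension, age, diabetes, stroke_tia, vascular, female):
    """CHA2DS2-VASc score from 0/1 flags and age in years.

    1 point each: CHF, hypertension, diabetes, vascular disease, female sex,
    age 65-74. 2 points each: age >= 75, prior stroke/TIA/thromboembolism.
    """
    score = chf + hypertension + diabetes + vascular + female + 2 * stroke_tia
    if age >= 75:
        score += 2
    elif age >= 65:
        score += 1
    return score


def expected_untreated_events(scores, annual_risk, years):
    """Expected number of strokes if nobody had been treated.

    scores: one CHA2DS2-VASc score per treated patient.
    annual_risk: dict score -> annual stroke probability (user supplied).
    years: follow-up length; cumulative risk is 1 - (1 - r) ** years.
    """
    total = 0.0
    for s in scores:
        r = annual_risk[s]
        total += 1.0 - (1.0 - r) ** years
    return total


# PLACEHOLDER risks for illustration only; not taken from any publication.
placeholder_risk = {2: 0.5}
print(expected_untreated_events([2, 2], placeholder_risk, years=1))  # 1.0
```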

Ultimately, I think it is worth running all of the following (to show how the results change):

1. Logistic regression with only PCA (first 4 components), age, sex, and genotyping array as covariates (this is what Khera did)
2. Logistic regression with PCA, age, sex covariates, and treatment (warfarin and NOAC) covariates
3. Logistic regression with PCA, age, sex covariates, treatment (warfarin and NOAC) and full CHADSVASc covariates
4. Logistic regression with PCA, age, sex covariates, and full CHADSVASc covariates (as individual covariates, and potentially also as a cumulative CHADSVASc score) (no treatment)
5. Regression with PCA, age, sex covariates, treatment (warfarin and NOAC) and full CHADSVASc covariates using censored normal regression.
6. Logistic regression with PCA, age, sex covariates, treatment (warfarin and NOAC) and full CHADSVASc covariates with some stroke outcomes imputed, e.g. if there are 100 treated patients and the stroke risk is 4% over X years, add 4 ischemic strokes (THINK ABOUT THIS, e.g. who should the strokes be assigned to: at random? evenly across risk scores?)
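For the imputation idea in the last item, a minimal sketch of the two assignment options raised there: uniformly at random, or weighted (for example by CHADSVASc score). The function name and the weighting scheme are hypothetical; nothing here settles which option is statistically appropriate. All weights are assumed to be positive.

```python
import random


def impute_strokes(patids, risk, rng, weights=None):
    """Pick which treated patients get an imputed ischemic stroke.

    patids: list of treated patient IDs.
    risk: cumulative stroke risk over follow-up (e.g. 0.04).
    rng: a random.Random instance, so the draw is reproducible.
    weights: optional positive weight per patient; if given, patients are
             drawn without replacement with probability proportional to
             their weight, otherwise uniformly at random.
    """
    n_events = round(len(patids) * risk)
    if weights is None:
        return set(rng.sample(patids, n_events))
    pool = list(patids)
    w = list(weights)
    chosen = set()
    for _ in range(n_events):
        pick = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.add(pool.pop(pick))
        w.pop(pick)
    return chosen


treated = list(range(1, 101))             # 100 hypothetical treated patients
cases = impute_strokes(treated, 0.04, random.Random(0))
print(len(cases))  # 4
```

Because the result depends on the random draw, repeating the imputation with several seeds and comparing the regression estimates would show how sensitive the results are to who is assigned a stroke.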

*Also repeat the above analyses with the other thromboembolic outcomes in the phenotype file.

Once I have done steps 1-4, think about emailing the lead author of the Stat Med paper (https://twitter.com/Martin_Tobin) to seek advice.

jackosullivanoxford commented 5 years ago

Make 2 dataframes:

Df with column headings of: PATID, Phen, age, sex, PC1 (consider adding others), CHADSVASc variables, warfarin, ICD anticoagulation
Df with the above columns and PRS
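One way the two tables could be assembled, sketched with plain lists of dicts so it runs without extra packages. Rows are keyed by PATID; dropping patients who have no PRS from the second table is an assumption, as are the example values.

```python
def build_frames(covariate_rows, prs_by_patid):
    """Return (base, with_prs).

    covariate_rows: list of dicts, one per patient, each with a "PATID" key
                    plus the covariate columns.
    prs_by_patid: dict PATID -> polygenic risk score.
    base keeps every patient; with_prs keeps only patients that have a PRS
    and adds it as a "PRS" column.
    """
    base = [dict(row) for row in covariate_rows]
    with_prs = [
        dict(row, PRS=prs_by_patid[row["PATID"]])
        for row in covariate_rows
        if row["PATID"] in prs_by_patid
    ]
    return base, with_prs


# Hypothetical rows; only a few of the planned columns are shown.
rows = [
    {"PATID": 1, "Phen": 0, "age": 61, "sex": 1, "PC1": 0.01, "warfarin": 0},
    {"PATID": 2, "Phen": 1, "age": 72, "sex": 0, "PC1": -0.02, "warfarin": 1},
]
base, with_prs = build_frames(rows, {2: 0.37})
print(len(base), len(with_prs))  # 2 1
```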

AshleyLab / risk_scores

Important covariates to include in logistic regression #7