AshleyLab / risk_scores

LD Pred risk scores for afib

Important covariates to include in logistic regression #7

Open jackosullivanoxford opened 5 years ago

jackosullivanoxford commented 5 years ago

Then use the complete ICD matrix to determine which participants (by PATID) have the relevant CHADSVASc phenotypes/risk factors, and add each one as a 1/0 column to the phenotype.tsv file (../step3_with_covar). The ICD matrix is icd_matrix_complete.txt.gz (location: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/icd). Code to read it: anna_ICD = fread("gunzip -c icd_matrix_complete.txt.gz")
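A minimal sketch of the flagging step, in plain Python rather than the R used above. It assumes the ICD matrix is wide (one row per PATID, one 1/0 column per ICD-10 code) and tab-separated; neither is confirmed in this thread. The code-prefix mapping and the function names are hypothetical and deliberately incomplete.

```python
import csv
import gzip
import io

# Hypothetical mapping from a risk-factor column to ICD-10 code prefixes.
# Illustrative and incomplete: the real code lists need clinical review.
RISK_FACTOR_PREFIXES = {
    "chf": ("I50",),
    "hypertension": ("I10",),
    "diabetes": ("E10", "E11"),
}


def risk_factor_flags(icd_rows, id_col="PATID"):
    """Return {PATID: {risk_factor: 0 or 1}} from a wide 1/0 ICD matrix."""
    flags = {}
    for row in icd_rows:
        per_person = {}
        for factor, prefixes in RISK_FACTOR_PREFIXES.items():
            # A factor is present if any matching code column is set to "1".
            hit = any(
                value == "1" and code.startswith(prefixes)
                for code, value in row.items()
                if code != id_col
            )
            per_person[factor] = int(hit)
        flags[row[id_col]] = per_person
    return flags


def read_icd_matrix(path):
    """Read the gzipped matrix (tab delimiter is an assumption)."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Tiny in-memory stand-in for the real matrix, for demonstration only.
demo = io.StringIO("PATID\tI10\tI500\tE119\n1001\t1\t0\t0\n1002\t0\t1\t1\n")
demo_flags = risk_factor_flags(csv.DictReader(demo, delimiter="\t"))
print(demo_flags["1002"])
```

For the real file, `risk_factor_flags(read_icd_matrix("icd_matrix_complete.txt.gz"))` would give the per-PATID flags to append to phenotype.tsv, provided the layout assumptions hold.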

jackosullivanoxford commented 5 years ago

This paper (https://www.ncbi.nlm.nih.gov/pubmed/?term=16152135) argues that treatment should not be entered as a binary covariate in the regression model, and that treated participants should not be excluded either. Instead, one of the following two methods should be used:

  1. Estimate the outcome risk in the treated population as if this population had not been treated. To do this, use estimates of annual stroke risk for different CHADSVASc scores (as per this website: https://www.chadsvasc.org/).
  2. Use a censored normal regression model (explained in the paper).
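Method 1 needs each participant's score before an annual risk can be looked up. Below is a rough sketch of the standard CHA2DS2-VASc point assignment, plus a helper that converts an annual risk into a risk over several years. The function names are hypothetical, and the conversion assumes a constant annual risk, which is a simplification. The per-score annual risks themselves are not included here and would have to come from the source cited above.

```python
def chadsvasc_score(age, female, chf, hypertension, diabetes,
                    prior_stroke_tia, vascular_disease):
    """Standard CHA2DS2-VASc points; age is in years, the rest are booleans."""
    score = 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_tia else 0   # stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0
    score += 1 if female else 0             # sex category
    if age >= 75:
        score += 2
    elif age >= 65:
        score += 1
    return score


def cumulative_risk(annual_risk, years):
    """Risk of at least one event over `years`, assuming a constant annual risk."""
    return 1.0 - (1.0 - annual_risk) ** years


example_score = chadsvasc_score(
    age=70, female=True, chf=False, hypertension=True,
    diabetes=False, prior_stroke_tia=False, vascular_disease=False,
)
print(example_score)
```

The expected number of "untreated" strokes in a group of treated participants is then the sum of their individual cumulative risks.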

Ultimately, I think it is worth running all of the following (to show how the results change):

  1. Logistic regression with only the first 4 principal components (PCA), age, sex, and genotype array as covariates (this is what Khera did)
  2. Logistic regression with PCA, age, sex, and treatment (warfarin and NOAC) covariates
  3. Logistic regression with PCA, age, sex, treatment (warfarin and NOAC), and full CHADSVASc covariates
  4. Logistic regression with PCA, age, sex, and full CHADSVASc covariates (as individual covariates, and possibly also as a cumulative CHADSVASc score), with no treatment covariates
  5. Censored normal regression with PCA, age, sex, treatment (warfarin and NOAC), and full CHADSVASc covariates
  6. Logistic regression with PCA, age, sex, treatment (warfarin and NOAC), and full CHADSVASc covariates, with some stroke outcomes imputed: e.g. if there are 100 treated patients and the stroke risk is 4% over X years, add 4 ischemic strokes (THINK ABOUT THIS, e.g. who should the strokes be assigned to: at random, or spread evenly across risk scores?)
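For the imputation idea in the last item, one possible sketch is below. It uses the random-assignment option only; spreading the events across risk-score strata would need a different sampler. The function name, the seed, and the made-up participant IDs are all hypothetical.

```python
import random


def impute_strokes(treated_ids, risk, seed=0):
    """Pick round(n * risk) treated participants at random to mark as strokes."""
    n_events = round(len(treated_ids) * risk)
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    return set(rng.sample(sorted(treated_ids), n_events))


# The example from the list above: 100 treated patients, 4% risk over X years.
treated = [f"P{i:03d}" for i in range(100)]
imputed = impute_strokes(treated, risk=0.04)
print(len(imputed))  # 4
```

Because the assignment is random, repeating the analysis over several seeds would show how sensitive the regression results are to who gets the imputed events.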

Also repeat the above analyses with the other thromboembolic outcomes in the phenotype file.
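To keep the logistic models (1-4) and the repeated outcomes organised, the grid could be enumerated as formula strings, roughly as sketched here. Every column name is hypothetical, including the outcome names, and putting PRS in as the predictor of interest is an assumption. Age and sex are CHADSVASc components too, so they appear only once, in the base set.

```python
BASE = ["age", "sex", "PC1", "PC2", "PC3", "PC4"]
TREATMENT = ["warfarin", "noac"]
# Remaining CHADSVASc components (age and sex are already in BASE).
CHADSVASC = ["chf", "hypertension", "diabetes",
             "prior_stroke_tia", "vascular_disease"]

MODELS = {
    "m1_base": BASE + ["genotype_array"],
    "m2_treatment": BASE + TREATMENT,
    "m3_treatment_chadsvasc": BASE + TREATMENT + CHADSVASC,
    "m4_chadsvasc": BASE + CHADSVASC,
}

# Hypothetical outcome column names in the phenotype file.
OUTCOMES = ["ischemic_stroke", "other_thromboembolism"]


def formulas(outcomes, models, predictor="PRS"):
    """Return {(outcome, model_name): 'outcome ~ predictor + covariates'}."""
    return {
        (outcome, name): f"{outcome} ~ {predictor} + " + " + ".join(covariates)
        for outcome in outcomes
        for name, covariates in models.items()
    }


grid = formulas(OUTCOMES, MODELS)
print(grid[("ischemic_stroke", "m1_base")])
```

Each string is in the usual `y ~ x1 + x2` form, so it could be handed to whichever formula-based logistic regression routine ends up being used.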

Once I have done steps 1-4, consider emailing the lead author of the Stat Med paper (https://twitter.com/Martin_Tobin) for advice.

jackosullivanoxford commented 5 years ago

Make 2 dataframes:

  1. Df with column headings of: PATID, Phen, age, sex, PC1 (consider adding others), CHADSVASc variables, warfarin, ICD anticoagulation
  2. Df with the above columns and PRS
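A plain-Python sketch of the two tables, using lists of dicts in place of real data frames. The column names for the CHADSVASc variables and the anticoagulation flag are hypothetical stand-ins, and dropping participants without a PRS (an inner join on PATID) is an assumption.

```python
# Hypothetical column names; the CHADSVASc variables are abbreviated to two here.
COVARIATE_COLUMNS = ["PATID", "Phen", "age", "sex", "PC1",
                     "chf", "hypertension", "warfarin", "icd_anticoagulation"]


def build_frames(covariate_rows, prs_by_patid):
    """Return (covariates only, covariates plus PRS), joined on PATID."""
    df_covar = [{col: row[col] for col in COVARIATE_COLUMNS}
                for row in covariate_rows]
    # Inner join: participants without a PRS are dropped from the second table.
    df_prs = [dict(row, PRS=prs_by_patid[row["PATID"]])
              for row in df_covar if row["PATID"] in prs_by_patid]
    return df_covar, df_prs


rows = [
    {"PATID": "1001", "Phen": 1, "age": 70, "sex": 0, "PC1": 0.01,
     "chf": 0, "hypertension": 1, "warfarin": 1, "icd_anticoagulation": 0},
    {"PATID": "1002", "Phen": 0, "age": 58, "sex": 1, "PC1": -0.02,
     "chf": 0, "hypertension": 0, "warfarin": 0, "icd_anticoagulation": 0},
]
df_covar, df_prs = build_frames(rows, {"1001": 0.37})
print(len(df_covar), len(df_prs))
```

Keeping the covariate-only table separate makes it easy to fit the same model with and without PRS and compare the two.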