choishingwan / PRS-Tutorial

A tutorial on how to run basic polygenic risk score analysis
MIT License

Question about validation set #17

Closed tengfeixiaozhu closed 3 years ago

tengfeixiaozhu commented 3 years ago

Dear Shing Wan Choi

I ran the analysis as described in the paper. I downloaded the base data (GWAS summary statistics) and obtained a training set and a validation set. I calculated the PRS and then trained a logistic model on the training set. However, the AUC in the validation set is much lower than in the training set. Which logistic model should be used to calculate the AUC of the validation set: the one trained on the training set, or one trained on the validation set itself?

Looking forward to your reply! Yours, Yanhua Wen

choishingwan commented 3 years ago

You mean, on your data? I don’t remember generating a validation data set

Normally it isn’t too surprising to see a reduction in performance in the validation data, because our results in the training data are usually overfitted.


-- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

tengfeixiaozhu commented 3 years ago

Thank you very much for your prompt reply.

Yes, on my data. I followed your tutorial, but your paper only briefly mentions using a validation data set to assess performance. I read a paper published in NG whose method looks similar to the idea in your paper; however, in that paper the AUCs in the training set and the validation set are the same.

I don't know what my problem is. First, I calculated the PRS in the training set and the validation set. I then trained a logistic model on the PRS alone using the training data and tested it on the validation data set; the AUCs are similar. Next, I trained a logistic regression model on the PRS adjusted for age, sex, and the first four principal components of ancestry (plink --pca) using the training data set. The AUC in the training set was determined by 10-fold cross-validation. But when I applied the logistic model trained on the training set to my independent validation set, the AUC was low. Here the principal components were obtained independently for the training set and the validation set.

Since the GWAS summary statistics were downloaded, and the AUCs are similar between the training set and the validation set when the logistic model is trained on the PRS alone, I think the PRS calculation is right.
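A minimal sketch (not from the thread) of why the PRS-only check is reassuring but limited: AUC depends only on how individuals are ranked, and a logistic model with a single predictor and a positive slope is a monotone transform of the PRS. Under that assumption, any PRS-only model gives the same AUC as the raw PRS, whatever its fitted intercept and slope. The PRS values, case/control labels, and coefficients below are made up for illustration.

```python
import math

def auc(scores, labels):
    """AUC as the fraction of case/control pairs where the case scores higher (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def logistic(prs, intercept, slope):
    """Predicted probability from a one-predictor logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * prs)))

# Hypothetical PRS values and case (1) / control (0) status.
prs = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
status = [0, 0, 1, 0, 1, 1]

auc_raw = auc(prs, status)
# Two arbitrary PRS-only models with positive slopes (illustrative coefficients).
auc_model_a = auc([logistic(p, -0.5, 0.8) for p in prs], status)
auc_model_b = auc([logistic(p, 2.0, 3.0) for p in prs], status)

print(auc_raw, auc_model_a, auc_model_b)
```

So similar PRS-only AUCs in both sets say little about whether the fitted coefficients transfer. Once covariates are added, the ranking does depend on the coefficients and on the covariates being defined the same way in both sets, which is one candidate explanation for the drop when the principal components are computed separately.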

Now I think the solution could be one of the following: 1) I should calculate the principal components on the combined training and validation samples. 2) I should train an extra logistic model on the validation data set. Could you give me some help?

I don't have anyone around me who studies SNPs, so I am afraid I may have misunderstood some details. Looking forward to your reply.

choishingwan commented 3 years ago

You don't need to determine the AUC in the training set if you have an independent sample. I think you can just fit the model once in the training set, then use the same parameters to test in the independent sample. When you report the AUC, make sure you report two numbers, one with the PRS and one without, so that you can tell the performance of the PRS. Hope this makes sense.
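The advice above could be sketched roughly as follows. This is an illustration on simulated data, not the tutorial's code: the single covariate (a standardised "age"), the effect sizes, the sample sizes, and the hand-written gradient-descent fit are all assumptions made to keep the example self-contained. Both models are fitted once on the training set, and the fitted coefficients are then applied unchanged to the validation set.

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, row):
    """Predicted probability; w[0] is the intercept, w[1:] the coefficients."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Logistic regression fitted by plain batch gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for row, target in zip(X, y):
            err = predict(w, row) - target
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def auc(scores, labels):
    """Fraction of case/control pairs where the case scores higher (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def simulate(n, rng):
    """Simulated samples: columns are [age, prs]; both effects are made up."""
    X, y = [], []
    for _ in range(n):
        age, prs = rng.gauss(0, 1), rng.gauss(0, 1)
        p = sigmoid(0.8 * age + 1.0 * prs)
        X.append([age, prs])
        y.append(1 if rng.random() < p else 0)
    return X, y

rng = random.Random(0)
train_X, train_y = simulate(500, rng)
valid_X, valid_y = simulate(500, rng)

# Fit both models once, on the training set only.
w_null = fit_logistic([[row[0]] for row in train_X], train_y)  # covariate only
w_full = fit_logistic(train_X, train_y)                        # covariate + PRS

# Apply the training-set coefficients unchanged to the validation set.
auc_null = auc([predict(w_null, [row[0]]) for row in valid_X], valid_y)
auc_full = auc([predict(w_full, row) for row in valid_X], valid_y)

print(f"validation AUC without PRS: {auc_null:.3f}")
print(f"validation AUC with PRS:    {auc_full:.3f}")
```

The gap between the two validation AUCs is what reflects the contribution of the PRS; the AUC of the full model alone mixes it with the contribution of the covariates.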

tengfeixiaozhu commented 3 years ago

It helps me a lot. Thank you very much.