Not able to reproduce the best fit PRS for plink

ranijames commented 3 years ago

Hi Sam, thanks for the great tutorial. I have been trying PLINK for polygenic risk scores. However, with the height dataset and the EUR PLINK files, I am not able to reproduce the results, in particular the best-fit PRS from the linear regression model in the R script.

choishingwan commented 3 years ago

What did you get?

I haven't kept the tutorial up to date lately, and I know that, for example, the pre-QCed data for the subsequent sections weren't updated.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRS-Tutorial/issues/27, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYV76BRV7IUG6RLYMV3UERY6TANCNFSM5FCQA5LA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

ranijames commented 3 years ago

So, for example, the best PRS threshold according to the tutorial is 0.3, and what I get is 0.5:

prs.result[which.max(prs.result$R2),]
  Threshold        R2            P     BETA       SE
7       0.5 0.1634566 9.256151e-26 55830.85 5004.534

OK, I see. I just want to make sure that all the steps mentioned are appropriate for the analysis. I am following the steps for our in-house datasets, so as a validation of all the steps, I first used the provided GWAS summary file and PLINK datasets.
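For readers following along outside R, the selection step in the snippet above amounts to picking the row with the largest R2. A minimal Python sketch follows; the table values are made up for illustration (they are not the tutorial's output), and `best_threshold` is a hypothetical helper name.

```python
# Pick the p-value threshold whose PRS has the largest R2.
# NOTE: the rows below are illustrative placeholders, not real tutorial output.

def best_threshold(results):
    """Return the row (dict) with the largest R2, like which.max() in R."""
    if not results:
        raise ValueError("results is empty")
    # max() returns the first row on ties, matching which.max().
    return max(results, key=lambda row: row["R2"])

prs_results = [
    {"Threshold": 0.001, "R2": 0.05},
    {"Threshold": 0.05, "R2": 0.10},
    {"Threshold": 0.3, "R2": 0.16},
    {"Threshold": 0.5, "R2": 0.15},
]

best = best_threshold(prs_results)
print(best["Threshold"], best["R2"])
```

The two results quoted in this thread differ by only about 0.002 in R2, so it seems plausible that a small difference in the input data or QC steps is enough to move the maximum from one threshold to a neighbouring one.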

choishingwan commented 3 years ago

If I repeat the analysis described in the tutorial using the provided dataset (I re-downloaded everything to make sure it is correct), I still get the result stated in the tutorial:

  Threshold        R2           P     BETA       SE
5       0.3 0.1612372 2.77407e-25 45316.19 4107.777

And if I use PRSice with INFO filtering disabled, I also get the same result. So you might want to double-check your steps.

Sam

ranijames commented 3 years ago

OK, thanks a lot for the update and for double-checking this. I appreciate your time and help. I will re-run everything once more and make sure my steps are the same. I have converted the script into a Snakemake workflow, so let's see if I missed something.

ranijames commented 3 years ago

Hi Sam, I could now validate my output against what is documented. Thanks for your time and patience. I have a question: are the base and target datasets from different individuals, or from the same individuals/samples? I read that they come from two sources: the target data is simulated from 1000 Genomes and the base is from your own lab. My understanding is that the phenotype (base) dataset should correspond to the phenotype-genotype (target) dataset, is that right? In the paper, I see that the target and base datasets are independent. In my case, the phenotype of interest is from a clinical trial that we ran internally, and the target is from the same patients. Hence, both my base and target datasets come from the same patients. Does that make sense?

choishingwan commented 3 years ago

You should never use the same samples for both the base and the target.

And the base data in the tutorial was from the GIANT consortium, with some modifications.

-- Dr Shing Wan Choi Instructor Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

ranijames commented 3 years ago

I am confused, then: in our case we do not have different base and target datasets. The base dataset is the GWAS output from PLINK on the same cohort, and the target is the same cohort as well. How does this overlap cause a problem in the result? Also, we do not have a continuous phenotype; ours is binary. In that case, is it fine to use logistic regression to find the best-fit PRS?

choishingwan commented 3 years ago

See pitfall 1 in this paper: https://www.nature.com/articles/nrg3457

Yes, use logistic regression for binary traits.
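On the sample-overlap point, a toy simulation may make the problem concrete. This is an illustration only, not the analysis from the cited paper: genotypes are random, the phenotype is pure noise, there is no clumping or p-value thresholding, and all function names are hypothetical. Because the phenotype has no genetic component, any R2 well above zero is an artefact of estimating and testing the effect sizes in the same samples.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def simulate_cohort(n_samples, n_snps, rng, maf=0.3):
    """Genotypes coded 0/1/2 plus a phenotype that is pure noise."""
    geno = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
            for _ in range(n_samples)]
    pheno = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return geno, pheno

def toy_gwas(geno, pheno):
    """Per-SNP least-squares slope of phenotype on genotype."""
    n = len(pheno)
    my = sum(pheno) / n
    betas = []
    for j in range(len(geno[0])):
        col = [row[j] for row in geno]
        mx = sum(col) / n
        sxx = sum((g - mx) ** 2 for g in col)
        sxy = sum((g - mx) * (y - my) for g, y in zip(col, pheno))
        betas.append(sxy / sxx if sxx > 0 else 0.0)
    return betas

def polygenic_score(geno, betas):
    """Weighted sum of allele counts for every individual."""
    return [sum(g * b for g, b in zip(row, betas)) for row in geno]

rng = random.Random(1)
base_geno, base_pheno = simulate_cohort(200, 500, rng)
target_geno, target_pheno = simulate_cohort(200, 500, rng)

betas = toy_gwas(base_geno, base_pheno)
r2_overlap = r_squared(polygenic_score(base_geno, betas), base_pheno)
r2_independent = r_squared(polygenic_score(target_geno, betas), target_pheno)
print(f"same samples as base: R2 = {r2_overlap:.3f}")
print(f"independent target:   R2 = {r2_independent:.3f}")
```

In this setup the score looks highly predictive when it is evaluated in the samples that produced the effect sizes and close to useless in the independent cohort, even though there is nothing to predict.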

ranijames commented 3 years ago

Thanks a lot for the paper. I have another question: is it possible to compute a gene-based polygenic score for each patient, rather than one based on each variant?

choishingwan commented 3 years ago

Do you mind elaborating? Do you mean you want to calculate PRS using only one gene?

You can use PRSet to calculate pathway-specific scores, but that might be a bit different from a "gene"-based PRS.

ranijames commented 3 years ago

Yes, what I mean is that we need a score for each gene, a weighted score. Currently, from both tools, we have a score for each patient based on each variant/SNP. If we collapse the variants into their genes and run the analysis, would that make sense? Or, if we simply apply the polygenic risk formula from Wikipedia to the collapsed genes, would that still make sense?

https://wikimedia.org/api/rest_v1/media/math/render/svg/7da94c1dc4f882b5cb293ac8415cf9d94f8639b7

In the end, we need a score for each gene within each sample/individual. I would like to hear your opinion on this.

Thanks again for your valuable remarks.

Could that still be used as a polygenic risk score?
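To make the question concrete, here is a sketch of what collapsing to genes could look like: the usual weighted sum of effect size times effect-allele count, accumulated per gene instead of over the whole genome. The SNP-to-gene map, effect sizes and genotypes are invented for illustration, and `gene_scores` is a hypothetical helper, not part of PLINK or PRSice.

```python
from collections import defaultdict

def gene_scores(dosages, betas, snp_to_gene):
    """Sum beta * effect-allele count per gene for one individual.

    dosages: {snp_id: 0, 1 or 2}; betas: {snp_id: effect size};
    snp_to_gene: {snp_id: gene name}. SNPs without a gene or a beta are skipped.
    """
    scores = defaultdict(float)
    for snp, dosage in dosages.items():
        gene = snp_to_gene.get(snp)
        if gene is None or snp not in betas:
            continue
        scores[gene] += betas[snp] * dosage
    return dict(scores)

# Hypothetical inputs for a single patient.
betas = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20, "rs4": 0.30}
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B"}  # rs4: no gene
patient = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}

per_gene = gene_scores(patient, betas, snp_to_gene)
genome_wide = sum(betas[snp] * dosage for snp, dosage in patient.items())
print(per_gene, genome_wide)
```

Each gene-level score rests on only a handful of SNPs, so it would be expected to carry much less signal than the genome-wide sum.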

choishingwan commented 3 years ago

It is implemented as PRSet. You can check our webpage.

The problem with going down to the gene level is that each gene will likely explain such a small amount of the phenotypic variance that the score will likely not be useful. If you group genes into pathways / gene sets, that might provide more power.

Sam

ranijames commented 3 years ago

Thanks again for your suggestions and time. By using pathways, do you mean taking the genes that pass a specific threshold and running a pathway enrichment analysis on them, and that this gives more meaningful results? Is that what you mean? Also, what specifically do you mean by a "small amount" of phenotypic variance?

choishingwan commented 3 years ago

Use pathways (collections of genes grouped by biochemical signalling or other biological processes) instead of individual genes.

For most genome-wide PRS, an R2 of 0.3 is already really good. If you are using a gene that represents X% of the genome, your R2 is likely to be about 0.3 * X% (maybe slightly higher than that). When you go down to the gene level, X is going to be very small, so the resulting R2 is likely to be too small to be useful.
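As a rough worked example of the scaling argument above: the genome fractions below are invented, and the proportional rule is the heuristic from this comment rather than a formal result.

```python
def approx_subset_r2(genome_wide_r2, fraction_of_genome):
    """Heuristic: R2 of a subset scales with the share of the genome it covers."""
    if not 0.0 <= fraction_of_genome <= 1.0:
        raise ValueError("fraction_of_genome must be between 0 and 1")
    return genome_wide_r2 * fraction_of_genome

# Hypothetical fractions: one gene vs. a pathway made of many genes.
single_gene = approx_subset_r2(0.3, 0.0001)  # gene covering 0.01% of the genome
pathway = approx_subset_r2(0.3, 0.01)        # gene set covering 1% of the genome
print(f"gene: {single_gene:.6f}, pathway: {pathway:.4f}")
```

Under these assumed fractions the pathway-level R2 is a hundred times larger than the gene-level one, which is the sense in which grouping genes might provide more power.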

ranijames commented 3 years ago

Thanks a lot, that makes sense to me. Thanks again for your time and patience.

ranijames commented 3 years ago

By the way, where can I find the reference for the PLINK PRS formula mentioned in the documentation? I searched for it in PLINK's manual but could not find it. It would be great if you could share the source. Thanks.

choishingwan commented 3 years ago

Our website has it: prsice.info
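For convenience, the average-score formula is sketched below in the form it is usually written in the tutorial's PLINK section. It is reproduced from memory as a reading aid, so the symbols should be checked against the documentation at prsice.info before being cited.

```latex
PRS_j = \frac{\sum_{i=1}^{N} S_i \, G_{ij}}{P \times M_j}
```

Here S_i is the effect size of SNP i, G_ij is the number of effect alleles observed in sample j, P is the ploidy (2 for humans), N is the number of SNPs included in the score, and M_j is the number of non-missing SNPs in sample j.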

choishingwan / PRS-Tutorial

Not able to reproduce the best fit PRS for plink #27