getian107 / PRScsx

Cross-population polygenic prediction
MIT License

how to get the final snp weight? #44

Closed nbj25 closed 7 months ago

nbj25 commented 7 months ago

Hi, I have 3 populations, and after running PRS-CSx I got three separate sets of SNP weights plus the meta weights. I thought the --meta weight is the final per-SNP weight that can be applied to different populations regardless of ancestry. However, the meta option is not recommended in the instructions.
If I calculate the PRS per person and then fit a regression on a linear combination (EURPRS + AFRPRS + etc.), I will get overall beta coefficients given the phenotype. How can I then re-apply these learned coefficients to each SNP? I guess I am a little confused about what the final weight for each SNP [beta-per-snp] is when multiple ancestries are considered. Is it still the --meta weight?

thanks

getian107 commented 7 months ago

Hi - Both the linear combination approach and the meta-analysis approach are valid. The linear combination approach can maximize the prediction accuracy in specific populations but requires an independent validation dataset to learn the linear combination weights. When a validation dataset is not available (due to e.g. limited total sample size in the target dataset), you can use the auto version of the algorithm + the meta-analysis approach which does not require a validation set. This will give you per-SNP weights that can be applied to any population.

To use the linear combination approach, you will calculate the score per individual in the validation dataset. You will then fit a linear regression like:
phenotype ~ covariates + beta_EUR * PRS_EUR + beta_AFR * PRS_AFR + ... + an error term
Fitting this regression will give you beta_EUR, beta_AFR, etc.

Then in the testing sample, the final PRS will be calculated as:
PRS = beta_EUR * PRS_EUR + beta_AFR * PRS_AFR + ...

Note that the betas here are weights to combine the population-specific PRS, not weights at the SNP level. You might also want to standardize the population-specific PRS so the weights are more transferable from the validation to the testing set.
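A minimal sketch of the two stages described above, on simulated data. Everything here is an assumption made for illustration: the toy `simulate` function, the single `age` covariate, and in particular the choice to standardize the testing-set scores with the validation-set mean and SD (standardizing within each dataset is another reasonable reading of the advice above).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Toy data: two correlated population-specific scores, one covariate, a phenotype."""
    prs_eur = rng.normal(size=n)
    prs_afr = 0.5 * prs_eur + rng.normal(size=n)
    age = rng.normal(50, 10, size=n)
    y = 0.02 * age + 0.1 * prs_eur + 0.3 * prs_afr + rng.normal(size=n)
    return np.column_stack([prs_eur, prs_afr]), age, y

def zscore(x, mean, sd):
    return (x - mean) / sd

# Validation set: learn the combination weights beta_EUR and beta_AFR.
prs_val, age_val, y_val = simulate(2000)
mu, sd = prs_val.mean(axis=0), prs_val.std(axis=0)
X_val = np.column_stack([np.ones(len(y_val)), age_val, zscore(prs_val, mu, sd)])
coef, *_ = np.linalg.lstsq(X_val, y_val, rcond=None)
beta = coef[2:]  # [beta_EUR, beta_AFR]; coef[0] is the intercept, coef[1] is age

# Testing set: reuse the validation mean/SD and betas; nothing is re-estimated.
prs_test, age_test, y_test = simulate(1000)
final_prs = zscore(prs_test, mu, sd) @ beta

print("beta_EUR, beta_AFR:", beta)
```

The point of the split is that `beta` is estimated once in the validation set and then held fixed, so the testing set gives an honest estimate of how well the combined score predicts.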

nbj25 commented 7 months ago

Thank you. I am fairly new to this, so just to clarify for option 2:
PRS_EUR = calculated with the score sum function of plink using the EUR SNP posterior weights output from PRS-CSx (we then also standardize the distribution per ancestry at the next step).
beta_EUR = I am not sure how to get this. Is it perhaps calculated by first fitting a single-ancestry EUR model like phenotype ~ covariates + PRS_EUR?
beta_EUR*PRS_EUR = does this mean manually multiplying these two values first and using the product as a new variable in the linear model?

best,

getian107 commented 7 months ago

Say you have a validation set that has N samples. You can calculate PRS_EUR, PRS_AFR, etc for each individual in this dataset. PRS_EUR is calculated using the plink score function and EUR posterior weights. Similarly PRS_AFR is calculated using AFR posterior weights.
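As an illustration of what the per-individual score is, here is a small sketch of a sum-type score: for each person, the count of the effect allele at each SNP times that SNP's posterior effect size, summed over SNPs. The dosage matrix and weights below are made-up numbers, and the sketch assumes the genotypes are already coded as counts of the same effect allele (A1) that the weights refer to, with no missing genotypes. In practice plink handles allele matching and missingness for you.

```python
import numpy as np

def prs_sum(dosage, weights):
    """Sum over SNPs of (count of effect allele A1) x (posterior effect size)."""
    return dosage @ weights

# 3 individuals x 4 SNPs; entries are A1 allele counts (0, 1 or 2).
dosage = np.array([[0, 1, 2, 0],
                   [1, 1, 0, 2],
                   [2, 0, 1, 1]], dtype=float)

# Hypothetical posterior effect sizes for the same 4 SNPs and the same A1 allele.
w_eur = np.array([0.10, -0.20, 0.05, 0.00])
w_afr = np.array([0.02, -0.10, 0.30, 0.04])

prs_eur = prs_sum(dosage, w_eur)  # one value per individual
prs_afr = prs_sum(dosage, w_afr)
```

The same genotypes are scored twice, once per weight file, which is why every individual ends up with both a PRS_EUR and a PRS_AFR.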

You can then fit a multiple regression in the validation set:
y ~ covariates + beta_EUR * PRS_EUR + beta_AFR * PRS_AFR + ...
Here, y, PRS_EUR and PRS_AFR are Nx1 vectors; beta_EUR and beta_AFR are regression coefficients.

Then in an independent testing set that has, say, M samples, you can also calculate PRS_EUR, PRS_AFR, etc. for each individual. You will then use the estimated beta_EUR and beta_AFR from the last step to calculate:
PRS = beta_EUR * PRS_EUR + beta_AFR * PRS_AFR
Here, PRS, PRS_EUR and PRS_AFR are Mx1 vectors; beta_EUR and beta_AFR are the values estimated in the validation set. This PRS is the final score, whose predictive performance you will evaluate in the testing set.
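One common way to do that evaluation for a quantitative trait is the incremental R2: the R2 of a model with covariates plus the final PRS, minus the R2 of a covariates-only model. The sketch below is an assumption about how one might set this up, not a prescribed procedure; the data are simulated and `r_squared` is a helper written for this example.

```python
import numpy as np

def r_squared(X, y):
    """R2 of an ordinary least-squares fit of y on X (intercept added here)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Simulated testing set with M samples (toy numbers only).
rng = np.random.default_rng(1)
m = 1500
covariates = rng.normal(size=(m, 3))
final_prs = rng.normal(size=m)  # stands in for beta_EUR * PRS_EUR + beta_AFR * PRS_AFR
y = covariates @ np.array([0.2, -0.1, 0.05]) + 0.3 * final_prs + rng.normal(size=m)

r2_base = r_squared(covariates, y)
r2_full = r_squared(np.column_stack([covariates, final_prs]), y)
incremental_r2 = r2_full - r2_base
print(f"covariates only: {r2_base:.4f}, with PRS: {r2_full:.4f}")
```

Reporting the increment rather than the full-model R2 separates what the score adds from what covariates such as age, sex and PCs already explain.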

Does this make sense?

nbj25 commented 7 months ago

Thank you for the response. I still need to understand this approach. I don't know how to get beta_EUR from the PRS-CSx output, apart from what I wrote in number 2 above.
I think option 2 might only improve the PRS performance and R2 but doesn't change the weight per SNP, and the META output is still the only one that can be shared with the scientific community as multi-ancestry SNP weights. Also, is N a mix of ancestries or a single ancestry?
You might consider providing this information in simpler language in the program instructions.

regards,

best,

getian107 commented 7 months ago

Thanks for the feedback. We will provide more details on GitHub. More information can also be found in the PRS-CSx publication: https://www.nature.com/articles/s41588-022-01054-7

Let me give you a concrete example which hopefully will make things clear. Assume you have GWAS summary statistics from EUR and AFR populations, and you want to build and evaluate a PRS in an EAS cohort where you have phenotypic and genetic data for each individual.
Step 1: Run PRS-CSx, which will give you two sets of per-SNP weights, corresponding to the EUR and AFR GWAS, respectively.
Step 2: Using the two sets of SNP weights, calculate two PRS for each individual in the EAS cohort, PRS_EUR and PRS_AFR.
Step 3: In the EAS cohort, fit a regression:
y ~ covariates + beta_EUR * PRS_EUR + beta_AFR * PRS_AFR
where beta_EUR and beta_AFR are regression coefficients. Fitting this regression will give you estimates of beta_EUR and beta_AFR.

You can then share the following information with the scientific community: (i) the two sets of SNP weights (which will be used to generate PRS_EUR and PRS_AFR) AND (ii) the estimated beta_EUR and beta_AFR (which tell how PRS_EUR and PRS_AFR can be linearly combined to predict the phenotype in EAS).

Now if you want to build another PRS for an admixed American sample, you repeat the same steps. The two sets of SNP weights don't change, but beta_EUR and beta_AFR will change. In other words, beta_EUR and beta_AFR are target-population-specific. Intuitively, if the target population is genetically closer to EUR, you will give more weight to PRS_EUR; if the target population is closer to AFR, you will give more weight to PRS_AFR when doing the linear combination.
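On the original question of a single per-SNP weight: this is not stated in the thread, but it follows from linearity that a combination of unstandardized sum scores can be rewritten as one score with combined per-SNP weights, w = beta_EUR * w_EUR + beta_AFR * w_AFR. The sketch below checks this numerically on random toy data. All values are hypothetical, and it assumes both weight sets cover the same SNPs with the same effect allele. If the population-specific PRS were standardized before fitting, each beta would first need to be divided by the corresponding SD, and the two scores would differ by a constant shift.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 50
dosage = rng.integers(0, 3, size=(n, p)).astype(float)  # A1 allele counts

# Hypothetical per-SNP posterior weights and combination weights.
w_eur = rng.normal(scale=0.01, size=p)
w_afr = rng.normal(scale=0.01, size=p)
beta_eur, beta_afr = 0.2, 0.7

# Route 1: score with each weight set, then combine the two scores.
combined_scores = beta_eur * (dosage @ w_eur) + beta_afr * (dosage @ w_afr)

# Route 2: combine the weights per SNP first, then score once.
w_combined = beta_eur * w_eur + beta_afr * w_afr
direct_scores = dosage @ w_combined

print(np.allclose(combined_scores, direct_scores))
```

Because beta_EUR and beta_AFR are target-population-specific, any combined weights built this way are also specific to the target population in which the betas were learned.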

Hope this helps.

nbj25 commented 7 months ago

thank you very much...this is much more clear now and yes, i would encourage you to add this with one toy-set in the instruction. 👍

nbj25 commented 7 months ago

Hi, I am also adding two linear regression results below; maybe you can add some suggestions? The first one uses the --meta option PRS z-score and the second one adds all ancestries into the linear model. The target population is African. The betas in the latter sum to about 0.12 (zafr+zeur+zamr+zeas+zsas), while the meta option only gives a beta of zmain=0.01. Please see the R-squared as well.

. regress zvitd pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 bmi age sex zmain

      Source |       SS           df       MS      Number of obs   =     3,159
-------------+----------------------------------   F(14, 3144)     =     13.61
       Model |  180.391099        14  12.8850785   Prob > F        =    0.0000
    Residual |  2977.60877     3,144   .94707658   R-squared       =    0.0571
-------------+----------------------------------   Adj R-squared   =    0.0529
       Total |  3157.99987     3,158  .999999958   Root MSE        =    .97318

------------------------------------------------------------------------------
       zvitd | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         pc1 |  -.0008051   .0005774    -1.39   0.163    -.0019371     .000327
         pc2 |  -.0015021   .0013236    -1.13   0.257    -.0040972     .001093
         pc3 |  -.0003068   .0024325    -0.13   0.900    -.0050763    .0044628
         pc4 |  -.0038745   .0032219    -1.20   0.229    -.0101918    .0024427
         pc5 |  -.0071537   .0067753    -1.06   0.291    -.0204382    .0061307
         pc6 |   .0061869   .0047728     1.30   0.195    -.0031712    .0155451
         pc7 |  -.0021542   .0011263    -1.91   0.056    -.0043625    .0000541
         pc8 |   .0016605   .0012375     1.34   0.180    -.0007658    .0040868
         pc9 |  -.0000741   .0055911    -0.01   0.989    -.0110367    .0108886
        pc10 |  -.0007993   .0043861    -0.18   0.855    -.0093991    .0078006
         bmi |  -.0165841   .0033111    -5.01   0.000    -.0230763    -.010092
         age |   .0209378    .002165     9.67   0.000     .0166928    .0251828
         sex |   .1457712   .0359209     4.06   0.000     .0753404    .2162021
       zmain |   .0126331   .0174282     0.72   0.469    -.0215386    .0468048
       _cons |  -.7319241   .2256039    -3.24   0.001     -1.17427   -.2895783
------------------------------------------------------------------------------

. regress zvitd pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 bmi age sex zafr zeur zsas zeas zamr

      Source |       SS           df       MS      Number of obs   =     3,159
-------------+----------------------------------   F(18, 3140)     =     11.73
       Model |  198.892503        18  11.0495835   Prob > F        =    0.0000
    Residual |  2959.10736     3,140   .94239088   R-squared       =    0.0630
-------------+----------------------------------   Adj R-squared   =    0.0576
       Total |  3157.99987     3,158  .999999958   Root MSE        =    .97077

------------------------------------------------------------------------------
       zvitd | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         pc1 |  -.0010733   .0005945    -1.81   0.071     -.002239    .0000923
         pc2 |  -.0014577   .0013348    -1.09   0.275    -.0040749    .0011596
         pc3 |  -.0000762   .0024301    -0.03   0.975     -.004841    .0046886
         pc4 |  -.0040007   .0032198    -1.24   0.214    -.0103138    .0023123
         pc5 |  -.0070209    .006766    -1.04   0.300    -.0202872    .0062453
         pc6 |   .0055502    .004777     1.16   0.245    -.0038161    .0149165
         pc7 |  -.0009017   .0011661    -0.77   0.439    -.0031882    .0013847
         pc8 |   .0019597   .0012381     1.58   0.114    -.0004678    .0043873
         pc9 |   .0006616    .005581     0.12   0.906    -.0102812    .0116044
        pc10 |  -.0021785    .004388    -0.50   0.620    -.0107821    .0064251
         bmi |  -.0164395   .0033031    -4.98   0.000     -.022916   -.0099631
         age |   .0208493   .0021646     9.63   0.000     .0166051    .0250934
         sex |   .1447575   .0358511     4.04   0.000     .0744636    .2150514
        zafr |   .0785994   .0185223     4.24   0.000     .0422825    .1149164
        zeur |  -.0052748    .017376    -0.30   0.761    -.0393444    .0287947
        zsas |   .0166476   .0180458     0.92   0.356    -.0187351    .0520304
        zeas |   .0133731   .0183912     0.73   0.467    -.0226868     .049433
        zamr |   .0142227   .0180907     0.79   0.432     -.021248    .0496935
       _cons |  -.6668099   .2292541    -2.91   0.004    -1.116313   -.2173068
------------------------------------------------------------------------------

getian107 commented 7 months ago

The regression coefficients may be difficult to interpret and we usually focus on prediction metrics such as R2, AUC, etc. The R2 estimates of the two models here look reasonable.
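For reference, the R2 and adjusted R2 in the two Stata outputs above can be recomputed from the reported sums of squares. The short sketch below uses only numbers copied from those outputs; the helper name `r2_stats` is made up for this example. Note that both R2 values include the contribution of the covariates (PCs, BMI, age, sex), so the difference between the two models is a better guide than either value alone, and both models were fitted and evaluated in the same 3,159 samples.

```python
def r2_stats(ss_model, ss_total, n_obs, n_predictors):
    """R2 and adjusted R2 from the ANOVA block of a regression output."""
    r2 = ss_model / ss_total
    adj = 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
    return r2, adj

# Model with the single --meta score (13 covariates + zmain = 14 predictors).
meta_r2, meta_adj = r2_stats(180.391099, 3157.99987, 3159, 14)

# Model with five population-specific scores (13 covariates + 5 = 18 predictors).
multi_r2, multi_adj = r2_stats(198.892503, 3157.99987, 3159, 18)

print(f"meta:  R2={meta_r2:.4f}, adj R2={meta_adj:.4f}")
print(f"multi: R2={multi_r2:.4f}, adj R2={multi_adj:.4f}")
print(f"difference in R2: {multi_r2 - meta_r2:.4f}")
```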

nbj25 commented 7 months ago

Great, Thank you very much!