getian107 / PRScsx

Cross-population polygenic prediction
MIT License

How to Choose Validation and Testing dataset #49

Open ciel1021 opened 1 month ago

ciel1021 commented 1 month ago

Hi Dr. Ge,

I am working on PRScsx, integrating target data (AFR, n=2000) with base data from EUR and AFR. I am not sure I fully understand how to split the target dataset into validation and test datasets. Should I do a 20/80 split before running the calculation for the different phi values?

Or should I compute the PRScsx scores first, and then use train_test_split() in Python to set aside 20% as validation and 80% as testing?

Is it possible, in the end, to calculate the PRS with the whole dataset instead of only the test dataset? I am worried about the power of the analysis given the small sample size.

I believe in the value of PRScsx, but I am having a hard time working through it. Thank you in advance.

getian107 commented 1 month ago

Hi -- There are two options: (i) Run the 'auto' and 'meta' version of PRS-CSx on the base GWAS, and evaluate the PRS on the entire AFR target dataset. In this case all model parameters are automatically learnt and there is no need to split the target dataset. (ii) Run PRS-CSx on the base GWAS using different phi. You would then need to split the AFR target dataset into validation and testing. You would use the validation set to select the best phi value, and evaluate the final PRS in the test set.
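To illustrate option (ii), here is a minimal sketch of the validation/test workflow. It is not code from the PRS-CSx repository. It assumes a PRS has already been computed for every target individual at each phi, that the phenotype is continuous (so squared Pearson correlation is a reasonable metric), and that a 20/80 split is used as in the question. The phi grid, the simulated scores, and the helper names `pearson_r2`, `split_indices`, and `select_phi` are all hypothetical stand-ins.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def split_indices(n, validation_fraction=0.2, seed=1):
    """Shuffle 0..n-1 once and cut it into disjoint validation and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_validation = round(n * validation_fraction)
    return indices[:n_validation], indices[n_validation:]


def select_phi(scores_by_phi, phenotype, validation_idx, test_idx):
    """Pick phi on the validation set only, then report R^2 on the test set."""

    def r2_on(idx, scores):
        return pearson_r2([scores[i] for i in idx], [phenotype[i] for i in idx])

    validation_r2 = {phi: r2_on(validation_idx, s) for phi, s in scores_by_phi.items()}
    best_phi = max(validation_r2, key=validation_r2.get)
    test_r2 = r2_on(test_idx, scores_by_phi[best_phi])
    return best_phi, validation_r2, test_r2


# Simulated stand-in data (assumption): in practice, scores_by_phi would hold
# the real per-individual PRS for each phi and phenotype the measured trait.
rng = random.Random(0)
n = 2000
signal = [rng.gauss(0, 1) for _ in range(n)]
phenotype = [g + rng.gauss(0, 1) for g in signal]
noise_sd_by_phi = {1e-6: 3.0, 1e-4: 2.0, 1e-2: 0.5, 1.0: 2.5}  # illustrative only
scores_by_phi = {
    phi: [g + rng.gauss(0, sd) for g in signal]
    for phi, sd in noise_sd_by_phi.items()
}

validation_idx, test_idx = split_indices(n, validation_fraction=0.2, seed=1)
best_phi, validation_r2, test_r2 = select_phi(
    scores_by_phi, phenotype, validation_idx, test_idx
)
print("best phi on validation:", best_phi)
print("R^2 in held-out test set:", round(test_r2, 3))
```

The point of the design is that the test individuals play no part in choosing phi, so the reported test R^2 is not inflated by the selection step. For a binary phenotype, a different metric (for example AUC) would replace `pearson_r2`.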

ciel1021 commented 1 month ago

Hi Dr. Ge, I appreciate your quick response. Your advice helps me a lot. I have another question regarding clumping. I briefly went through the script and did not see a clumping step. Do we need to do it before performing the PRScsx analysis? Thank you again for your help.

getian107 commented 1 month ago

No - clumping is not needed. All Bayesian PRS methods try to include all variants and explicitly model LD.