getian107 / PRScsx

Cross-population polygenic prediction
MIT License

How to input "VALIDATION_BIM_PREFIX (required)" #51

Closed: Xi-Cao closed this issue 1 month ago

Xi-Cao commented 1 month ago

Hi there, thanks for your work on developing PRS. I'm a little confused about the description of this argument: "VALIDATION_BIM_PREFIX (required): Full path and the prefix of the bim file for the target (validation/testing) dataset. This file is used to provide a list of SNPs that are available in the target dataset."

Should this BIM file only include variants present in all target samples? For instance, what if I input a larger BIM file from a dataset that contains my target sample, or if I simply use a 1KG BIM file for the corresponding population? Would this affect the posterior effect estimates?

Thanks, xicao

Xi-Cao commented 1 month ago

I thought the former would increase the computational burden, while the latter might result in missing variant effects. Is that correct?

getian107 commented 1 month ago

Hi Xicao - Ideally you would use a bim file that only includes variants that are available in the target dataset. In general, the software calculates posterior effect sizes for variants that are present in all three of the GWAS summary statistics, the LD reference panel (HapMap3 variants) and the bim file. Therefore, if the bim file includes variants that are in the GWAS and the reference but not in the target set, you will have missing variants at the scoring stage, which might impact prediction accuracy. This is usually not a big problem because most HapMap3 variants are included in well-imputed datasets. But the prediction accuracy may decrease if a large number of variants in the posterior output are missing from the target set.
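
The overlap logic described above can be checked before scoring. What follows is only a rough sketch, not code from PRS-CSx itself: the function names are hypothetical, variants are matched by ID alone (any allele matching the software performs is ignored), and the only file format assumed is the standard PLINK .bim layout, where the variant ID is the second whitespace-separated column.

```python
def read_bim_ids(path):
    """Read variant IDs (2nd column) from a PLINK .bim file."""
    with open(path) as handle:
        return {line.split()[1] for line in handle if line.strip()}


def variants_with_posterior(gwas_ids, ld_ref_ids, bim_ids):
    """Variants expected to receive posterior effects: those present in
    the GWAS summary statistics, the LD reference panel AND the bim file.
    (Hypothetical helper; matching is by ID only.)"""
    return set(gwas_ids) & set(ld_ref_ids) & set(bim_ids)


def missing_at_scoring(posterior_ids, target_ids):
    """Return the variants that have a posterior effect but are absent
    from the target genotypes, plus the fraction they represent."""
    posterior_ids = set(posterior_ids)
    missing = posterior_ids - set(target_ids)
    fraction = len(missing) / len(posterior_ids) if posterior_ids else 0.0
    return missing, fraction
```

If the bim file passed as VALIDATION_BIM_PREFIX is larger than the actual target set (for example a 1KG bim file), comparing `read_bim_ids` on that file against the IDs actually genotyped or imputed in the target samples gives a quick estimate of how many variants would be dropped when the score is computed.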

Xi-Cao commented 1 month ago

I understand. Thanks for your reply!