getian107 / PRScsx

Cross-population polygenic prediction
MIT License

specify the bim file #11

Closed by kaibios0101 3 years ago

kaibios0101 commented 3 years ago

Hi, Tian. Thank you for your valuable tool for PRS analysis.

PRScsx.py requires a bim file from the testing/validation set. The bim file from the testing set generally has many more variants (~5M or more) than the SNP information file (~1M) that ships with the tool, and only variants present in both the testing set and the SNP information are returned. If we pre-filter the testing set to include only variants that appear in the SNP information, this could substantially reduce the computational burden and memory use. Does this additional step affect the beta estimation or prediction accuracy?

Bests,

Kai

getian107 commented 3 years ago

Hi Kai, PRS-CS and PRS-CSx only use HapMap3 SNPs that are shared between the GWAS summary statistics, the LD reference panel and the target dataset. Pre-filtering the target dataset to HapMap3 SNPs does not affect effect size estimation. I'm not sure this procedure significantly reduces the computational burden, though. Finding shared SNPs across the three files usually only takes a couple of minutes, even if the summary statistics and target dataset contain millions of SNPs. Individual-level data from the target dataset is only used in the PRS scoring step, and PLINK's --score function is quite computationally efficient.
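
For readers who want to see what the "shared SNPs" step amounts to, here is a minimal sketch in Python. It is not the PRScsx code. The function names `read_snp_ids` and `shared_snps` are hypothetical, the toy data is made up, and the column layouts are assumptions (SNP ID in the second column of the .bim and SNP info files, and in the first column of the summary statistics). It matches on SNP ID only, whereas the real tool may also check alleles.

```python
import io

def read_snp_ids(handle, column, skip_header=False):
    """Collect the SNP IDs found in one whitespace-delimited column."""
    ids = set()
    for i, line in enumerate(handle):
        if skip_header and i == 0:
            continue
        fields = line.split()
        if len(fields) > column:
            ids.add(fields[column])
    return ids

def shared_snps(bim, sumstats, snpinfo):
    """Intersect SNP IDs across target .bim, summary statistics and SNP info."""
    bim_ids = read_snp_ids(bim, column=1)                       # assumed: chr, id, cM, bp, a1, a2
    ss_ids = read_snp_ids(sumstats, column=0, skip_header=True)  # assumed: SNP A1 A2 BETA P
    ref_ids = read_snp_ids(snpinfo, column=1, skip_header=True)  # assumed: CHR SNP BP A1 A2 ...
    return bim_ids & ss_ids & ref_ids

# Toy in-memory inputs standing in for real files (all values invented).
bim_text = (
    "1 rs1 0 1000 A G\n"
    "1 rs2 0 2000 C T\n"
    "1 rs3 0 3000 G A\n"
    "1 rs4 0 4000 T C\n"
)
sumstats_text = (
    "SNP A1 A2 BETA P\n"
    "rs2 C T 0.01 0.5\n"
    "rs3 G A -0.02 0.1\n"
    "rs5 A C 0.03 0.2\n"
)
snpinfo_text = (
    "CHR SNP BP A1 A2 MAF\n"
    "1 rs2 2000 C T 0.3\n"
    "1 rs3 3000 G A 0.2\n"
    "1 rs4 4000 T C 0.1\n"
)

shared = shared_snps(
    io.StringIO(bim_text), io.StringIO(sumstats_text), io.StringIO(snpinfo_text)
)

# Pre-filtering the .bim keeps only rows whose ID is in the shared set,
# so the intersection computed afterwards is unchanged.
filtered_bim = [ln for ln in bim_text.splitlines() if ln.split()[1] in shared]
print(sorted(shared))
print(filtered_bim)
```

Because the intersection is the same whether or not the .bim is filtered first, pre-filtering leaves the set of SNPs entering the model unchanged, which is consistent with the answer above. If a smaller target dataset is still wanted, the shared IDs could be written to a text file and passed to PLINK's `--extract`.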

kaibios0101 commented 3 years ago

Hi Tian. I understand your explanation. Thank you!

Bests,

Kai.