getian107 / PRScs

Polygenic prediction via continuous shrinkage priors
MIT License

How many overlap SNPs after running LD reference panel is considered common? #25

Closed by jasmine9764 3 years ago

jasmine9764 commented 3 years ago

Dear researchers, Thank you for developing such an amazing tool. We recently ran PRS-CS on UK Biobank data (original genotypes, without imputation). Of the 400,000+ SNPs, only 140,000 remained for analysis after matching against the European LD reference panel.

Performance on the testing data was unexpectedly good. We understand the purpose of imputation; however, we would like to know how many SNPs are typically left out, in your experience, when using data from the consortium. I could not find this information in the paper. Thank you in advance.

getian107 commented 3 years ago

Hi- PRS-CS uses HapMap3 SNPs for prediction. The number of overlapping SNPs between UKBB genotyped data and HapMap3 (i.e., 140K SNPs) looks correct. We would recommend using UKBB imputed data, which will raise the number of SNPs used for prediction to ~1 million.
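For readers who want to estimate this overlap before running PRS-CS, here is a minimal sketch that counts SNP IDs shared between a PLINK .bim file and the reference panel's SNP list. It assumes the variant ID is in the second column of both files and that the reference SNP list (snpinfo_1kg_hm3 in the thread) has a header row. The helper `read_snp_ids` and the toy data are hypothetical and not part of PRS-CS. PRS-CS may also compare alleles when matching, so an ID-only count should be read as an upper bound.

```python
import io

def read_snp_ids(handle, id_column, has_header):
    """Collect SNP IDs from a whitespace-delimited, file-like object."""
    ids = set()
    for i, line in enumerate(handle):
        if has_header and i == 0:
            continue  # skip the header row
        fields = line.split()
        if len(fields) > id_column:
            ids.add(fields[id_column])
    return ids

# Toy stand-ins for a .bim file (no header) and a reference SNP list (with header).
# Real files would be opened with open(path) instead of io.StringIO.
bim_text = (
    "1\trs1\t0\t100\tA\tG\n"
    "1\trs2\t0\t200\tC\tT\n"
    "1\trs3\t0\t300\tA\tC\n"
)
ref_text = (
    "CHR\tSNP\tBP\tA1\tA2\tMAF\n"
    "1\trs2\t200\tC\tT\t0.3\n"
    "1\trs3\t300\tA\tC\t0.1\n"
    "1\trs9\t900\tG\tT\t0.2\n"
)

bim_ids = read_snp_ids(io.StringIO(bim_text), id_column=1, has_header=False)
ref_ids = read_snp_ids(io.StringIO(ref_text), id_column=1, has_header=True)
shared = bim_ids & ref_ids

print(f"{len(shared)} of {len(bim_ids)} target SNPs are in the reference list")
```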

jasmine9764 commented 3 years ago

Thank you for the prompt reply!

Fiwx commented 1 year ago

Hi Tian, thanks for building this remarkable program. I am also getting very low overlap between summary statistics and reference panel. There are ~10,000 total SNPs in the summary statistics, and I used the same summary statistics file to make the .bim (so they have the same SNPs). I also tried a different, very large .bim to include more variants, but I did not expect that to help, and it did not. After generating the new weights using PRS-CS-auto, I'm planning to score a number of individual genotypes using the new weights.

I ran this command:

python PRScs.py --ref_dir=data/ldblk_1kg_eur --bim_prefix=data/BimFromSumStats --sst_file=data/SumStats.tsv --n_gwas=1000000 --out_dir=output/runs

And got this output (for example):

... 92617 SNPs on chromosome 1 read from test_data/ldblk_1kg_eur/snpinfo_1kg_hm3 ...
... 680 SNPs on chromosome 1 read from test_data/BimFromSumStats.bim ...
... 146 common SNPs in the reference, sumstats, and validation set ...
[I.e., 146/680 = 21% of SNPs remaining.]

... 16464 SNPs on chromosome 22 read from test_data/ldblk_1kg_eur/snpinfo_1kg_hm3 ...
... 124 SNPs on chromosome 22 read from test_data/BimFromSumStats.bim ...
... 32 common SNPs in the reference, sumstats, and validation set ...
[I.e., 32/124 = 26% of SNPs remaining.]
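The per-chromosome percentages can be recomputed from the log text. The sketch below assumes each log message sits on its own line and is worded exactly as in the excerpt above. The function name `overlap_fractions` is hypothetical and not part of PRS-CS.

```python
import re

# Patterns follow the wording of the log lines quoted in this thread.
BIM_RE = re.compile(r"(\d+) SNPs on chromosome (\d+) read from \S+\.bim")
COMMON_RE = re.compile(
    r"(\d+) common SNPs in the reference, sumstats, and validation set"
)

def overlap_fractions(log_text):
    """Map chromosome -> (n_common, n_bim, n_common / n_bim)."""
    result = {}
    current = None  # (chromosome, n_bim) from the most recent .bim line
    for line in log_text.splitlines():
        match = BIM_RE.search(line)
        if match:
            current = (int(match.group(2)), int(match.group(1)))
            continue
        match = COMMON_RE.search(line)
        if match and current is not None:
            chrom, n_bim = current
            n_common = int(match.group(1))
            fraction = n_common / n_bim if n_bim else 0.0
            result[chrom] = (n_common, n_bim, fraction)
            current = None
    return result

log = """\
... 92617 SNPs on chromosome 1 read from test_data/ldblk_1kg_eur/snpinfo_1kg_hm3 ...
... 680 SNPs on chromosome 1 read from test_data/BimFromSumStats.bim ...
... 146 common SNPs in the reference, sumstats, and validation set ...
... 16464 SNPs on chromosome 22 read from test_data/ldblk_1kg_eur/snpinfo_1kg_hm3 ...
... 124 SNPs on chromosome 22 read from test_data/BimFromSumStats.bim ...
... 32 common SNPs in the reference, sumstats, and validation set ...
"""

fractions = overlap_fractions(log)
for chrom, (n_common, n_bim, frac) in sorted(fractions.items()):
    print(f"chr{chrom}: {n_common}/{n_bim} = {frac:.0%}")
```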

Across all chromosomes, roughly 18-28% of SNPs remain. This seems low to me; I would expect accuracy to suffer if only a minority of SNPs is used. Is this normal, or have I done something wrong? What causes the low overlap? Assuming it is normal, does PRS-CS-auto achieve its better prediction accuracy over P+T (https://www.nature.com/articles/s41467-019-09718-5/figures/2) despite P+T (or unadjusted PRS weights) having significantly higher overlap, i.e., using more SNPs?

Thank you.

getian107 commented 1 year ago

Hi- I think the problem is that you started with a set of summary statistics that has already been pruned/clumped, with only 10K variants left. The ideal input for PRS-CS is a set of full summary statistics without any pruning. The benefit of PRS-CS over P+T comes from appropriate modeling of LD. If the summary statistics have been pruned, then all SNPs are largely independent of each other, and thus PRS-CS probably wouldn't show much advantage over naive methods. Also, PRS-CS uses HapMap3 variants to make predictions. Pruning does not prioritize HapMap3 variants and thus leads to low overlap between the summary stats and the PRS-CS reference panel. If you only have access to a pruned set of summary stats, then P+T is probably the method to use.
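To see which input limits the final SNP count, it can help to compare the three SNP sets pairwise. The sketch below is illustrative only: `overlap_report` is a hypothetical helper, the rsIDs are made up, and matching is by ID alone. The toy data mimics the situation described here, where a pruned summary statistics file is also used to build the .bim.

```python
def overlap_report(ref_ids, sst_ids, bim_ids):
    """Count pairwise and three-way overlaps between SNP ID collections."""
    ref_ids, sst_ids, bim_ids = set(ref_ids), set(sst_ids), set(bim_ids)
    return {
        "ref_and_sst": len(ref_ids & sst_ids),
        "ref_and_bim": len(ref_ids & bim_ids),
        "sst_and_bim": len(sst_ids & bim_ids),
        "all_three": len(ref_ids & sst_ids & bim_ids),
    }

# Toy example: the reference holds HapMap3-like SNPs; the pruned summary
# statistics kept variants without regard to that list.
ref = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}
sst = {"rs2", "rs5", "rs100", "rs101"}
bim = set(sst)  # .bim built from the same summary statistics

report = overlap_report(ref, sst, bim)
for name, count in report.items():
    print(f"{name}: {count}")
```

In this toy case the summary statistics and .bim agree fully, yet only half of their SNPs are in the reference, so the three-way overlap is capped by the reference match rather than by the target data.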

Fiwx commented 1 year ago

Thank you for the helpful reply. I have non-pruned summary statistics so I will use those instead.