getian107 / PRScs

Polygenic prediction via continuous shrinkage priors
MIT License

HapMap Variants #47

Closed · rameez500 closed this issue 2 years ago

rameez500 commented 2 years ago

Hi Tian, I downloaded the LD reference panels and extracted the files from the link: https://github.com/getian107/PRScs

I believe the reference panel provides approximately 1.2 million HapMap variants. I have about 5 million variants in the target dataset and 10 million in the summary statistics. Just wanted to confirm: no matter how large the target and summary statistics datasets are, we can't exceed the 1.2 million HapMap variants in the PRS-CS tool, is that correct?

If not, could you let me know how we can estimate SNP effects for more than 1.2 million HapMap variants with the PRS-CS tool?

Thank you!

getian107 commented 2 years ago

Hi- Yes, PRS-CS always intersects the GWAS summary statistics with the HapMap3 reference panels. The primary reasons are: (i) HapMap variants are often well imputed and tag the majority of common genetic variation; (ii) using multi-million-variant panels can be computationally expensive; (iii) going beyond HapMap variants doesn't necessarily produce better predictions (theoretically it should, but in practice prediction performance may not robustly increase, due to poorer model convergence, the difficulty of modeling many highly correlated variants, etc.). You could build your own reference panels that use a different set of variants than HapMap3, but in most applications the released HapMap reference panels should work reasonably well.
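
For reference, here is a minimal sketch of how to check this intersection yourself, assuming the SNP-info list shipped with the PRScs LD reference download (the file name and column headers below are assumptions; adjust them to the files in your panel) and a PRScs-style summary statistics file:

```python
import pandas as pd

# SNP list from the LD reference download; "snpinfo_1kg_hm3" and the "SNP"
# column header are assumptions -- check the files in your own panel.
ref = pd.read_csv("ldblk_1kg_eur/snpinfo_1kg_hm3", sep=r"\s+")

# GWAS summary statistics with a SNP ID column (PRScs-style headers such as
# SNP A1 A2 BETA P; only the SNP column is used here).
sst = pd.read_csv("sumstats.txt", sep=r"\s+")

overlap = set(ref["SNP"]) & set(sst["SNP"])
print(f"{len(overlap):,} of {len(sst):,} sumstats variants are in the reference panel")
```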

rameez500 commented 2 years ago

Thanks for the quick response.

The overlap between the 1.2 million HapMap variants and my target dataset is around 400,000 variants. This is a low overlap, so I was looking at other HapMap references with around 1.6 million variants. Link: https://www.rdocumentation.org/packages/bigsnpr/versions/1.9.11/topics/download_1000G

With this new reference, the overlap between the target dataset and the HapMap variants increases to 600,000 variants.

Do you think that is a considerable increase, from 400,000 to 600,000 variants, and would it make a difference in the polygenic score calculation?
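
As a quick way to count such overlaps, a small sketch comparing a PLINK .bim file against any candidate reference SNP list; the reference file name and its "rsid" column are hypothetical placeholders for, e.g., an export of bigsnpr's 1000G map:

```python
import pandas as pd

# Standard 6-column PLINK .bim layout: chromosome, SNP ID, genetic distance,
# position, allele 1, allele 2.
bim = pd.read_csv("target.bim", sep=r"\s+", header=None,
                  names=["CHR", "SNP", "CM", "BP", "A1", "A2"])

# Candidate reference SNP list; the file name and the "rsid" column are
# hypothetical -- adjust to however you exported the list.
ref = pd.read_csv("hm3_reference_snps.txt", sep=r"\s+")

n_overlap = bim["SNP"].isin(set(ref["rsid"])).sum()
print(f"{n_overlap:,} of {len(bim):,} target variants match the reference list")
```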

getian107 commented 2 years ago

The reference panel you referred to seems to be a combination of HapMap and UK Biobank variants. I think increasing from 400K to 600K variants is a fairly big increase and may improve prediction. Is your target dataset imputed? We usually see a larger overlap with HapMap variants for well-imputed datasets (unless the data come from a genotyping chip without good genome-wide coverage, such as the PsychChip).

rameez500 commented 2 years ago

Thanks again for your response. I am sorry for any confusion.

There are around 400K variants overlapping between the 1.2 million HapMap variants and the 12 million variants in the summary statistics.

There are around 1.1 million variants that overlap between the 1.2 million HapMap variants and the 5 million variants in the imputed target dataset.

I assume that the summary statistics are not well imputed. Do you think 400K variants is good enough for the polygenic score calculation?

getian107 commented 2 years ago

400K is a reasonable number to proceed with. The prediction will probably not be as good as it would be with more variants, but it's a reasonable starting point to see whether common variants have any predictive value for the disease/trait of interest.
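
Worth noting when interpreting these counts: PRS-CS estimates effects only for variants shared across all three inputs (summary statistics, LD reference, and the target .bim), so the operative number is the three-way overlap. A sketch under the same file-name assumptions as above:

```python
import pandas as pd

def snp_ids(path, col="SNP", **kwargs):
    """Read one whitespace-delimited file and return its set of SNP IDs."""
    return set(pd.read_csv(path, sep=r"\s+", **kwargs)[col])

ref = snp_ids("ldblk_1kg_eur/snpinfo_1kg_hm3")               # LD reference (assumed path)
sst = snp_ids("sumstats.txt")                                # GWAS summary statistics
bim = snp_ids("target.bim", header=None,
              names=["CHR", "SNP", "CM", "BP", "A1", "A2"])  # target PLINK .bim

print(f"sumstats & reference: {len(sst & ref):,}")
print(f"target & reference:   {len(bim & ref):,}")
print(f"three-way overlap:    {len(sst & ref & bim):,}")  # variants PRS-CS can model
```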