choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

Question about binary-target and clumping parameter #161

Closed ChongWu-Biostat closed 4 years ago

ChongWu-Biostat commented 4 years ago

I have one question about --binary-target. If the effect sizes provided by the GWAS summary data are on the log(OR) scale, do we need to set --binary-target F?

For clump-p: if we set --clump-p 0.1, does that mean we only consider SNPs with a p-value less than 0.1? I found that using all SNPs is extremely slow for the UK Biobank data, and I wonder whether setting --clump-p 0.1 is reasonable.

Thank you for your help!

Thanks, Chong

choishingwan commented 4 years ago
  1. Binary target refers explicitly to the phenotype of your target sample. It has nothing to do with the summary data.
  2. Yes, using --clump-p 0.1 will remove any SNP with a p-value larger than 0.1. How slow is it? With the latest version of PRSice (e.g. 2.2.x), clumping on the UK Biobank genotype data should take around 30~40 minutes depending on your machine (e.g. server load). The bgen format, on the other hand, will definitely take significantly longer to complete.
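Conceptually, --clump-p acts as a pre-filter on the summary statistics before clumping; a minimal Python sketch of that filtering step (the dictionary layout and column names here are illustrative, not PRSice internals):

```python
# Hypothetical summary statistics; --clump-p 0.1 removes any SNP whose
# p-value is larger than the threshold before clumping begins.
summary_stats = [
    {"snp": "rs1", "p": 0.05},
    {"snp": "rs2", "p": 0.5},
    {"snp": "rs3", "p": 0.001},
]

def clump_p_filter(stats, threshold=0.1):
    """Keep only SNPs whose p-value does not exceed the --clump-p threshold."""
    return [s for s in stats if s["p"] <= threshold]

kept = clump_p_filter(summary_stats)
print([s["snp"] for s in kept])  # rs1 and rs3 survive; rs2 is dropped
```

Fewer SNPs entering clumping means fewer pairwise LD computations, which is why a stricter threshold speeds things up.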
ChongWu-Biostat commented 4 years ago

For question 1, what's the difference between a binary target and a continuous target? Do we use the same formula, \sum_{i=1}^{p} SNP_i * beta_i, to calculate the PRS?

For question 2, it takes more than 4 hours and has only finished 10% of the clumping. I am using a server; I will try again to check whether that is just because the server is very old.

Thank you for your help.

Thanks, Chong

choishingwan commented 4 years ago

--binary-target determines whether we perform linear regression or logistic regression. The calculation of the PRS is always the same.
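In other words, the score itself is the same weighted sum \sum_{i=1}^{p} SNP_i * beta_i in both cases; only the regression used to evaluate it changes. A minimal sketch with toy numbers (not PRSice code):

```python
# PRS = sum over SNPs of (allele dosage * effect size). This is identical
# for binary and continuous phenotypes; --binary-target only switches the
# model (logistic vs linear regression) used to evaluate the score.
def polygenic_score(dosages, betas):
    assert len(dosages) == len(betas)
    return sum(d * b for d, b in zip(dosages, betas))

# Toy example: 3 SNPs, dosages in {0, 1, 2}, betas from summary statistics.
score = polygenic_score([0, 1, 2], [0.5, -0.25, 0.25])
print(score)  # 0*0.5 + 1*(-0.25) + 2*0.25 = 0.25
```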

Definitely check whether the server is overloaded, especially the I/O. For genotype (bed) files, it shouldn't take that long.

ChongWu-Biostat commented 4 years ago

I tried it. It runs smoothly. I have one more question: How can we get the coefficient used for constructing the PRS?

Thank you for your help.

Thanks, Chong

choishingwan commented 4 years ago

It's from your summary statistics.

You can use the --print-snp flag to print out the post-clump SNPs and then filter them by p-value. The beta used will be printed alongside.
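For example, if the --print-snp output is a whitespace-delimited table with SNP, P and BETA columns (the column names here are assumed for illustration; check the header of your own output file), the post-clump coefficients at a given threshold could be extracted like this:

```python
import io

# Stand-in for the --print-snp output file; real PRSice output columns
# may differ, so read the header rather than hard-coding positions.
snp_file = io.StringIO(
    "SNP P BETA\n"
    "rs1 0.001 0.12\n"
    "rs2 0.20 -0.05\n"
    "rs3 0.04 0.30\n"
)

header = snp_file.readline().split()
p_idx, b_idx = header.index("P"), header.index("BETA")

# Keep SNPs passing the p-value threshold; BETA is the weight used in the PRS.
threshold = 0.05
weights = {}
for line in snp_file:
    fields = line.split()
    if float(fields[p_idx]) <= threshold:
        weights[fields[0]] = float(fields[b_idx])

print(weights)  # rs1 and rs3 pass at p <= 0.05
```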


-- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

ChongWu-Biostat commented 4 years ago

Got it. Thank you for your help.

oyhel commented 4 years ago

I am facing the same issue with PRSice speed. After 2 hours I am looking at below 1% clumping progress, so it seems this is going to take ages. The process has 32 threads and 100 GB RAM, and I/O read is approx. 100 MB/s. Is this expected, or am I missing something fundamental here? @ChongWu-Biostat did you find the problem causing your job to stall?

ChongWu-Biostat commented 4 years ago

I guess this is related to the LD matrix. If you use the 1000 Genomes data as the LD reference, you will not have this problem. Otherwise, you can try several times; sometimes it works, sometimes it may not.

choishingwan commented 4 years ago

It depends on the data. Are you using bgen data? How many SNPs and samples are you working with?

Sam

oyhel commented 4 years ago

I am working with binary plink files, as I read that bgen could be slow. I have approximately 98k samples in the bedsets going into the analyses, but I am keeping only 30k. I have approximately 3M markers after info-score filtering at 0.9. I am currently trying to reduce the number of samples in the bedsets going into the analyses to see if that speeds things up. Is the LD calculation performed before filtering by the --keep flag?

choishingwan commented 4 years ago

No, it is done after. But during clumping, the genotype of each SNP has to be read from file, which makes the process I/O heavy.

Usually, with 500K samples and 300K SNPs, PRSice can finish clumping in around 10 minutes.

If you don't mind, you can try out a build that I am currently working on https://www.dropbox.com/s/s8ycohvlqcrkj6s/PRSice_linux?dl=0

This has two new features:

  1. Multi-threaded clumping - clumping is performed on each chromosome separately.
  2. Reduced file reads - genotypes are stored in memory as we go, which reduces the time spent reading the file. However, this increases the memory consumption of PRSice.
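The per-chromosome strategy in point 1 can be sketched with a standard thread pool; `clump_chromosome` below is a toy stand-in (most-significant SNP per LD block), not PRSice's actual clumping algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for clumping one chromosome: keep the most significant
# SNP within each "LD block" (here, SNPs sharing the same block id).
def clump_chromosome(snps):
    best = {}
    for name, block, p in snps:
        if block not in best or p < best[block][2]:
            best[block] = (name, block, p)
    return [s[0] for s in best.values()]

# Chromosomes are independent, so each can be clumped on its own thread;
# with 22 human autosomes, at most 22 threads do useful work.
by_chrom = {
    1: [("rs1", "b1", 0.01), ("rs2", "b1", 0.5)],
    2: [("rs3", "b2", 0.2), ("rs4", "b2", 0.03)],
}
with ThreadPoolExecutor() as pool:
    results = dict(zip(by_chrom, pool.map(clump_chromosome, by_chrom.values())))
print(results)  # {1: ['rs1'], 2: ['rs4']}
```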

Please let me know if this helps.

Sam


oyhel commented 4 years ago

Thank you for the info (and the fast reply!). Reducing the size of the input data noticeably sped up the calculation, but it is still an estimated ~1 hour per percent, so a multi-threaded version would be very welcome. I'll try the new version and report back.

oyhel commented 4 years ago

Wow, this was something else! It blasts through the clumping process in no time at all: what took days now takes about 3 minutes. I think storing the genotypes in memory is helping a lot, since the speed-up is far beyond the ~32x one would expect from multi-threading alone (I am using 32 cores). Great work!

choishingwan commented 4 years ago

Thanks! I am glad this works as expected.

I've found that clumping can sometimes require reading the same file 3~4 times, which causes a huge I/O load. When the process is I/O bound (which is usually the case for clumping), storing the data in memory drastically increases the speed of clumping. Also, because we simply do the clumping on each chromosome separately, we only really utilize 22 threads for human samples (22 autosomal chromosomes).
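The in-memory caching described above amounts to memoizing genotype reads, so repeated accesses during clumping hit memory instead of disk; a minimal sketch with a hypothetical read function (not PRSice code):

```python
# Clumping may request the same SNP's genotypes several times; caching
# trades memory for I/O, which is the bottleneck when clumping is I/O bound.
disk_reads = 0

def read_genotype_from_file(snp):
    """Stand-in for an expensive file read (hypothetical, not PRSice code)."""
    global disk_reads
    disk_reads += 1
    return [0, 1, 2]  # dummy dosages

cache = {}

def read_genotype(snp):
    if snp not in cache:
        cache[snp] = read_genotype_from_file(snp)
    return cache[snp]

# The same SNP is requested 3 times, but the file is only touched once.
for _ in range(3):
    read_genotype("rs1")
print(disk_reads)  # 1
```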

Glad it works as I anticipated. Good luck with your downstream analysis!