Closed ChongWu-Biostat closed 4 years ago
For question 1, what's the difference between a binary target and a continuous target? Do we use the same formula, PRS = \sum_{i=1}^{p} SNP_i * beta_i, to calculate the PRS?
Thank you for your help.
Thanks, Chong
--binary-target determines whether we should perform linear regression or logistic regression. The calculation of the PRS is always the same.
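The weighted-sum formula above can be sketched as follows (a toy illustration with made-up numbers, assuming genotypes coded as 0/1/2 allele counts; PRSice applies clumping and p-value thresholding before this step):

```python
# Minimal sketch of PRS calculation: PRS_j = sum_i genotype_ij * beta_i.
# Toy data: rows = individuals, columns = SNPs, coded 0/1/2.
genotypes = [
    [0, 1, 2],
    [1, 1, 0],
]
betas = [0.10, -0.05, 0.20]  # effect sizes from the summary statistics

def prs(geno_row, betas):
    # Weighted sum of allele counts; identical for binary and
    # continuous targets -- only the regression step afterwards differs.
    return sum(g * b for g, b in zip(geno_row, betas))

scores = [prs(row, betas) for row in genotypes]
print(scores)  # approximately [0.35, 0.05]
```

The regression of the phenotype on these scores is where --binary-target matters: logistic for case/control, linear for quantitative traits.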
Definitely check whether the server is overloaded, especially with the I/O. For a genotype file, it shouldn't take so long.
I tried it. It runs smoothly. I have one more question: How can we get the coefficient used for constructing the PRS?
Thank you for your help.
Thanks, Chong
It's from your summary statistics.
You can use the --print-snp flag to print out the post-clump SNPs and then filter them by p-value. The beta used will be printed alongside.
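That post-processing step might look like this (a minimal sketch with hypothetical data; the actual column names in the --print-snp output depend on your summary statistics headers):

```python
import csv
import io

# Hypothetical excerpt of a --print-snp output file; real column
# names come from your summary statistics.
snp_file = io.StringIO(
    "SNP P BETA\n"
    "rs123 0.001 0.12\n"
    "rs456 0.20 -0.03\n"
    "rs789 0.04 0.07\n"
)

threshold = 0.05  # keep only SNPs passing this p-value threshold
reader = csv.DictReader(snp_file, delimiter=" ")
kept = [(row["SNP"], float(row["BETA"]))
        for row in reader
        if float(row["P"]) < threshold]
print(kept)  # [('rs123', 0.12), ('rs789', 0.07)]
```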
-- Dr Shing Wan Choi, Postdoctoral Fellow, Genetics and Genomic Sciences, Icahn School of Medicine, Mount Sinai, NYC
Got it. Thank you for your help.
I am facing the same issue with regard to PRSice speed. After 2 hours I am looking at below 1% clumping progress; it seems this is going to take ages. The process is given 32 threads and 100 GB RAM, and I/O read is approximately 100 MB/s. Is this as expected, or am I missing something fundamental here? @ChongWu-Biostat did you find the problem causing your job to stall?
I guess this is related to the LD matrix. If you use the 1000 Genomes data as the reference, it will not have this problem. Otherwise, you can try several times; sometimes it works, sometimes it may not.
It depends on the data. Are you using bgen data? How many SNPs and samples are you working on?
Sam
I am working with binary plink files, as I read bgen could be slow. I have approximately 98k samples in the bedsets going into the analyses, but I am keeping only 30k. I have approximately 3M markers after info score filtering at 0.9. I am currently trying to reduce the number of samples in the bedsets going into the analyses to see if that speeds things up. Is the LD calculation performed before the filtering by the --keep flag?
No, it is done after. But what happens is that:
For each SNP:
Usually, with 500K samples and 300K SNPs, PRSice can finish clumping in around 10 minutes.
If you don't mind, you can try out a build that I am currently working on https://www.dropbox.com/s/s8ycohvlqcrkj6s/PRSice_linux?dl=0
This has two new features:
Please let me know if this helps.
Sam
Thank you for the info (and the fast reply!). Reducing the size of the input data seemed to noticeably speed up the calculation, but it's still an estimated ~1 hour per percent, so a multi-threaded version would be very welcome. I'll try the new version and report back.
Wow, this was something else! It blasts through the clumping process in no time at all; what took days now takes about 3 minutes. I think storing the genotypes in memory is helping a lot, since the speed increase is way higher than the ~32x one would expect from multi-threading alone (I am using 32 cores). Great work!
Thanks! I am glad this works as expected.
I've found that clumping can sometimes require reading the same file 3-4 times, which causes a huge I/O load. When the process is I/O bound (which is usually the case for clumping), storing the data in memory drastically increases the speed of clumping. Also, because we simply do the clumping on each chromosome separately, we only really utilize 22 threads for human samples (22 autosomal chromosomes).
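The per-chromosome parallelism described above can be sketched like this (a toy illustration, not PRSice's actual code; `clump_chromosome` is a hypothetical stand-in for the real clumping work):

```python
from concurrent.futures import ThreadPoolExecutor

def clump_chromosome(chrom):
    # Stand-in for the real per-chromosome clumping work; with the
    # genotypes held in memory there is no repeated file I/O here.
    return (chrom, f"clumped chr{chrom}")

# One task per autosome: even with 32 workers available, at most
# 22 can ever be busy at once, which caps the effective speed-up
# regardless of how many cores are assigned to the job.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = dict(pool.map(clump_chromosome, range(1, 23)))

print(len(results))  # 22
```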
Glad that it works as I anticipated. Good luck with your downstream analysis!
I have one question about --binary-target. If the effect sizes provided by the GWAS summary data are on the log(OR) scale, do we need to set --binary-target F?
For --clump-p: if we set --clump-p 0.1, does it mean we only consider SNPs with a p-value less than 0.1? I found that using all SNPs is extremely slow for UK Biobank data and wonder if setting --clump-p 0.1 is reasonable.
Thank you for your help!
Thanks, Chong