bvilhjal / ldpred

MIT License

RUN LDpred chromosome by chromosome #13

Closed wavefancy closed 6 years ago

wavefancy commented 6 years ago

Hi,

I have modified LDpred so that it can be run in parallel, chromosome by chromosome. This saves waiting time and significantly reduces the memory needed. I also benchmarked the results, and it produces the same output. Is there a way I can submit my modification? It may help others.

Here is my code: https://github.com/wavefancy/WallaceBroad/tree/master/python/LDPred.CHRbyCHR

Wallace

biostat0903 commented 6 years ago

Hi, I also want to run the analysis chromosome by chromosome. Do you have any comments on this problem?

carbocation commented 6 years ago

To extend this idea, since LDpred runs within defined windows of adjacent SNPs, wouldn't it be reasonable to chunk the genome into sub-chromosome partitions for the purpose of parallelizing this tool? Or am I missing some place in the code that assesses more distant linkage?
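
To make the idea concrete, here is a minimal sketch of what such partitioning might look like, assuming LD is only assessed within a fixed radius of neighboring SNPs. Each chunk carries flanking SNPs on both sides so that SNPs near a chunk boundary still see their full window. The function name `chunk_with_flanks` and its parameters are hypothetical and not part of LDpred.

```python
def chunk_with_flanks(n_snps, chunk_size, radius):
    """Split SNP indices 0..n_snps-1 into consecutive core chunks.

    Returns a list of (core_start, core_end, ext_start, ext_end) tuples using
    half-open ranges. The extended range adds up to `radius` flanking SNPs on
    each side of the core, clipped to the chromosome boundaries.
    """
    if chunk_size <= 0 or radius < 0:
        raise ValueError("chunk_size must be positive and radius non-negative")
    chunks = []
    for core_start in range(0, n_snps, chunk_size):
        core_end = min(core_start + chunk_size, n_snps)
        ext_start = max(0, core_start - radius)
        ext_end = min(n_snps, core_end + radius)
        chunks.append((core_start, core_end, ext_start, ext_end))
    return chunks


if __name__ == "__main__":
    # Hypothetical toy sizes, chosen only to keep the output short.
    for chunk in chunk_with_flanks(n_snps=10, chunk_size=4, radius=2):
        print(chunk)
```

Whether results from chunks like these match a whole-chromosome run would still need to be benchmarked.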

bvilhjal commented 6 years ago

Hi Wallace and colleagues,

I apologise for my extreme lack of response, but I left science (for now) and haven’t maintained LDpred (or really responded to LDpred comments/emails) for the last couple of years. I have, however, now decided to try to become a better person 😊 (wrt replying to LDpred comments/emails).

To do what you want to do, you can make a "pull request" (here on github), which I'll then accept (if it makes sense).

Best, Bjarni

wavefancy commented 6 years ago

Currently, I have broken LDpred down into two steps: 1) The first step estimates the LD structure and also gathers the data needed to estimate the genome-wide inflation factor. This is done chr by chr. 2) The next step reweights the betas, chr by chr, using the summary data from step one across all chromosomes.
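
The two-step split can be sketched roughly as follows. This is an illustration only, not the code linked in this thread: the names `ChromSummary`, `summarize_chromosome`, and `genome_wide_estimates` are hypothetical, the input numbers are made up, and the LD-score-style formulas in the docstring are an assumption chosen to show why per-chromosome totals have to be pooled before anything is estimated.

```python
from dataclasses import dataclass


@dataclass
class ChromSummary:
    """Per-chromosome totals gathered in step 1."""
    n_snps: int
    sum_beta_sq: float   # sum of squared marginal effect estimates
    sum_ld_score: float  # sum of per-SNP LD scores


def summarize_chromosome(betas, ld_scores):
    """Step 1: run once per chromosome, possibly in parallel."""
    if len(betas) != len(ld_scores):
        raise ValueError("betas and ld_scores must have the same length")
    return ChromSummary(len(betas), sum(b * b for b in betas), sum(ld_scores))


def genome_wide_estimates(summaries, n_samples):
    """Pool the per-chromosome totals into genome-wide quantities.

    Assumed formulas (illustrative only):
      mean_chi2 = n_samples * mean(beta^2)
      h2 = (max(1, mean_chi2) - 1) / (n_samples * avg_ld_score / n_snps)
    """
    n_snps = sum(s.n_snps for s in summaries)
    mean_chi2 = n_samples * sum(s.sum_beta_sq for s in summaries) / n_snps
    avg_ld_score = sum(s.sum_ld_score for s in summaries) / n_snps
    h2 = (max(1.0, mean_chi2) - 1.0) / (n_samples * avg_ld_score / n_snps)
    return mean_chi2, avg_ld_score, h2


if __name__ == "__main__":
    # Made-up toy values for two chromosomes.
    summaries = [
        summarize_chromosome([0.1, 0.2], [10.0, 20.0]),
        summarize_chromosome([0.3], [30.0]),
    ]
    print(genome_wide_estimates(summaries, n_samples=100))
```

Step 2 would then hand the same pooled values to every per-chromosome reweighting job. Estimating them from a single chromosome's totals would generally give a different, noisier value than the pooled estimate.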

One minor issue: I changed scipy.linalg.pinv to numpy.linalg.pinv, which I found to be more stable with my version of Python, though this is not guaranteed in all situations.
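
For context, here is a small sketch of what the pseudo-inverse does, assuming NumPy is installed. The toy LD matrix is made up: two SNPs in perfect LD make it singular, so an ordinary inverse does not exist, while `numpy.linalg.pinv` still returns a matrix satisfying the Moore-Penrose conditions. The explicit `rcond` cutoff is a choice made for this example. The sketch does not reproduce the stability difference described above.

```python
import numpy as np

# Made-up 3-SNP LD (correlation) matrix. SNPs 0 and 1 are in perfect LD,
# so the matrix has rank 2 and cannot be inverted directly.
D = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

# Singular values below rcond * (largest singular value) are treated as zero.
D_pinv = np.linalg.pinv(D, rcond=1e-10)

# Two of the Moore-Penrose conditions that define the pseudo-inverse.
print(np.linalg.matrix_rank(D))
print(np.allclose(D @ D_pinv @ D, D))
print(np.allclose(D_pinv @ D @ D_pinv, D_pinv))
```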

I think it would be better to fork your source and write detailed tutorials on how to run chr by chr, to avoid contaminating your code base.

Best, Wallace

nurfatimaj commented 6 years ago

Hello! Apart from the technical side of the issue, I am also interested in the theoretical part. Currently, my team needs to compute PRS, and doing it chromosome by chromosome is the most computationally feasible way. However, I have found some articles claiming that LD can span across chromosomes, in which case a chromosome-by-chromosome approach might be too restrictive. I was hoping you could share your opinions about this trade-off. Thank you!

biostat0903 commented 6 years ago

I always ignore the LD structure between different chromosomes, so the computational burden is low. When I run LDpred chromosome by chromosome, many chromosomes fail because the genome inflation factor is smaller than 1.

wavefancy commented 6 years ago

Hi Sheng,

You cannot run LDpred chr by chr directly just by splitting the input data. You need to estimate the heritability and inflation factor globally, which requires modifying the code. I will make my code and tutorial publicly available in a day or two.

Best regards Wallace

biostat0903 commented 6 years ago

Many thanks for your kind reply and help.

Sheng Yang, Ph.D., Postdoctoral Fellow, Center for Statistical Genetics, Department of Biostatistics, University of Michigan; Department of Biostatistics, Nanjing Medical University

wavefancy commented 6 years ago

Here's the tutorial and my code for running LDpred chr by chr: https://github.com/wavefancy/WallaceBroad/tree/master/python/LDPred.CHRbyCHR

nurfatimaj commented 6 years ago

Amazing! Thanks a lot! :)

bvilhjal commented 6 years ago

Thanks @wavefancy, I really appreciate it! I have finally found some time to work on LDpred, and will likely spend some time on making it more suitable for large datasets.

bvilhjal commented 6 years ago

Dear @labuve, yes, one should ignore LD between chromosomes. If there is substantial LD between chromosomes, it suggests population or family structure, and the dataset would therefore likely not give a good estimate of local LD (which is what LDpred uses it for).
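
As a rough illustration of how one might check this, the sketch below computes the squared Pearson correlation (r²) between the genotype vectors of two SNPs, for example one from each of two chromosomes. The genotype vectors are made up, and the helper `r_squared` is hypothetical rather than an LDpred function.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coded)."""
    n = len(g1)
    if n != len(g2) or n < 2:
        raise ValueError("genotype vectors must have equal length of at least 2")
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return (cov * cov) / (var1 * var2)


if __name__ == "__main__":
    # Made-up genotypes for the same individuals at two SNPs
    # located on different chromosomes.
    snp_on_chr1 = [0, 1, 2, 0, 1, 2, 0, 1]
    snp_on_chr2 = [0, 2, 1, 1, 0, 2, 1, 0]
    print(r_squared(snp_on_chr1, snp_on_chr2))
```

Values near zero across many such unlinked pairs are what one would expect in a homogeneous sample of unrelated individuals. Consistently elevated values would point to the kind of structure mentioned above.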

uqzqiao commented 6 years ago

Hi wavefancy,

Thanks for your contribution! While using your code to run LDpred chr by chr, I ran into the following issue, which I don't know how to deal with. May I ask for your help with this? Thanks in advance! The issue is in step 2. The log file (taking chr1 as an example) is:

"Note: For maximal accuracy all SNPs with LDpred weights should be included in the validation data set. If they are a subset of the validation data set, then we suggest recalculate LDpred for the overlapping SNPs.

    Calculating LD information w. radius 25934
    Working on chrom_1 149974 503
    Done calculating the LD table and LD score, writing to file: test_ldpred_chr1.gz
    Genome-wide average LD score was: 42.6342343766
    Traceback (most recent call last):
      File "/home/LDpred2/ldpred/LDpred.getLocalLDFile.CHR.Wallace.V1.py", line 634, in <module>
        main()
      File "/home/LDpred2/ldpred/LDpred.getLocalLDFile.CHR.Wallace.V1.py", line 612, in main
        cPickle.dump(ld_dict, f, protocol=2)
    SystemError: error return without exception set
"

I still get the .gz and *_byFileCache.txt files, but I am not sure whether they can be used in the following steps. Thanks!

wavefancy commented 6 years ago

Hi,

I have never seen this error before. Looking at the code, though, the computation seems to be finished at this step, and the problem is in writing the results to file. Please check for output-file issues such as the file name, the path, or a full disk.

You can also test your Python setup with the code below.

    import gzip
    import cPickle  # Python 2 module, matching the traceback above

    local_ld_dict_file = "test"
    f = gzip.open(local_ld_dict_file, 'wb')
    ld_dict = {1: 2}
    cPickle.dump(ld_dict, f, protocol=2)
    f.close()
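
The test above targets Python 2 (`cPickle`). As a sketch only, here is a Python 3 variant of the same check that also reads the file back. It writes the dictionary as a stream of small records rather than as one large object. Whether object size has anything to do with the error reported above is an assumption, since the thread does not establish the cause. `dump_records` and `load_records` are hypothetical helper names, and the dictionary contents are made up.

```python
import gzip
import os
import pickle
import tempfile


def dump_records(ld_dict, path):
    """Write each (key, value) pair as its own pickle record."""
    with gzip.open(path, "wb") as f:
        for key, value in ld_dict.items():
            pickle.dump((key, value), f, protocol=2)


def load_records(path):
    """Rebuild the dictionary by reading records until end of file."""
    result = {}
    with gzip.open(path, "rb") as f:
        while True:
            try:
                key, value = pickle.load(f)
            except EOFError:
                break
            result[key] = value
    return result


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "test_ld.pickled.gz")
    ld_dict = {1: 2, "chrom_1": [0.1, 0.2, 0.3]}
    dump_records(ld_dict, path)
    # True if the round trip preserved the dictionary.
    print(load_records(path) == ld_dict)
```

A file written this way cannot be read with a single `pickle.load` call, so any downstream step would need the matching reader.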

Best regards Wallace

uqzqiao commented 6 years ago

Hi Wallace,

Thank you very much for your quick reply! I've tried everything, but I still can't solve it. The test code works fine. I'll get back to this issue later when I have more time. Thanks again!

Best wishes, Jenny