bulik / ldsc

LD Score Regression (LDSC)
GNU General Public License v3.0

FIXED - Failure to convert summary statistics to .sumstats format: munge_sumstats is taking hours #145

Open alesss78 opened 5 years ago

alesss78 commented 5 years ago

I am trying to reproduce the example provided in: https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation

In particular, I downloaded both the summary statistics file: wget www.med.unc.edu/pgc/files/resultfiles/pgc.cross.bip.zip and the list of SNPs: wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz

I unzipped both files and then used munge_sumstats.py to start the file conversion as follows: python //munge_sumstats.py --sumstats pgc.cross.BIP11.2013-05.txt --N 17115 --out scz --merge-alleles w_hm3.snplist
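Putting the above together, the full set of commands was roughly the following (the unzip/bunzip2 calls are from memory, and I am assuming munge_sumstats.py sits in the current directory, so treat this as a sketch rather than an exact transcript):

```bash
# Download the PGC cross-disorder bipolar summary statistics and the HapMap3 SNP list
wget www.med.unc.edu/pgc/files/resultfiles/pgc.cross.bip.zip
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz

# Extract both archives (assumed extraction commands)
unzip pgc.cross.bip.zip
bunzip2 w_hm3.snplist.bz

# Convert the summary statistics to .sumstats format
# (assuming munge_sumstats.py is in the current working directory)
python munge_sumstats.py \
    --sumstats pgc.cross.BIP11.2013-05.txt \
    --N 17115 \
    --out scz \
    --merge-alleles w_hm3.snplist
```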

I obtain the following output, which seems correct:

Call:
./munge_sumstats.py \
--out scz \
--merge-alleles w_hm3.snplist \
--N 17115.0 \
--sumstats pgc.cross.BIP11.2013-05.txt

Interpreting column names as follows:
info: INFO score (imputation quality; higher --> better imputation)
snpid: Variant ID (e.g., rs number)
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: Allele 2, interpreted as non-ref allele for signed sumstat.
or: Odds ratio (1 --> no effect; above 1 --> A1 is risk increasing)

Reading list of SNPs for allele merge from w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from pgc.cross.BIP11.2013-05.txt into memory 5000000 SNPs at a time.

The program then gets stuck after this. It uses 100% of one processor and only a few gigabytes of RAM. The tutorial says this conversion should take about 20 seconds, but I waited for about an hour and the conversion still hadn't finished.

Any hints on why the process is so slow? Any help would be appreciated. Thank you.

alesss78 commented 5 years ago

EDIT: I managed to get munge_sumstats.py to complete the summary statistic conversion by reducing the chunk size: by default, chunksize = 5000000. I reduced it to 500000 by adding the option --chunksize 500000. It worked as intended.
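For completeness, here is a sketch of the full command with the reduced chunk size (same file names and sample size as in my original post; adjust the path to munge_sumstats.py as needed):

```bash
# Same conversion as before, but reading the sumstats file in chunks of 500000 SNPs
python munge_sumstats.py \
    --sumstats pgc.cross.BIP11.2013-05.txt \
    --N 17115 \
    --out scz \
    --merge-alleles w_hm3.snplist \
    --chunksize 500000
```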

ttuowang commented 5 years ago

I encountered the same problem. Thanks for your solution, it saved me a lot of time.

YaoXueming commented 5 years ago

wow, it's really nice, thank you so much!

giuseppe-fanelli commented 4 years ago

thanks a lot

privefl commented 4 years ago

I went from 2 days to 1 minute with this option?!

xsun1229 commented 4 years ago

> EDIT: I managed to get munge_sumstats.py to complete the summary statistic conversion by reducing the chunk size: by default, chunksize = 5000000. I reduced it to 500000 by adding the option --chunksize 500000. It worked as intended.

Great, thanks!

maryellenlynall commented 4 years ago

thanks!

ptn24 commented 3 years ago

+1

vkp3 commented 2 years ago

Incredible tip: hours and hours -> 1m 16s. Thanks!

alicebraun commented 2 years ago

thanks so much for the useful hint! it works fine now :)

What-Ccat commented 1 year ago

Really solved the problem, thank you so much!

wgmao commented 9 months ago

Thanks from 2024!