JonJala / mama

MIT License

Runtime for mama_ldscores.py #15

Closed samkleeman1 closed 3 years ago

samkleeman1 commented 3 years ago

I have been running the following script (with approx 1TB RAM), and I have seen no progress in more than 36 hours. Is this normal or am I doing something wrong?

Script:

python3 ./mama/mama_ldscores.py --ances-path "./iid_ances_file" \
                            --snp-ances "./snp_ances_file" \
                            --ld-wind-cm 1 \
                            --stream-stdout \
                            --bfile-merged-path "./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8" \
                            --out "ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld"

Output to date:

Calling ./mama_ldscore.py \
--out ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld \
--stream-stdout  \
--snp-ances ./snp_ances_file \
--ances-path ./iid_ances_file \
--bfile-merged-path ./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8 \
--ld-wind-cm 1.0 

Beginning to estimate cross-ancestry LD scores...
Read list of 69066 individuals from ./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8.fam
Read list of 9557064 SNPs from ./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8.bim
There are 0 SNPs in the merged .bim file without identified source of ancestry groups. These variants are dropped in the LD score estimation.
<><><<>><><><><><><><><><><><><><><><>
Read list of 9557065 SNPs to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.snplist
Read list of 9557065 SNPs to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.snplist.AFR
Read list of 9557065 SNPs to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.snplist.CSA
Read list of 9557065 SNPs to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.snplist.EAS
Read list of 9557065 SNPs to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.snplist.EUR
Read list of 7577 individuals to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.indlist.AFR
Read list of 9056 individuals to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.indlist.CSA
Read list of 2436 individuals to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.indlist.EAS
Read list of 50001 individuals to include from ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld.indlist.EUR

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Estimating LD Score.
Reading genotypes from ./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8.bed for LD estimation based on AFR-AFR
After filtering, 7576 individuals remain
After filtering, 9557059 SNPs remain
Begin calculating LD scores based on AFR-AFR
After filtering, 7576 individuals remain
After filtering, 9557059 SNPs remain
ggoldman1 commented 3 years ago

Have you instantiated the virtualenv?

ggoldman1 commented 3 years ago

> Have you instantiated the virtualenv?

Nevermind, I doubt this is the issue. I wonder if it's just a matter of data size (this is many orders of magnitude larger than what I've been using with 1kG). Would it be annoying to run it chromosome-by-chromosome, maybe starting with chr22 to make sure nothing weird is happening?

samkleeman1 commented 3 years ago

I am using the virtualenv as far as I know.

Yes, I imagine the sample sizes are way above what you have been using. Do I need to split by chromosome at the level of the PLINK files? I assume so. I am concerned that it is taking this long for the AFR ancestry group, as the EUR group is 5x bigger (I downsampled the EUR subjects in UKB to 50k; maybe I need to downsample further...)
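
A minimal sketch of one way to do that split, assuming PLINK 1.9 is on the PATH (--bfile, --chr, --make-bed, and --out are standard PLINK flags; the fileset name is taken from the command above):

import subprocess

BFILE = "ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8"
for chrom in range(1, 23):  # autosomes 1-22
    # Write a per-chromosome fileset, e.g. ..._maf0.01_info0.8_22
    subprocess.run(
        ["plink", "--bfile", BFILE, "--chr", str(chrom),
         "--make-bed", "--out", f"{BFILE}_{chrom}"],
        check=True)  # raise if PLINK exits non-zero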

samkleeman1 commented 3 years ago

I am now running in parallel across chromosomes; I will keep you posted.

ggoldman1 commented 3 years ago

Ok awesome, yeah, please let me know what happens. My hope is that for smaller chromosomes it should be relatively quick, but I'm curious to see what happens.

paturley commented 3 years ago

Great! Let us know how it goes. Just as a piece of possibly useful information: we were initially worried that a few hundred individuals in the 1kG data would be too small to get precise estimates of LD, but a bunch of simulations suggested that it's OK. So I suspect that if you down-sampled to even as small as a few thousand, you'd have more than enough data.

samkleeman1 commented 3 years ago

I am now getting this error. For reference, I gave each job 280GB of memory. Should I increase the memory allowance? Am I correct to assume that it cannot deal with this number of samples? What is the minimum sample size that each ancestry group can be downsampled to, assuming you want accurate LD structure for variants at MAF 0.01?

2021/05/09 10:27:56 PM Unable to allocate 38.2 GiB for an array with shape (7576, 677069) and data type float64
Traceback (most recent call last):
  File "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/legacy/mama_ldscore.py", line 641, in main_func
    mama_LD_scores = estimate_LD_score_MAMA(args)
  File "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/legacy/mama_ldscore.py", line 514, in estimate_LD_score_MAMA
    lN_mat, lN_df, M, M_5_50 = multi_ldScoreVarBlocks(args, ances_ind, ances_flag, ances_n, snp_index, ind_index, array_file, array_obj, array_snps)
  File "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/legacy/mama_ldscore.py", line 155, in multi_ldScoreVarBlocks
    pair_ldscore = pair_ldScoreVarBlocks(args, t, j, ances_ind, eff_ances_flag, ances_n, c, block_left, array_obj, array_file, n, array_snps, snplist, indlist, bootstrap=None)
  File "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/legacy/mama_ldscore.py", line 241, in pair_ldScoreVarBlocks
    (A_trans, B_trans) = ld.scale_trans(A, B, sing_ind, exp)
  File "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/legacy/mama_ldcalc.py", line 140, in scale_trans
    A_new = (A_geno - A_avg) * A_factor
MemoryError: Unable to allocate 38.2 GiB for an array with shape (7576, 677069) and data type float64
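
For scale, the failed allocation is exactly what a dense float64 matrix of that shape requires; a quick check in Python (arithmetic only, using the shape from the traceback):

rows, cols = 7576, 677069
bytes_needed = rows * cols * 8            # float64 = 8 bytes per element
print(f"{bytes_needed / 2**30:.1f} GiB")  # -> 38.2 GiB, as reported
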
ggoldman1 commented 3 years ago

I can't speak to the downsampling, but maybe you could try running it serially, each chromosome with the initial 1TB of RAM? I think most of it should run relatively quickly.

paturley commented 3 years ago

We didn't exhaustively test this, but we did find that the 300 individuals in each ancestry group from the 1kG data were enough to get precise estimates. If you wanted to be safe but still fast enough to run this in a reasonable amount of time, I bet you'd be very comfortable using 1,000 individuals.
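
A hypothetical sketch of such a down-sampling step, written against an assumed three-column iid_ances_file layout (FID, IID, ancestry - this layout is illustrative, not MAMA's documented spec); it writes a PLINK --keep file:

import pandas as pd

ances = pd.read_csv("iid_ances_file", sep=r"\s+", header=None,
                    names=["FID", "IID", "ANCESTRY"], dtype=str)
# Draw up to 1,000 individuals per ancestry group, reproducibly.
keep = (ances.groupby("ANCESTRY", group_keys=False)
             .apply(lambda g: g.sample(n=min(1000, len(g)), random_state=1)))
keep[["FID", "IID"]].to_csv("keep_1000_per_group.txt", sep=" ",
                            header=False, index=False)
# Then: plink --bfile <merged fileset> --keep keep_1000_per_group.txt \
#              --make-bed --out <downsampled fileset>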

samkleeman1 commented 3 years ago

To clarify, I was running each chromosome serially with 300GB of RAM, and I am getting error messages to the effect of:

2021/05/10 12:39:25 AM Unable to allocate 16.5 GiB for an array with shape (7576, 291792) and data type float64

This seems odd, considering that a lot more memory than this was available.

samkleeman1 commented 3 years ago

I have now been running with 1,000 samples in each ancestry group for the last 17 hours, split by chromosome, and there has been no progress so far; none of the scripts has gotten beyond the following output. Is this normal?

2021/05/10 05:09:43 PM 
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MAMA: Multi-Ancestry Meta-Analysis 
<> Version: 1.0.0
<> (C) 2020 Grant Goldman, Hui Li, Alicia Martin, Patrick Turley and Raymond Walters
<> Harvard University Department of Economics / Broad Institute of MIT and Harvard
<> GNU General Public License v3
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> Software-related correspondence: grantgoldman0@gmail.com or jjala.ssgac@gmail.com
<> All other correspondence: paturley@broadinstitute.org 
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Calling ./mama_ldscore.py \
--out ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22 \
--stream-stdout  \
--snp-ances ../snp_ances_file \
--ances-path ../iid_ances_file \
--bfile-merged-path ../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_22 \
--ld-wind-cm 1.0 

2021/05/10 05:09:43 PM Beginning to estimate cross-ancestry LD scores...
2021/05/10 05:09:43 PM Read list of 4000 individuals from ../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_22.fam
2021/05/10 05:09:44 PM Read list of 131525 SNPs from ../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_22.bim
2021/05/10 05:09:59 PM There are 0 SNPs in the merged .bim file without identified source of ancestry groups. These variants are dropped in the LD score estimation.
2021/05/10 05:09:59 PM <><><<>><><><><><><><><><><><><><><><>
2021/05/10 05:09:59 PM Read list of 131526 SNPs to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.snplist
2021/05/10 05:09:59 PM Read list of 131526 SNPs to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.snplist.AFR
2021/05/10 05:10:00 PM Read list of 131526 SNPs to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.snplist.CSA
2021/05/10 05:10:00 PM Read list of 131526 SNPs to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.snplist.EAS
2021/05/10 05:10:00 PM Read list of 131526 SNPs to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.snplist.EUR
2021/05/10 05:10:00 PM Read list of 1001 individuals to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.indlist.AFR
2021/05/10 05:10:00 PM Read list of 1001 individuals to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.indlist.CSA
2021/05/10 05:10:00 PM Read list of 1001 individuals to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.indlist.EAS
2021/05/10 05:10:00 PM Read list of 1001 individuals to include from ../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.indlist.EUR
2021/05/10 05:10:00 PM 
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
2021/05/10 05:10:00 PM Estimating LD Score.
2021/05/10 05:10:00 PM Reading genotypes from ../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_22.bed for LD estimation based on AFR-AFR
2021/05/10 05:10:12 PM After filtering, 1000 individuals remain
2021/05/10 05:10:14 PM After filtering, 130492 SNPs remain
2021/05/10 05:10:14 PM Begin calculating LD scores based on AFR-AFR
2021/05/10 05:10:25 PM After filtering, 1000 individuals remain
2021/05/10 05:10:27 PM After filtering, 130492 SNPs remain
ggoldman1 commented 3 years ago

It does seem pretty slow, but I'd let it hang for a bit longer. I'd be curious to see where the bottleneck is and if we can do anything to speed it up. I assume you haven't gotten any more print lines yet?

paturley commented 3 years ago

How long did it take for the 300 1kG samples you used, Grant? How many variants did you include?

ggoldman1 commented 3 years ago

It took me around an hour to run chromosome 22 of 1kG. 481 individuals/107,651 SNPs in EAS and 489 individuals/123,294 SNPs in EUR.

samkleeman1 commented 3 years ago

In that case we are seeing radically slower performance. I am not sure what the bottleneck would be.

ggoldman1 commented 3 years ago

Would it be possible to send me your chr22 AFR data? I'd like to try to reproduce the behavior.

samkleeman1 commented 3 years ago

That is a little tricky, as it's UK Biobank data - are you named on a Biobank application? I can show you how to generate the data, but it's actually very time-consuming and computationally expensive to extract the imputed data from the BGEN files and convert it into PLINK format.

ggoldman1 commented 3 years ago

Yeah that's true. I can try to generate comparably sized filesets locally and see what happens.

ggoldman1 commented 3 years ago

So I've found the issue -- the calls to mama_ldcalc.scale_trans() (i.e., line 241 on mainline mama_ldscore.py) are really slow. I'm a bit confused by this because we haven't touched that method in a long time, but in case you're curious, that's what is slowing everything down quite a bit.

I'm going to try to figure out a fix for this ASAP and get back to you. My apologies for the inconvenience.
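
Illustrative only, not the project's actual fix: the traceback line A_new = (A_geno - A_avg) * A_factor materializes full-size float64 temporaries, and a generic NumPy pattern for expressions like that is to work on a single explicit copy in place:

import numpy as np

def scale_trans_inplace(A_geno, A_avg, A_factor):
    # One working copy instead of the temporaries that
    # (A_geno - A_avg) * A_factor allocates.
    A = np.array(A_geno, dtype=np.float64)  # single float64 copy
    np.subtract(A, A_avg, out=A)            # center in place
    np.multiply(A, A_factor, out=A)         # scale in place
    return A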

paturley commented 3 years ago

Great! Thanks, Grant!

ggoldman1 commented 3 years ago

Hey @samkleeman1,

Just wanted to let you know we're still working on this. We really apologize for the trouble. Have there been any updates on your end?

samkleeman1 commented 3 years ago

No updates from my end, just waiting to hear about a fix


ggoldman1 commented 3 years ago

Hi Sam,

Just to keep you updated: we noticed some interesting behavior where a comparably sized sample from 1000 Genomes ran in a very reasonable time on chromosome 22 over three populations (12 minutes in total). I was able to replicate your slow runtime on a sample of UKB data we have internally. We're working to get to the bottom of this behavior. Thanks again for your patience.

Grant

samkleeman1 commented 3 years ago

Hi Grant,

Thanks so much for looking into this. We are very keen to use the package. If you are able to provide references covering EUR, AFR and CSA populations from UKB or 1kG, then we would not need to generate any references ourselves.

Sam


ggoldman1 commented 3 years ago

Of course! We're excited that you're interested. Either way we need to resolve this UKB issue. The reference I've been working with is AFR-EAS-EUR, corresponding to the simulation and empirical application in the paper. So I don't think this is exactly what you need. However, if you have 1kG data handy I'd be curious to see if the performance is more reasonable for you.

maxzylam commented 3 years ago

Hi Grant and Jon,

I am having similar issues with the ldscores computation as well.

I did notice that there is a NUMEXPR_MAX_THREADS option that I can't seem to find how to modify in the scripts (I was hoping to push the multi-threading to a larger number of threads to speed things up). Is there a --flag that could modify this parameter?

Right now the script detects the total number of cores I have (44), but then defaults to 8. If we had control over the multi-threading, it might help make the computation faster.

JonJala commented 3 years ago

Hi, Max -

When you say there's a NUMEXPR_MAX_THREADS option, what is that a part of?

Googling around, there appears to be an environment variable of that name that NumExpr (a Python extension library that pandas and NumPy can use) examines when determining the maximum number of threads... is that what you mean? If it's an environment variable, you could try just setting it before running, though I don't believe we explicitly pull in NumExpr at this time.

We haven't experimented with or tested multi-threaded options a ton yet, and definitely not to the level of releasing a change for everyone, but we'll try to work on some optimizations like that.

Also, when you say you are having some LD score computation issues, what data sources are you using? Currently it sounds like 1000G produces much faster results than UKB. Are you using either of those two, or are you using alternate sources for your data? It might help us pinpoint what differences in the data might be contributing to the problem.
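
For what it's worth, NUMEXPR_MAX_THREADS is a standard NumExpr environment variable that is read once when NumExpr is first imported, so it has to be set before the script starts; a minimal sketch (this is an environment variable, not a mama flag):

# From a shell, for a single run:
#   NUMEXPR_MAX_THREADS=44 python3 ./mama/mama_ldscores.py ...
# Or from Python, before anything imports NumExpr:
import os
os.environ["NUMEXPR_MAX_THREADS"] = "44"
import pandas as pd  # pandas can pull in NumExpr, which now sees the cap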

Thanks,

kapoormanav commented 3 years ago

Hi Grant,

I had the same performance issue with SAS and EUR data. With nearly 800 samples, the program took nearly a full day to calculate the LD scores for chromosome 22. For test purposes I tried GCTA's LD score implementation (https://cnsgenomics.com/software/gcta/#ComputingLDscores) with 5 threads, and it took 1 second to complete the score calculation for chromosome 22 with the same number of subjects. The scores were comparable across MAMA and GCTA. Could you implement the LD score calculations via GCTA, so that MAMA can call it if the user is interested in fast calculation of LD scores?

I had one other minor issue with MAMA: the program gave an error when the IIDs in the .fam files were numeric, as it was not able to match IIDs to the subject index file. When I made the IIDs alphanumeric by adding an alphabetic prefix, the program worked fine.

Thanks,

Manav
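
A hypothetical sketch of that IID workaround, assuming a standard six-column .fam file (the "id_" prefix and file names are illustrative):

import pandas as pd

fam = pd.read_csv("data.fam", sep=r"\s+", header=None, dtype=str)
fam[1] = "id_" + fam[1]  # column 2 of a .fam file is the within-family IID
fam.to_csv("data_prefixed.fam", sep=" ", header=False, index=False)
# The IID column of the matching ancestry file must be prefixed identically.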

ggoldman1 commented 3 years ago

Hi @kapoormanav @samkleeman1

So I think the issue causing this slow runtime is somewhat trivial and mostly just an oversight: you are specifying your window size (i.e., one of the --ld-wind-* flags) in a unit that isn't present in the .bim file. For example, my UKB data doesn't report SNP positions in centimorgans, so specifying --ld-wind-cm 1 will cause the runtime issue (and, I assume, produce incorrect results).

First, can you confirm that the third column in your .bim file is all 0's and that you are specifying the window size in centimorgans? If so, you can try to fill in the missing cM data (it seems from https://www.biostars.org/p/52820/ that you can get this data from the UCSC Genome Browser), or, alternatively, approximate a 1 cM window with 1000 kilobases (use the --ld-wind-kb 1000 flag). Obviously the latter will only work if you have the bp positions in your .bim file.
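
A quick way to check the first condition, assuming the standard six-column .bim layout (CHR, SNP, cM, BP, A1, A2):

import pandas as pd

bim = pd.read_csv("ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8.bim", sep=r"\s+",
                  header=None, names=["CHR", "SNP", "CM", "BP", "A1", "A2"])
if (bim["CM"] == 0).all():
    print("cM column is all zeros: fill it in, or use --ld-wind-kb 1000")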

I think this should take care of it -- once you've implemented this, could you post here to let us know if it's resolved? Also, @kapoormanav, I will open a PR for the issue you mentioned with merging the tables together. Thanks to you and @samkleeman1 for raising these; we appreciate it!

kapoormanav commented 3 years ago

Hi Grant,

Yes, the cM column is set to 0 in my .bim file. Thanks for the link for updating the cM positions. The test calculations on chromosomes 21 and 22 with the updated positions ran really fast.

Thanks a lot!

ggoldman1 commented 3 years ago

Thanks!