Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/

out of memory error in basic_check() #89

Open evakoe opened 5 months ago

evakoe commented 5 months ago

Dear all, I am running gwaslab v3.4.43 on a 4.1 GB METAL summary statistics file, on a SLURM computing server as a submitted job with 64 GB of memory. During the basic_check() function, I see the following in the log:

2024/04/18 11:53:52 Start to normalize indels...v3.4.43
2024/04/18 11:53:52  -Current Dataframe shape : 57872068 x 11 ; Memory usage: 4556.73 MB

Yet the process is then killed by SLURM (I think because it ran out of memory). How is this possible, given that I provided 64 GB of memory? Is there a way to set the maximum memory used by gwaslab? After all, I have far more than 4556.73 MB of RAM available. Or will gwaslab just take the memory it needs?

Thank you, and my apologies in case I missed this in the tutorial. Eva

Cloufield commented 5 months ago

Hi Eva,

I am wondering if you used any loops in your Python script? That often causes a memory leak, as described here. Or did you use multiple cores for this?

If not, I am wondering in what context the basic_check() call appears in your Python script. That would be helpful for troubleshooting. Thanks!
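For reference, here is a rough sketch of a looped workflow that releases memory explicitly between files. The file paths and the `fmt="metal"` argument are placeholders/assumptions, not your actual code:

```python
# Sketch of looping over several sumstats files while freeing memory between
# iterations. Paths and fmt="metal" are placeholders, not the user's real code.
import gc
import gwaslab as gl

files = ["meta_trait1.txt.gz", "meta_trait2.txt.gz"]  # hypothetical paths

for path in files:
    sumstats = gl.Sumstats(path, fmt="metal")
    sumstats.basic_check()
    # ... downstream formatting / export for this file ...
    del sumstats   # drop the reference to the large DataFrame
    gc.collect()   # force collection before loading the next file
```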

evakoe commented 5 months ago

Hi Yunye, indeed I did use a loop to iterate over multiple sumstats files, but I always overwrote my object. However, I tested with just one file (the largest) without the loop and I still get the out-of-memory error. My goal is to format my METAL meta-analysis summary stats so that they can be read by PheWeb. The main conversion I need is to align one allele to the reference, but I thought that running basic_check() first would be useful as a general check and to left-align and normalize indels, so this is currently my first step after loading the input file. Thank you.
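For reference, this is roughly the single-file pipeline I have in mind; the `fmt="metal"` and `ref_seq=` arguments are based on my reading of the gwaslab tutorial, and all paths are placeholders:

```python
# Rough sketch of the intended single-file pipeline (no loop): load the METAL
# output, run basic_check() to normalize indels, then align the effect allele
# to a reference genome FASTA before writing a table for PheWeb.
# fmt="metal", ref_seq= and all paths are placeholders/assumptions.
import gwaslab as gl

sumstats = gl.Sumstats("meta_analysis_1.txt.gz", fmt="metal")

# general QC, including left-alignment / normalization of indels
sumstats.basic_check()

# align alleles against a reference FASTA (placeholder path)
sumstats.harmonize(basic_check=False, ref_seq="hg38.fa")

# the processed table is a pandas DataFrame in .data
sumstats.data.to_csv("meta_analysis_1.for_pheweb.tsv.gz", sep="\t", index=False)
```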

Cloufield commented 5 months ago

Hi Eva, thanks for your feedback. I tested with a similar-sized dataset (50M rows, 3 GB) on a Mac with 16 GB RAM and on a PC under WSL with 32 GB RAM, and both worked well. However, while monitoring memory usage during the process I did see a sudden spike during variant normalization when the sumstats contain a large number of unnormalized indels. I will think about how to optimize this step. For now, a temporary workaround is simply to split the sumstats (in half, or per chromosome) to reduce memory usage, and then combine the pieces again after processing. Sorry for the inconvenience.
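A minimal sketch of that workaround, assuming the raw file can be read with pandas and that gl.Sumstats accepts a DataFrame; the chromosome column name ("Chromosome") and the paths are placeholders:

```python
# Split-and-recombine workaround: read the raw file with pandas, process each
# chromosome separately with gwaslab, then concatenate the checked tables.
# Column names and paths are placeholders to adapt to the actual file.
import pandas as pd
import gwaslab as gl

raw = pd.read_table("meta_analysis_1.txt.gz")

chunks = []
for chrom, chunk in raw.groupby("Chromosome"):
    s = gl.Sumstats(chunk, fmt="metal")
    s.basic_check()       # the normalization memory spike now applies per chunk
    chunks.append(s.data)
    del s                 # free the per-chromosome object before the next one

combined = pd.concat(chunks, ignore_index=True)
combined.to_csv("meta_analysis_1.checked.tsv.gz", sep="\t", index=False)
```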

evakoe commented 5 months ago

Thank you, I will try that and write again if this does not work.