Open · jlw-ecoevo opened this issue 2 years ago

Is there a way to set a memory cap for METABOLIC? Is it sufficient just to reduce the number of cores used? (I was running with 40 cores and it ate up 1 TB of RAM, which was unexpected and a little unkind to my lab mates.)
We don't have an option to control memory usage. Do you know in which step METABOLIC uses so much memory? I might be able to adjust it.
Hi! I believe it was in both the KEGG Ortholog and dbCAN steps (I ended up cancelling the job during the dbCAN step).
Hi! Those two steps both involve hmmsearch and hmmscan. I looked into it: since HMMER 4 is still in development, it seems that's how it is for now. Maybe reducing the CPU thread number is the only way.
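For illustration, here is a minimal sketch of what a thread-capped call looks like. `--cpu` and `--tblout` are standard hmmsearch (HMMER 3.x) options; the database and FASTA file names below are placeholders, not METABOLIC's actual intermediate files:

```perl
#!/usr/bin/perl
# Minimal sketch: run hmmsearch with a reduced thread count.
# "kofam.hmm" and "genome.faa" are placeholder file names.
use strict;
use warnings;

my $threads = 4;   # fewer worker threads => fewer concurrent allocations
my $hmm_db  = "kofam.hmm";
my $faa     = "genome.faa";

# --cpu and --tblout are standard hmmsearch (HMMER 3.x) options.
system("hmmsearch", "--cpu", $threads, "--tblout", "hits.tbl",
       $hmm_db, $faa) == 0
    or die "hmmsearch failed: $?";
```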
Ah ok, thanks - it might be worth warning users about this in the docs - I wasn't expecting the memory load to be so high.
Sure. I will add this info to the GitHub docs.
Also - are you 100% sure it's hmmsearch's fault? The first hmmsearch step seems to run without any issues ("The hmmsearch is running with 40 cpu threads...") and each hmmsearch process takes very little RAM (I've never run into memory issues with hmmsearch and have often run it on many more cores for big jobs). The processes that seem to be taking up a lot of RAM are Perl processes in the later KEGG/dbCAN steps, after the hmmsearch step has finished. For reference, this is running METABOLIC-G.pl on a folder with amino-acid sequences from around 9k genomes. Thanks!
Hi, after hmmsearch we use a hash to store all the hit names for each genome. If the number of input genomes is very large, the hash will be very large too. Maybe this is the reason. By the way, it will take a very long time to process 9k genomes; I suggest dividing them into batches of about 2,000 genomes each.
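For reference, a minimal sketch of that batching workaround, assuming one protein FASTA (`.faa`) per genome in a single input folder. The folder layout and the METABOLIC-G.pl invocation in the trailing comment are assumptions about a typical setup (check the README for exact flags):

```perl
#!/usr/bin/perl
# Minimal sketch: split a large input folder into batches of ~2,000
# genomes, then run METABOLIC-G.pl on each batch separately.
use strict;
use warnings;
use File::Copy qw(move);
use File::Path qw(make_path);

my $in_dir     = "all_genomes";   # assumed: one .faa file per genome
my $batch_size = 2000;

opendir(my $dh, $in_dir) or die "Cannot open $in_dir: $!";
my @faa = sort grep { /\.faa$/ } readdir($dh);
closedir($dh);

my $batch = 0;
while (my @chunk = splice(@faa, 0, $batch_size)) {
    $batch++;
    my $out = sprintf("batch_%02d", $batch);
    make_path($out);
    move("$in_dir/$_", "$out/$_") for @chunk;
    # then run METABOLIC on each batch, e.g.:
    # perl METABOLIC-G.pl -in-gn batch_01 -t 20 -o batch_01_out
}
```

Batching should also keep the per-run hash of hit names bounded, since it only ever holds the results for one batch at a time.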