ParkinsonLab / MetaPro


Memory problems with Detect enzyme annotation #25

Closed tmattes8 closed 2 months ago

tmattes8 commented 5 months ago

Noticed that Detect wasn't running correctly during enzyme annotation (Config file referred to Detect_2.2.9.py, which is now Detect_2.2.10.py).

Anyway, after I fixed that, Detect is now causing my HPC job to crash with memory problems. I throttled back to 1 concurrent job in the Config file, to no avail. I even tried the high-memory HPC nodes available to me, and the memory required exceeded even those!

If you look at detect_out.txt, it still launches 1400 jobs within that one allowed instance (see excerpt below):

2024-05-04 15:09:41.692664 1400 jobs launched! new name: MCS5673935.1 MCS5673935.1 name used: MCS5673935.1

I think these 1400 Detect jobs are causing the memory problem and crashing my pipeline (note that I was using 112 nodes up to this point to make DIAMOND work, which is quite a bit of memory already). It might be something about my dataset, but I am only running a single read pair of data (two FASTQ files at 2.5 GB each)!

I've looked at the Detect_2.2.10.py file and see some language in there about a "max_open_file_limit" and other memory-related settings (see below), but for some reason I did not trip any open-file limit or memory limit, despite telling it to use only 1 concurrent process in Config.ini.

Do you have any suggestions on how to back off the Detect job launching a little, according to the Python code below?

                if(process_counter > max_open_file_limit):
                    print(dt.today(), "limit of open jobs reached:" + str(max_open_file_limit) + "|closing before launching more.")
                    for item in process_list:
                        item.join()
                    process_list[:] = []
                    process_counter = 0

            else:
                #print(dt.today(), "mem limit reached")
                time.sleep(0.001)#DETECT_job_delay)

        else:
            #print(dt.today(), "process launch paused due to cpu limit")
            time.sleep(0.001)
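
For reference, the excerpt above batches process launches and, once max_open_file_limit is exceeded, joins every launched process before starting more. Below is a minimal, self-contained sketch of that batch-and-join pattern; it is not MetaPro's actual Detect code, and run_one_job, the job list, and the limit values are placeholders chosen for illustration. Lowering the batch limit (or raising the per-launch delay) is the obvious place to back off how many jobs hold memory at once.

    # Minimal sketch of the batch-and-join throttling pattern from the excerpt.
    # NOT MetaPro's Detect code: run_one_job and the limits are placeholders.
    import time
    from datetime import datetime as dt
    from multiprocessing import Process

    def run_one_job(name):
        # Stand-in for a single Detect annotation job.
        time.sleep(0.1)

    def launch_in_batches(job_names, max_open_file_limit=40, job_delay=0.001):
        process_list = []
        process_counter = 0
        for name in job_names:
            if process_counter > max_open_file_limit:
                # Once the batch limit is hit, wait for every launched process
                # to finish before starting more (same idea as the excerpt).
                print(dt.today(), "limit of open jobs reached:", max_open_file_limit)
                for item in process_list:
                    item.join()
                process_list[:] = []
                process_counter = 0
            p = Process(target=run_one_job, args=(name,))
            p.start()
            process_list.append(p)
            process_counter += 1
            time.sleep(job_delay)
        # Join whatever remains in the final, partial batch.
        for item in process_list:
            item.join()

    if __name__ == "__main__":
        launch_in_batches(["job_%d" % i for i in range(100)], max_open_file_limit=10)

Note that a limit like this counts launched processes per batch, not memory actually in use, so even a modest limit can exhaust a node if individual jobs are large.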
tmattes8 commented 4 months ago

Update: I managed to get this to work for my dataset. Strangely, it worked when I allowed 112 concurrent processes (all the cores I was using) by altering the Config.ini. It didn't work with the default 40 or fewer. That was counterintuitive, since I've had to dial back the DIAMOND jobs to 5 at a time due to memory issues. In the end, the Detect step alone took 72 hours to run, with over 293,000 jobs.

That seems to be the final obstacle I had to overcome to run a single read pair of data through the pipeline. I'm still not entirely satisfied with the taxa outputs in the final steps (they don't seem right compared to the output from a different pipeline), but I will comment on that in another issue raised by a different user.
