AnantharamanLab / METABOLIC

A scalable high-throughput metabolic and biogeochemical functional trait profiler

readline() on closed filehandle _IN at METABOLIC-C.pl line 1909. Cannot fork: Cannot allocate memory at /home/hyShen/miniconda3/envs/METABOLIC #193

Open hyShen-hzau opened 3 months ago

hyShen-hzau commented 3 months ago

Describe the bug
After successfully running the test, I cannot run the pipeline on my own data, and the same error is reported every time.

To Reproduce
Steps to reproduce the behavior:

  1. Run:

    perl METABOLIC-C.pl -in-gn /home/hyShen/Metabolic/Sharp/test1 -r /home/hyShen/Metabolic/test.txt -o result1 -t 60

  2. See the error in the log:

    [2024-08-08 22:21:13] The Prodigal annotation is running...
    [2024-08-09 00:12:11] The Prodigal annotation is finished
    readline() on closed filehandle _IN at METABOLIC-C.pl line 1909.
    [2024-08-09 00:23:36] The hmmsearch is running with 60 cpu threads...
    Cannot fork: Cannot allocate memory at /home/hyShen/miniconda3/envs/METABOLIC_v4.0/lib/perl5/site_perl/5.22.0/Parallel/ForkManager.pm line 52.
snpone commented 3 months ago

A potentially useful modification when system resources are low is to alter the code slightly. Using search and replace, you can change the original line, which looks like this:

    _run_parallel("$output/tmp_calculate_depth.sh", $i); `rm $output/tmp_calculate_depth.sh`;

to:

    system("bash $output/tmp_calculate_depth.sh"); `rm $output/tmp_calculate_depth.sh`;

This can greatly reduce system load: system("bash ...") executes the script's commands one after another in a single shell, instead of forking many of them in parallel as _run_parallel does.

Possible reason: I'm an experienced bioinformatics worker and have used Perl for a long time. This code is very well written, but the author may have overlooked something. The issue lies around line 2380 of the code, in this function:

sub _run_parallel{
    my $file = $_[0];
    my $cpu_numbers_ = $_[1];
    my @Runs;
    # Read one shell command per line from the generated tmp_*.sh script
    open ___IN, $file;
    while (<___IN>){
        chomp;
        push @Runs, $_;
    }
    close ___IN;

    # Fork up to $cpu_numbers_ of those commands at the same time
    my $pm = Parallel::ForkManager->new($cpu_numbers_);
    foreach my $run (@Runs){
        my $pid = $pm->start and next;
        `$run`;
        $pm->finish;
    }
    $pm->wait_all_children;
}

This function sets how many commands run simultaneously, similar to the parallel command in shell scripting, which in itself is not a problem. The issue is that each sub-command in the generated temporary tmp_xxx.sh scripts also requests its own threads, for instance samtools sort -@ 40. So if you have 100 fq data files queued and 40 commands are forked at once, each using 40 threads, up to 40 x 40 = 1600 threads are running at the same time. This can severely impact disk I/O and system load, potentially paralyzing the system.
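For illustration only, here is a minimal sketch of one way to avoid that oversubscription: divide a total thread budget between the concurrent forks, and pass the per-job share to the sub-commands when the tmp_*.sh script is generated. This is not METABOLIC's code; _run_parallel_capped, $total_threads, and $threads_per_job are hypothetical names.

use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical helper: cap the number of concurrent forks so that
# (forks x threads-per-command) stays near the machine's thread budget,
# e.g. a 64-thread budget with 8 threads per command gives 8 forks.
sub _run_parallel_capped {
    my ($file, $total_threads, $threads_per_job) = @_;
    my $max_forks = int($total_threads / $threads_per_job) || 1;

    open my $in, '<', $file or die "Cannot open $file: $!";
    chomp(my @runs = <$in>);
    close $in;

    my $pm = Parallel::ForkManager->new($max_forks);
    foreach my $run (@runs) {
        $pm->start and next;   # parent: queue the next command
        system($run);          # child: run one shell command
        $pm->finish;
    }
    $pm->wait_all_children;
}

On a 64-thread server with sub-commands like samtools sort -@ 8, this would keep roughly 8 x 8 = 64 threads busy instead of 1600.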

hyShen-hzau commented 3 months ago

Thanks for your help. I found the two lines containing _run_parallel("$output/tmp_calculate_depth.sh", $i); `rm $output/tmp_calculate_depth.sh`; in METABOLIC-C.pl and replaced them. I am currently testing with a small batch of data. Thanks for your reply, and I wish you all the best.

hyShen-hzau commented 3 months ago

Hello, after replacing the original line, a similar error still occurs. My server has 64 threads; is it not possible to run the process with 40 threads?

    (METABOLIC_v4.0) [hyShen@zhaolabserver METABOLIC]$ perl METABOLIC-C.pl -in-gn /home/hyShen/Metabolic/Sharp/test1 -r /home/hyShen/Metabolic/test.txt -o result1 -t 40
    [2024-08-11 11:45:35] The Prodigal annotation is running...
    [2024-08-11 13:37:04] The Prodigal annotation is finished
    readline() on closed filehandle _IN at METABOLIC-C.pl line 1909.
    [2024-08-11 13:48:17] The hmmsearch is running with 40 cpu threads...
    Cannot fork: Cannot allocate memory at /home/hyShen/miniconda3/envs/METABOLIC_v4.0/lib/perl5/site_perl/5.22.0/Parallel/ForkManager.pm line 52.

snpone commented 3 months ago

It seems like your .faa files may be too large, or your input files are damaged.

    readline() on closed filehandle _IN at METABOLIC-C.pl line 1909.

This means the function _get_faa_seq() failed in this block:

    # Store faa file into a hash
    %Total_faa_seq = (%Total_faa_seq, _get_faa_seq($file));
    if ($input_genome_folder){
        %Total_gene_seq = (%Total_gene_seq, _get_gene_seq("$file_name\.gene"));
    }
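For what it's worth, an unchecked open is the usual reason this failure only surfaces at readline(). A defensive sketch (hypothetical, not the project's code; $faa_file is an illustrative name) would fail loudly with the OS error instead:

use strict;
use warnings;

# Hypothetical defensive version: die with the OS error if the .faa file
# cannot be opened, rather than letting a later readline() hit a closed
# filehandle.
my $faa_file = shift @ARGV;
open my $in, '<', $faa_file
    or die "Cannot open $faa_file: $!";   # e.g. "No such file or directory"
while (my $line = <$in>) {
    chomp $line;
    # ... parse FASTA header/sequence lines here ...
}
close $in;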

It is also possible that your .faa files are too large to fit into RAM. Attached is a tiny script that reports your available RAM every 30 seconds. You can run it in a terminal as bash /PATH/TO/check_memory.sh, then re-run the .pl script.
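The attached script is not reproduced here; a rough Perl equivalent (a sketch, assuming a Linux system with /proc/meminfo) might look like:

#!/usr/bin/env perl
# Rough stand-in for the attached check_memory.sh.gz (not the original
# attachment): print available RAM and free swap every 30 seconds.
use strict;
use warnings;

while (1) {
    open my $mi, '<', '/proc/meminfo' or die "Cannot read /proc/meminfo: $!";
    while (my $line = <$mi>) {
        print scalar(localtime), "  $line"
            if $line =~ /^(MemAvailable|SwapFree):/;
    }
    close $mi;
    sleep 30;
}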

If you find that there is always ample remaining memory during the program's execution but you still receive the same error message, then please check whether your input files are corrupted.

check_memory.sh.gz

hyShen-hzau commented 3 months ago

hmmsearch takes too long to run (7 GB of data, over 15 h). Hello, after changing the number of threads, my memory space is sufficient. However, this is the third time that the hmmsearch phase has taken more than 24 hours, and I only input 7 GB of sequencing data. When I run -test it takes only 1.5 hours, and I checked that the -test data is 5 GB, which is not much different from my data size.

snpone commented 3 months ago

This is abnormal. Normally, hmmsearch with 40 threads processing 10 GB per sample, for dozens of samples, would take only about a dozen hours. Since I'm not familiar with your working environment, I can only guess that it might be a system issue. Try the glances command or sudo iotop to check the disk load: normal I/O throughput is around 100 MB/s or higher, and if it is consistently below 50 MB/s, another program may be occupying the disk, or it may be a precursor to physical disk failure. There are no other bugs in the program itself. Additionally, if the original input folder contains .fastq.gz files, please decompress them before running the program, which also saves runtime. Good luck.