TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Out of Memory but the job seemed already finished #143

Open ting-hsuan-chen opened 2 months ago

ting-hsuan-chen commented 2 months ago

Hello!

I ran EarlGrey (v4.4.4) on multiple genomes (500-600 Mb each) using Slurm. Some jobs completed, but others were marked Out Of Memory (exit code 0).

For the OOM jobs, I checked the log file generated by earlGrey, and it looked as though the pipeline had completed, like the following:

       (   ) )
       ) ( (
     _______)_
  .-'---------|  
 ( C|/\/\/\/\/|
  '-./\/\/\/\/|
   '_________'
    '-------'
  <<< TE library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in ../01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/ >>>

And the number of files in the summary folder is the same as for the genomes whose runs completed.

ls -l 01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/
total 175840
-rw-rw-r--. 1 cflthc powerplant      7979 Sep 26 06:18 01.04_hy_v4h2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    624387 Sep 26 06:18 01.04_hy_v4h2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   5679703 Sep 26 06:18 01.04_hy_v4h2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    306155 Sep 26 01:58 01.04_hy_v4h2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  38064532 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111409661 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       542 Sep 26 01:58 01.04_hy_v4h2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      9184 Sep 26 06:18 01.04_hy_v4h2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      8157 Sep 26 01:58 01.04_hy_v4h2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12325 Sep 26 06:18 01.04_hy_v4h2_superfamily_div_plot.pdf

What could cause the OOM error? Which step is the most RAM-consuming? Should I rerun EarlGrey for the genomes that hit the OOM error, or can I ignore it? Or could this be a problem with our Slurm system instead?

p.s. I used 16 cores and 60G of RAM for each job.

Any guidance is much appreciated.

Cheers Ting-Hsuan

ting-hsuan-chen commented 2 months ago

Update: I compared two runs of the same genome. The first run was given 50 GB of RAM and failed with OOM; the second run was given 60 GB of RAM and completed. I found that the output files (especially the TE library and the bed/gff files) in the summary folder of the two runs differ in size - in particular, the TE library from the completed run is larger than the one from the OOM run. So I guess I'll need to rerun earlGrey for the failed genome.

The file content in the summary folder of OOM run:

total 176712
-rw-rw-r--. 1 cflthc powerplant      7630 Sep 22 13:19 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    619087 Sep 22 13:19 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   5795045 Sep 22 13:19 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    303431 Sep 22 09:18 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  37979930 Sep 22 13:19 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 112088923 Sep 22 13:19 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       489 Sep 22 09:18 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      8524 Sep 22 13:19 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      7878 Sep 22 09:18 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12786 Sep 22 13:19 01.01_red5_v2_superfamily_div_plot.pdf

The file content in the summary folder of completed run:

total 175776
-rw-rw-r--. 1 cflthc powerplant      7674 Sep 25 21:23 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    623389 Sep 25 21:23 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   6183721 Sep 25 21:23 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    305545 Sep 25 16:16 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  37714153 Sep 25 21:23 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111215427 Sep 25 21:23 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       489 Sep 25 16:16 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      8563 Sep 25 21:23 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      7880 Sep 25 16:16 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12335 Sep 25 21:23 01.01_red5_v2_superfamily_div_plot.pdf

Is there a way to resume earlGrey from where it failed?

TobyBaril commented 2 months ago

Hi @ting-hsuan-chen!

In this case it is likely that the OOM event interrupted the divergence calculations, where the annotations are read into memory to calculate Kimura divergence. It is probably worth rerunning these jobs just to make sure.
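
For reference, a sketch of the calculation being done at that stage, assuming the standard Kimura 2-parameter form (whether a CpG-adjusted variant is applied is not covered here):

K = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]

where P and Q are the proportions of transitions and transversions between each annotated TE copy and its family consensus. Evaluating this for every copy is why the full annotation set is held in memory at this point.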

You can rerun the failed steps of EarlGrey here by deleting ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed and then resubmitting the job with exactly the same command-line options as before. EarlGrey will skip any stages that have already completed successfully, so in this case it should only rerun the defragmentation step and the divergence calculations.
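
For illustration, the rerun might look roughly like this (a sketch only; the Slurm options and the earlGrey flags are placeholders for whatever you used in the original submission):

# Delete the defragmented annotation so Earl Grey reruns defragmentation + divergence
rm ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed

# Resubmit with exactly the same options as the original run (values here are examples)
sbatch --cpus-per-task=16 --mem=60G --wrap \
  "earlGrey -g genome.fa -s ${species} -o ${OUTDIR} -t 16"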

ting-hsuan-chen commented 2 months ago

Thank you @TobyBaril, I'll try it.

ting-hsuan-chen commented 1 month ago

Hi @TobyBaril, it turned out that I restarted with a fresh run of earlGreyLibConstruct, because I only need the TE library files. The OOM error was raised for some genomes, with the end of the log files looking the same as I described previously. So I wanted to follow your instructions and rerun earlGreyLibConstruct for them; however, there is no folder called ${OUTDIR}/${species}_mergedRepeats. Is this expected? Could you advise on how to resume these jobs using earlGreyLibConstruct? Thanks! :)

TobyBaril commented 1 month ago

Hi @ting-hsuan-chen, the library construction terminates after TEstrainer, where the de novo libraries are generated. It will not run the final annotation and subsequent defragmentation. The idea with this subscript of Earl Grey is to generate the libraries, which can then be combined into a single non-redundant library that can be used to mask all the genomes at the end.

On this, I've made a note to add another subscript for the final annotation and defragmentation for the next release!
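
For anyone following the same route, a rough sketch of that combine-then-mask idea (cd-hit-est is just one common choice for the redundancy-removal step, not something Earl Grey prescribes, and the thresholds are illustrative):

# Pool the per-genome TEstrainer libraries (paths follow the summaryFiles layout shown above)
cat */*_summaryFiles/*-families.fa.strained > panTE_combined.fa

# Collapse near-identical consensi into a single non-redundant pan-library
cd-hit-est -i panTE_combined.fa -o panTE_nr.fa -c 0.9 -n 8 -T 10 -M 0

# Mask each genome with the shared library
RepeatMasker -lib panTE_nr.fa -pa 10 -xsmall genome.fa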

ting-hsuan-chen commented 1 month ago

Thank you @TobyBaril! earlGreyLibConstruct is exactly what I need - we are building a pan-TE library for multiple genomes. I have some follow-up questions.

I allocated 10 CPUs and a total of 100 GB of memory for each genome (each about 500-600 Mb in size). For some genomes I still got the "Out Of Memory" error from Slurm when using earlGreyLibConstruct, but I didn't find any error message in the log file.

For your reference, the tail of the log file for the Slurm job with the OOM error is attached below. It seems that the TEstrainer step completed? If not, how do I resume earlGreyLibConstruct to finish the job? Would the approach mentioned in https://github.com/TobyBaril/EarlGrey/issues/58#issuecomment-1757725110 suit my case?

Trimming and sorting based on mreps, TRF, SA-SSR
Removing temporary files
Reclassifying repeats
RepeatClassifier Version 2.0.5
======================================
  - Looking for Simple and Low Complexity sequences..
  - Looking for similarity to known repeat proteins..
  - Looking for similarity to known repeat consensi..
../01_EGLibConstruct/04.01_poly_v1_EarlGrey/04.01_poly_v1_strainer
Compiling library

              )  (
             (   ) )
             ) ( (
           _______)_
        .-'---------|  
       ( C|/\/\/\/\/|
        '-./\/\/\/\/|
         '_________'
          '-------'
        <<< Tidying Directories and Organising Important Files >>>

              )  (
             (   ) )
             ) ( (
           _______)_
        .-'---------|  
       ( C|/\/\/\/\/|
        '-./\/\/\/\/|
         '_________'
          '-------'
        <<< Done in 86:10:35.00 >>>

              )  (
             (   ) )
             ) ( (
           _______)_
        .-'---------|  
       ( C|/\/\/\/\/|
        '-./\/\/\/\/|
         '_________'
          '-------'
        <<< TE library in Standard Format Can Be Found in ../01_EGLibConstruct/04.01_poly_v1_EarlGrey/04.01_poly_v1_summaryFiles/ >>>

TobyBaril commented 1 month ago

Hi!

You can resume by following that comment. In the case of the log you posted, it looks as though the job should have completed successfully. OOM in Slurm can be a bit odd, though, as it terminates the job without giving it a chance to finish cleanly, so there's no guarantee that the step it interrupted finished properly.
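
If it helps to pin down where the kill happened, peak memory per Slurm step can be pulled from the accounting database after the fact (the job ID is a placeholder; note this reports per Slurm step, not per pipeline stage):

# Show peak resident memory, requested memory, and exit state for each step of the job
sacct -j <jobid> --format=JobID,JobName%20,MaxRSS,ReqMem,Elapsed,State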

jeankeller commented 1 month ago

Hi,

I've also had the same issue when running EarlGrey on a SLURM HPC, and what surprised me is that for exactly the same input file, one run consumed more than 300 GB of RAM (and triggered an OOM error) while the second run ended without error and a maximum consumption of ca. 20 GB of RAM. On large genomes (>600 Mb to 3 Gb) I systematically get OOM errors (with a varying number of tasks killed for OOM), with EarlGrey requiring huge amounts of RAM (more than 500 GB). However, the masking still finished.

TobyBaril commented 1 month ago

Hi @jeankeller,

This is strange...are you happy to provide the log files from both runs for us to take a look at?

jeankeller commented 1 month ago

Hi @TobyBaril, thanks for your answer. As it was a few months ago, the log file of the failed run has been removed. I can still share the one for the run that worked, but I'm not sure it will be useful... I was a bit surprised that, whatever the genome size, SLURM returns OOM errors (slurmstepd: error: Detected 1 oom_kill event in StepId=12315883.batch. Some of the step tasks have been OOM Killed), with the number of killed tasks varying from run to run. I can share any logs with you if you want. [EDIT] I just realized that we are running through conda. Could it be related to conda?

Best, Jean

TobyBaril commented 1 week ago

Hi Jean,

It shouldn't be related to conda. It might be related to the divergence calculations, several of which run in parallel, although this shouldn't really cause any issues until we get to files with millions of TE hits... I'll keep trying to narrow this down - it is a strange one, as nothing in the core pipeline has changed for several iterations now!

jeankeller commented 1 week ago

Hi Toby,

Yes, it is weird. The HPC team installed EarlGrey as a SLURM module instead of a user conda environment, and in the tests I have run the error seems to have gone. I am running more tests on species with different genome sizes to confirm the pattern. I can share with you the log of the failed run (under the conda environment) that used more than 300 GB of RAM; we redid it the exact same way and it consumed only 10-15 GB of RAM. Best, Jean

ting-hsuan-chen commented 1 week ago

Hi Toby, it seems to me that the divergence calculations are not the cause. I've been using a conda environment and submitting jobs to SLURM, and I only use earlGreyLibConstruct, which doesn't include the divergence calculations, yet the huge RAM consumption persists. I've run earlGreyLibConstruct on several plant genomes separately, each around 500-600 Mb. I kept getting OOM errors for some of them and therefore needed to empty the "strainer" folder and resume the analysis with more RAM, roughly as sketched below. Some runs used <150 GB of RAM, while others needed 300 GB or more. @jeankeller it's great to know that a SLURM module installation might solve the problem. I'll contact our HPC team and see whether that can be done on our side.
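
For reference, the resume I used looks roughly like this (a sketch only; the _strainer path follows the earlGreyLibConstruct layout from my log above, and the flags and memory value are just what I happened to use):

# Clear the partially written TEstrainer output so that stage reruns from scratch
rm -rf ${OUTDIR}/${species}_EarlGrey/${species}_strainer/*

# Resubmit the same earlGreyLibConstruct command with a larger memory request
sbatch --cpus-per-task=10 --mem=300G --wrap \
  "earlGreyLibConstruct -g genome.fa -s ${species} -o ${OUTDIR} -t 10"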

Cheers Ting-Hsuan

TobyBaril commented 1 week ago

Okay, so this looks like it could be linked to something in TEstrainer, or potentially a conda module. @jamesdgalbraith might be able to provide more information on the specific sections of TEstrainer that could be the culprit, but we will look into it.

jamesdgalbraith commented 1 week ago

The memory-hungry stage of TEstrainer is the multiple-sequence alignment step using MAFFT. The amount of memory used can vary between runs on the same genome depending on several factors, including which repeats that particular run of RepeatModeler found (the seed it uses varies), and especially whether it detects satellite repeats, as constructing MSAs of long arrays of tandem repeats is very memory-hungry. This may be what you're encountering, @jeankeller. Unfortunately I don't currently have a fix for this, but I have been exploring potential ways of overcoming the issue.
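
One quick way to check whether long satellite consensi are involved is to list the longest sequences in the family FASTA (a diagnostic sketch only; the file name is taken from the summaryFiles listings above):

# Print the ten longest consensus sequences; very long tandem-repeat consensi are
# the ones whose MSAs tend to blow up MAFFT's memory use
awk '/^>/{if(n)print l"\t"n; n=$0; l=0; next}{l+=length($0)} END{if(n)print l"\t"n}' \
  01.01_red5_v2-families.fa.strained | sort -nr | head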

In the first jobs you mention, @ting-hsuan-chen, I don't think the OOM happened in TEstrainer, given the presence of the 01.01_red5_v2-families.fa.strained file in the summary folder. In testing, I've found that if TEstrainer causes an OOM error, EarlGrey will stop at TEstrainer and not continue with the RepeatMasker annotation and tidy-up.