TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Out of Memory but the job seemed already finished #143

Open ting-hsuan-chen opened 2 days ago

ting-hsuan-chen commented 2 days ago

Hello!

I ran EarlGrey (v4.4.4) for multiple genomes (sizes between 500 and 600 Mb) using Slurm. Some jobs completed, but others were flagged Out Of Memory (exit code 0).

For those OOM jobs, I checked the log file generated by earlGrey, and it seemed that the pipeline had completed, like the following:

       (   ) )
       ) ( (
     _______)_
  .-'---------|  
 ( C|/\/\/\/\/|
  '-./\/\/\/\/|
   '_________'
    '-------'
  <<< TE library, Summary Figures, and TE Quantifications in Standard Formats Can Be Found in ../01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/ >>>

And the number of files in the summary folder is the same as for the genomes whose runs completed.

ls -l 01.04_hy_v4h2_EarlGrey/01.04_hy_v4h2_summaryFiles/
total 175840
-rw-rw-r--. 1 cflthc powerplant      7979 Sep 26 06:18 01.04_hy_v4h2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    624387 Sep 26 06:18 01.04_hy_v4h2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   5679703 Sep 26 06:18 01.04_hy_v4h2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    306155 Sep 26 01:58 01.04_hy_v4h2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  38064532 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111409661 Sep 26 06:18 01.04_hy_v4h2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       542 Sep 26 01:58 01.04_hy_v4h2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      9184 Sep 26 06:18 01.04_hy_v4h2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      8157 Sep 26 01:58 01.04_hy_v4h2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12325 Sep 26 06:18 01.04_hy_v4h2_superfamily_div_plot.pdf

What could cause the OOM error? Which step consumes the most RAM? Should I rerun EarlGrey for the jobs with OOM errors, or can I ignore them? Or could this be a problem with our Slurm system instead?

P.S. I used 16 cores and 60 GB of RAM for each job.
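As a first check, Slurm's own accounting can show whether a job was actually killed for exceeding its memory request or only flagged after finishing. A minimal query (the job ID below is a placeholder; substitute your real one):

```shell
# Inspect the job's final state, exit code, requested memory, and
# peak resident memory (MaxRSS). 123456 is a placeholder job ID.
sacct -j 123456 --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS
```

A `State` of `OUT_OF_MEMORY` with `MaxRSS` close to `ReqMem` would point to a genuine memory limit being hit rather than a Slurm accounting quirk.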

Any guidance is much appreciated.

Cheers Ting-Hsuan

ting-hsuan-chen commented 2 days ago

Update: I compared two runs of the same genome. The first run was given 50 GB of RAM and failed with OOM; the second was given 60 GB and completed. I also found that the sizes of the output files (especially the TE library and the bed/gff files) in the summary folder of the completed run were larger than those of the OOM run. So I guess I'll need to rerun earlGrey for the failed genome.

The file content in the summary folder of OOM run:

total 176712
-rw-rw-r--. 1 cflthc powerplant      7630 Sep 22 13:19 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    619087 Sep 22 13:19 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   5795045 Sep 22 13:19 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    303431 Sep 22 09:18 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  37979930 Sep 22 13:19 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 112088923 Sep 22 13:19 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       489 Sep 22 09:18 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      8524 Sep 22 13:19 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      7878 Sep 22 09:18 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12786 Sep 22 13:19 01.01_red5_v2_superfamily_div_plot.pdf

The file content in the summary folder of completed run:

total 175776
-rw-rw-r--. 1 cflthc powerplant      7674 Sep 25 21:23 01.01_red5_v2_classification_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant    623389 Sep 25 21:23 01.01_red5_v2_divergence_summary_table.tsv
-rw-rw-r--. 1 cflthc powerplant   6183721 Sep 25 21:23 01.01_red5_v2-families.fa.strained
-rw-rw-r--. 1 cflthc powerplant    305545 Sep 25 16:16 01.01_red5_v2.familyLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant  37714153 Sep 25 21:23 01.01_red5_v2.filteredRepeats.bed
-rw-rw-r--. 1 cflthc powerplant 111215427 Sep 25 21:23 01.01_red5_v2.filteredRepeats.gff
-rw-rw-r--. 1 cflthc powerplant       489 Sep 25 16:16 01.01_red5_v2.highLevelCount.txt
-rw-rw-r--. 1 cflthc powerplant      8563 Sep 25 21:23 01.01_red5_v2_split_class_landscape.pdf
-rw-rw-r--. 1 cflthc powerplant      7880 Sep 25 16:16 01.01_red5_v2.summaryPie.pdf
-rw-rw-r--. 1 cflthc powerplant     12335 Sep 25 21:23 01.01_red5_v2_superfamily_div_plot.pdf

Is there a way to resume earlGrey from where it failed?

TobyBaril commented 1 day ago

Hi @ting-hsuan-chen!

In this case it is likely that the OOM step prevented proper processing during the divergence calculations, where the annotations are read into memory to calculate Kimura divergence. It is probably worth rerunning these jobs just to make sure.

You can rerun the failed steps of EarlGrey here by deleting ${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed, then resubmitting the job with exactly the same command-line options as before. EarlGrey will then skip stages that completed successfully, so in this case it should only rerun the defragmentation step and the divergence calculations.
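The resume procedure above can be sketched as follows. `OUTDIR` and `species` are placeholders for your output directory and the species name from the original run, and the final `earlGrey` invocation is illustrative only; reuse your own exact command line:

```shell
# Placeholders: set these to match the original run.
OUTDIR=/path/to/run_output
species=01.04_hy_v4h2

# Delete the merged-repeats output so EarlGrey reruns the
# defragmentation step and the divergence calculations.
rm "${OUTDIR}/${species}_mergedRepeats/looseMerge/${species}.filteredRepeats.bed"

# Resubmit with exactly the same options as before
# (illustrative command; use your original one).
earlGrey -g genome.fasta -s "${species}" -o "${OUTDIR}" -t 16
```

Because EarlGrey checks for existing outputs, earlier completed stages are skipped and only the work after the deleted file is redone.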

ting-hsuan-chen commented 1 day ago

Thank you @TobyBaril, I'll try it.