Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

build_lmer_table failed. Exit code 35072 #136

Open mdebiasse opened 3 years ago

mdebiasse commented 3 years ago

Good morning, I am getting an error message with version 2.0.1 (full output below):

build_lmer_table failed. Exit code 35072

I am running the program with Singularity. This post (https://github.com/Dfam-consortium/RepeatModeler/issues/27) suggests editing the RepeatModeler script to lower the sample size for RepeatScout, but as I understand it, I can't access the scripts outside of the environment, since the run script builds the environment on the fly and there is no static image being deployed or recalled. Therefore, I'm not sure how to access the RepeatModeler script for editing outside of the environment. Any advice is appreciated!
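For context, a run launched straight from the Docker URI looks roughly like the sketch below; the bind path, database name, and options are illustrative, not the exact command used.

# Illustrative only: the container is pulled from the Docker registry and
# assembled at run time, so there is no local image whose scripts could be
# edited beforehand.
singularity exec --bind /mnt/derm:/mnt/derm \
    docker://dfam/tetools:1.3.1 \
    RepeatModeler -database /mnt/derm/<database>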

Docker image path: index.docker.io/dfam/tetools:1.3.1
Cache folder set to /home/mdebiasse/.singularity/docker
Creating container runtime...
RepeatModeler Version 2.0.1
===========================
Search Engine = rmblast 2.11.0+
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.5, RepeatMasker 4.1.2
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1619459304
Database = /mnt/derm .
  - Sequences = 22
  - Bases = 370782577
Using output directory = /home/mdebiasse/RM_1924.MonApr261749552021
Storage Throughput = poor ( 108.07 MB/s )
  - NOTE: Poor storage througput will have a large impact on RepeatModeler
          performance.  The low throughput observed above may be due to
          transient usage patterns on the system and may not reflect the
          actual system performance. Whenever possible run RepeatModeler
          in a directory stored on a fast local disk and not over a
          network filesytem.

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
      and the repetitive content of the sequences.  It is not imperative
      that RepeatModeler completes all rounds in order to obtain useful
      results.  At the completion of each round, the files ( consensi.fa, and
      families.stk ) found in:
      /home/mdebiasse/RM_1924.MonApr261749552021/
      will contain all results produced thus far. These files may be
      manually copied and run through RepeatClassifier should the program
      be terminated early.

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40037417 bp ( 40033117 non ambiguous )
   - Num Contigs Represented = 22
   - Sequence extraction : 00:04:10 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 14 )..
build_lmer_table failed. Exit code 35072
slurmstepd: error: Detected 1 oom-kill event(s) in step 2092087.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

jebrosen commented 3 years ago

build_lmer_table failed. Exit code 35072
slurmstepd: error: Detected 1 oom-kill event(s) in step 2092087.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

This one is a pretty clear-cut out-of-memory error, but you might not need to change any files to troubleshoot it (and we never did find out whether that attempted fix worked, anyway). RepeatModeler uses a sampling approach, so it is possible that you got an "unlucky" sequence sample in the first round that is more memory-intensive than normal. It is also possible that other jobs were using too much memory on the same compute node, and yours might have run just fine had it not been competing with them for resources.
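One quick way to see how close the job came to its memory limit (assuming Slurm accounting is enabled on your cluster) is something like:

sacct -j 2092087 --format=JobID,State,ReqMem,MaxRSS,Elapsed

A MaxRSS near ReqMem on the .batch step would back up the out-of-memory diagnosis.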

If you haven't already, I would first try running RepeatModeler again - possibly on a different compute node with more memory, or at a less busy time of day, if that's an option. If that doesn't work, or if you have already tried a few times and they all ended in out-of-memory errors, I'll look into workarounds for editing the scripts to try that approach next.
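If it does come to editing the script, one possible workaround sketch for Singularity is below. The install path /opt/RepeatModeler inside the dfam/tetools image is an assumption, and the value to change would be whichever variable sets the 40000000 bp round-1 sample size shown in the log above.

# Pull a static image so the container contents can be inspected
singularity pull tetools.sif docker://dfam/tetools:1.3.1

# Copy the RepeatModeler driver script out of the container (path is an assumption)
singularity exec tetools.sif cat /opt/RepeatModeler/RepeatModeler > RepeatModeler.patched

# Locate the round-1 sample size (40000000 bp in the log above) and lower it
grep -n 40000000 RepeatModeler.patched

# After editing, bind-mount the patched script over the original at run time
singularity exec --bind "$PWD/RepeatModeler.patched:/opt/RepeatModeler/RepeatModeler" \
    tetools.sif RepeatModeler -database <database>

The bind-mount trick avoids rebuilding the image for every tweak.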

mdebiasse commented 3 years ago

Thank you for the reply! I requested an exclusive node with 250 GB of memory and I think this solved the problem. Unfortunately, the run timed out, but the program got past the point where it failed before. I just relaunched with a longer wall time.
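In case it helps anyone else, the relaunch corresponds roughly to a batch script like the one below; the resource values and RepeatModeler options are illustrative, not a copy of my submission.

#!/bin/bash
#SBATCH --job-name=repeatmodeler
#SBATCH --exclusive            # whole node, no competing jobs
#SBATCH --mem=250G             # memory request that got past the round-1 OOM
#SBATCH --time=7-00:00:00      # longer wall time after the first run timed out
#SBATCH --cpus-per-task=16

singularity exec --bind /mnt/derm:/mnt/derm docker://dfam/tetools:1.3.1 \
    RepeatModeler -database /mnt/derm/<database> -pa 4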