Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Segmentation fault and no recovery possible #222

Open RNieuwenhuis opened 7 months ago

RNieuwenhuis commented 7 months ago

Describe the issue

I ran RepeatModeler on a very large genome in the TETools Singularity container on a machine with 64 cores and over 750 GB of RAM. I raised the maximum sample size to 1 Gbp so that a decent fraction of the genome is sampled. The run went fine for about a week and reached the eledef stage of round-5, where it exited with code 139, which corresponds to a segmentation fault (128 + signal 11, SIGSEGV).

99% completed,  00:0:00 (hh:mm:ss) est. time remaining.
      100% completed,  00:0:00 (hh:mm:ss) est. time remaining.
Comparison Time: 62:27:00 (hh:mm:ss) Elapsed Time, 757981572 HSPs Collected
  - RECON: Running imagespread..
RECON Elapsed: 01:34:19 (hh:mm:ss) Elapsed Time
  - RECON: Running initial definition of elements ( eledef )..
eledef failed. Exit code 139
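
For reference, the reading of exit code 139 as a segmentation fault follows the common shell convention that a process killed by signal N is reported with status 128 + N. The snippet below is a minimal sketch of that arithmetic, not part of RepeatModeler; the function name `describe_exit_code` is made up for illustration, and the signal name lookup assumes a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper, not a RepeatModeler API)."""
    if code > 128:
        # Shells conventionally report "killed by signal N" as 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


# The status reported for eledef in the log above.
print(describe_exit_code(139))
```

On Linux, signal 11 is SIGSEGV, so a status of 139 points to an invalid memory access inside eledef rather than an error that the pipeline detected itself.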

Restarting the pipeline using -recoverDir results in the following message:

Oops...the RM_3967425.WedNov81814392023 run did not get passed round-1.
It makes more sense to restart this run from the beginning.
Remove the -recoverDir option and rerun the program.

The consensi.fa and families.stk files are empty, both at the top level of the run directory and in the directory for each round.

ls -l ./
total 2004
-rw-r--r-- 1 nieuw133 domain users       0 Nov  8 18:49 consensi.fa
-rw-r--r-- 1 nieuw133 domain users       0 Nov  8 18:49 families.stk
-rw-r--r-- 1 nieuw133 domain users    7801 Nov 14 12:28 rmod.log
drwxr-xr-x 2 nieuw133 domain users   57344 Nov  8 18:47 round-1
drwxr-xr-x 4 nieuw133 domain users   53248 Nov  8 20:05 round-2
drwxr-xr-x 4 nieuw133 domain users  159744 Nov  9 05:24 round-3
drwxr-xr-x 4 nieuw133 domain users  434176 Nov 11 16:46 round-4
drwxr-xr-x 5 nieuw133 domain users 1318912 Nov 14 09:19 round-5
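
To make the same check repeatable across the whole run directory, something like the following could be used. It is a rough sketch that assumes only the layout visible in the listing above (a run directory with `round-N` subdirectories, each possibly holding `consensi.fa` and `families.stk`); the function name `summarize_outputs` is hypothetical and nothing here is taken from RepeatModeler's own code.

```python
import sys
from pathlib import Path

# Output files named in this report; other files are ignored.
OUTPUT_NAMES = ("consensi.fa", "families.stk")


def round_number(directory: Path) -> int:
    """Sort key: the N in 'round-N', or -1 if the suffix is not a number."""
    suffix = directory.name.split("-", 1)[1]
    return int(suffix) if suffix.isdigit() else -1


def summarize_outputs(run_dir):
    """Return (relative path, size in bytes) for each output file that exists."""
    run_dir = Path(run_dir)
    rounds = sorted(
        (d for d in run_dir.glob("round-*") if d.is_dir()),
        key=round_number,
    )
    report = []
    for directory in [run_dir, *rounds]:
        for name in OUTPUT_NAMES:
            candidate = directory / name
            if candidate.is_file():
                rel = candidate.relative_to(run_dir).as_posix()
                report.append((rel, candidate.stat().st_size))
    return report


if __name__ == "__main__":
    # Pass the RM_* run directory as the first argument; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel_path, size in summarize_outputs(target):
        status = "EMPTY" if size == 0 else f"{size} bytes"
        print(f"{rel_path}: {status}")
```

Run against the RM_* directory, this lists every consensi.fa and families.stk it finds and flags the zero-byte ones, which shows at a glance whether any round wrote results.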

Reproduction steps

RepeatModeler -database My_genome -threads 64 -LTRStruct -genomeSampleSizeMax 1000000000

Log output

See above

Environment (please include as much of the following information as you can find out):

I used the latest TETools Singularity image as is, mounted my working directory, and built the database with BuildDatabase. No other databases were installed.

What is going on? Why are there no results stored that I can use for recovery?