Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Clustering step of LTR pipeline fails #260

Open sjd028 opened 3 weeks ago

sjd028 commented 3 weeks ago

Describe the issue

When running RepeatModeler, I consistently hit the same failure at the clustering step of the LTR pipeline. The error message is essentially the same as in #241. However, shortening the sequence identifiers to fewer than 13 characters, as described in #241, does not help (I have also tried shortening the genome name and the database name); I still get exactly the same error.

I have tried three different genomes, all of which give the same error. The RECON/RepeatScout pipeline seems to work fine, and I get a -families.fa file containing the consensus families excluding LTRs.
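For anyone checking the same workaround, the identifier lengths in a FASTA file can be listed with something like the sketch below. The function name `long_ids` is hypothetical, and the 13-character threshold is taken from the description above rather than from the RepeatModeler documentation.

```shell
#!/usr/bin/env bash
# Print every FASTA identifier that is 13 characters or longer.
# The identifier is taken as the text between '>' and the first whitespace.
# The threshold of 13 is an assumption based on the workaround in #241.
long_ids() {
    awk -v max=13 '/^>/ { id = substr($1, 2); if (length(id) >= max) print id }' "$1"
}
```

Running `long_ids genome.fa` should print nothing if every identifier is already short enough.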

This is the error report I am getting in the stderr file:

LTRPipeline : Error - could not open /home/sjd028/RepeatModelerTesting/AterTest/RM_1178777.SatOct51620362024/LTR2708924.WedOct91432322024/clusters.dat! at /opt/RepeatModeler/LTRPipeline line 333.

This is the error I am getting in the stdout file:

LTR Structural Analysis
Running LtrHarvest... : 00:35:17 (hh:mm:ss) Elapsed Time
Running Ltrretriever... : 00:43:56 (hh:mm:ss) Elapsed Time
Aligning instances... : 00:04:37 (hh:mm:ss) Elapsed Time
Clustering...LTRPipeline: Error - could not cluster MAFFT results. : 00:00:00 (hh:mm:ss) Elapsed Time
LTRPipeline Time: 01:23:53 (hh:mm:ss) Elapsed Time

Reproduction steps

I ran RepeatModeler as a Singularity container on a computing cluster, giving the job 8 cores at 16 GB per core. This is the command I used:

singularity run $dfam RepeatModeler -database AterDbTest1 -threads 20 -LTRStruct

I tried three different genomes:

Drosophila melanogaster: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000778455.1/
Abscondita terminalis (firefly): https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_013368085.1/
Lamprigera yunnana (firefly): https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_013368075.1/

Log output

File structure output: [image]

AterDbTest1-rmod.log: AterDbTest1-rmod.log
slurm (computing cluster job manager) output file: slurm.hpc-4.272297.stdout.txt

Host system

This was run on a computing cluster running Linux. More info:

LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: Rocky
Description: Rocky Linux release 8.9 (Green Obsidian)
Release: 8.9
Codename: GreenObsidian

Singularity version: apptainer version 1.3.1-1.el8
The Singularity container was downloaded on July 2, 2024.

sjd028 commented 1 week ago

Additional info about host system:

The Dfam TETools container was installed using singularity. The version of RepeatModeler is 2.0.5. The version of the TETools package is 1.88.

rmhubley commented 1 day ago

First of all, you are allocating 8 cores for your job but telling RepeatModeler it has access to 20. I am surprised your job wasn't killed sooner, while it was running rmblast, but it could be that MAFFT is overallocating cores and the job is getting killed. MAFFT is also memory intensive, so I would double-check that you are actually giving your job 8x16GB, which should be adequate; perhaps it is getting less than that? Finally, you can rerun the LTR analysis separately for testing purposes like so:

./LTRPipeline -debug -threads # genome.fa

(NOTE: this command takes the original genome in FASTA format.) This generates more on-screen logging of what it is doing at each stage and keeps additional files in the LTR_######## output directory.
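One way to avoid the core mismatch is to derive the thread count from what Slurm actually allocated, rather than hard-coding it. The sketch below is only an illustration: `pick_threads` is a hypothetical helper, it assumes the job was submitted with `--cpus-per-task` (otherwise `SLURM_CPUS_PER_TASK` is not set), and the fallback of 1 is an arbitrary choice. The RepeatModeler command is the one from the report, printed rather than executed.

```shell
#!/usr/bin/env bash
# Cap the requested thread count at the number of cores Slurm allocated.
# SLURM_CPUS_PER_TASK is only set when the job requests --cpus-per-task;
# falling back to 1 when it is missing is an assumption for illustration.
pick_threads() {
    local requested="$1"
    local allocated="${SLURM_CPUS_PER_TASK:-1}"
    if [ "$requested" -gt "$allocated" ]; then
        echo "$allocated"
    else
        echo "$requested"
    fi
}

# Show the command that would be run, with the capped thread count.
threads="$(pick_threads 20)"
echo "singularity run \$dfam RepeatModeler -database AterDbTest1 -threads ${threads} -LTRStruct"
```

With an 8-core allocation this would pass `-threads 8` instead of 20, so rmblast and MAFFT are not told about cores the job does not have.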