Running repeat modeler for 7+ days on Cluster

RajvParvathaneni commented 2 years ago

Describe the issue

I am running RepeatModeler on my university cluster with a fungal genome (~46 MB), using the commands below. It has completed 2 rounds and is still running. How long should this take? Or do I need to modify my commands to make it run faster? Any guidance is appreciated.

BuildDatabase -name E2 -engine ncbi xxx.fa
RepeatModeler -engine ncbi -pa 3 -database E2 > E2-repeat.out
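The two commands above are separate steps: BuildDatabase creates the database, and RepeatModeler then runs against it. Below is a minimal sketch of a wrapper that keeps the steps apart and sends both stdout and stderr to one log file. The function name `run_repeatmodeler` and the log file name are hypothetical; the flags are simply the ones used in this post.

```shell
# Hypothetical wrapper: build the database, then model repeats, logging everything.
# Usage: run_repeatmodeler <db_name> <genome.fa> <parallel_jobs>
run_repeatmodeler() {
    db_name=$1
    genome=$2
    jobs=$3
    log="${db_name}-repeat.log"   # assumed log name, not from the post

    # Step 1: build the sequence database (flags as in the original post).
    BuildDatabase -name "$db_name" -engine ncbi "$genome" > "$log" 2>&1 || return 1

    # Step 2: run RepeatModeler; append stdout and stderr to the same log.
    RepeatModeler -engine ncbi -pa "$jobs" -database "$db_name" >> "$log" 2>&1
}
```

For the run described here, the call would be `run_repeatmodeler E2 xxx.fa 3`. Capturing stderr as well as stdout means any warnings end up next to the round-by-round progress messages.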

A concise description of the bug, including any error messages.

Reproduction steps

Steps to reproduce the behavior, including the command lines given to the program and links to publicly available genome assemblies and other data files (if available).

Log output

Please paste or attach any and all log output, which includes useful information such as data file statistics and version numbers. An easy way to capture this is to redirect the log output to a file, e.g. RepeatModeler -database mydb >& output.log. The log output should include the "random seed" value at the start of the run. This number is needed to reproduce the run exactly.
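The `>&` redirection is csh/bash shorthand; a portable equivalent is shown in the comment below. The helper `find_seed_line` is a hypothetical sketch for pulling the seed out of a captured log. The exact wording of the seed line is an assumption, so it matches loosely on the word "seed".

```shell
# Portable equivalent of `RepeatModeler -database mydb >& output.log`:
#   RepeatModeler -database mydb > output.log 2>&1

# Hypothetical helper: print the first log line that mentions a seed.
# The exact format of that line is assumed, so the match is case-insensitive.
find_seed_line() {
    grep -i 'seed' "$1" | head -n 1
}
```

Calling `find_seed_line output.log` prints the matching line, if any, which can then be pasted into the issue.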

Environment (please include as much of the following information as you can find out):

How did you install RepeatModeler? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?
Which version of RepeatModeler do you have? The output of RepeatModeler without any options will be a help page with the version of the program displayed at the top.
Which version of RepeatMasker is this RepeatModeler installation using? Have you installed RepBase RepeatMasker Edition for RepeatMasker, or the full Dfam database?
Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
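The environment details requested above can be collected in one pass. This is a minimal sketch; the function name `collect_env` and the default output file name are hypothetical, and it only records what `uname`, `lsb_release`, and the shell's `command -v` report.

```shell
# Hypothetical helper: gather the requested environment details into one file.
collect_env() {
    out=${1:-environment.txt}
    {
        echo "== uname -a =="
        uname -a
        echo "== lsb_release -a =="
        # lsb_release is not installed on every system, so fall back to a note.
        if command -v lsb_release >/dev/null 2>&1; then
            lsb_release -a 2>/dev/null || echo "lsb_release failed"
        else
            echo "lsb_release not available"
        fi
        echo "== RepeatModeler location =="
        command -v RepeatModeler || echo "RepeatModeler not found on PATH"
    } > "$out"
}
```

Running `collect_env` inside the cluster job, rather than on the login node, records the environment the job actually sees.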

Additional context

Add any other context you have about the problem here. Some possible examples:
- If an older version of RepeatModeler worked before
- If the problem only happens with specific data files

ChristophePatterson commented 2 years ago

Hi Rajiv,

I was wondering if you found a solution for this? I'm having a similar problem running RepeatModeler2 on a cluster with a 1.4 Gb genome. Specifically, round 5 has an estimated duration of >200 hours, which exceeds the maximum job length my cluster allows. Several other issues appear to report the same problem.
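When a round outlasts the scheduler's time limit, one option that may be worth trying is resuming from the previous working directory: the RepeatModeler help text describes a `-recoverDir` option for continuing an interrupted run. The sketch below only locates the most recent `RM_*` working directory. The helper name `latest_rm_dir` is hypothetical, and whether recovery helps in this particular case is not confirmed anywhere in this thread.

```shell
# Hypothetical helper: print the most recently modified RM_* working directory
# under the given directory (default: current directory).
latest_rm_dir() {
    ls -dt "${1:-.}"/RM_*/ 2>/dev/null | head -n 1
}

# Possible resume step in a resubmitted job (database name is a placeholder):
#   RepeatModeler -database mydb -recoverDir "$(latest_rm_dir)"
```

If no `RM_*` directory exists, the helper prints nothing, so the resume command should be guarded accordingly.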

In #158, @jebrosen appeared to resolve the problem by including export BLAST_USAGE_REPORT=false. This hasn't resolved the problem for me, unfortunately.
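For anyone trying the workaround from #158, the variable has to be exported in the shell that launches RepeatModeler, which on a cluster usually means inside the job script itself, since batch jobs may not inherit an interactive session's environment. A minimal sketch follows; the check function `blast_report_setting` is hypothetical.

```shell
# Disable BLAST usage reporting for everything launched from this shell,
# as suggested in #158. Set this before RepeatModeler starts.
export BLAST_USAGE_REPORT=false

# Hypothetical sanity check: confirm that child processes see the setting.
blast_report_setting() {
    sh -c 'echo "${BLAST_USAGE_REPORT:-unset}"'
}
```

Adding `blast_report_setting` to the job script just before the RepeatModeler call writes the effective value to the job's output.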

Any advice would be greatly appreciated.

Kind regards,

Christophe

cinnamon259 commented 1 year ago

Were either of you able to fix this? I am also running into this issue on a cluster, with a genome size of 1.1 Gb. The estimated time for round 6 is around 400 hours. I would also appreciate any advice or insight.

Thank you,

Cinnamon

JKing2000 commented 1 year ago

For me, using RepeatModeler installed with conda significantly reduces run time compared to a manually installed RepeatModeler. Both were run on the same university system, on the same genome FASTA file, and with the same random seed. I'm not sure what the reason for this is, but I would love to know.
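One way to start narrowing down such a difference is to record which copy of each tool is picked up in each setup. This is a sketch only: the helper name `report_tool_paths` is hypothetical, and the idea that the two installs resolve to different underlying binaries is a guess, not something established in this thread.

```shell
# Hypothetical helper: report which copy of each tool is first on PATH.
# Run once with the conda environment active and once with the manual install.
report_tool_paths() {
    for tool in "$@"; do
        path=$(command -v "$tool" 2>/dev/null) || path="not found"
        printf '%s\t%s\n' "$tool" "$path"
    done
}

# Example call (tool names other than RepeatModeler are assumptions about
# what a given installation provides):
#   report_tool_paths RepeatModeler RepeatMasker rmblastn
```

Comparing the two outputs side by side shows whether the installs differ in anything beyond RepeatModeler itself.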

Dfam-consortium / RepeatModeler

Running repeat modeler for 7+ days on Cluster #163