Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Running repeat modeler for 7+ days on Cluster #163

Open RajvParvathaneni opened 2 years ago

RajvParvathaneni commented 2 years ago

Describe the issue

I am running RepeatModeler on my university's cluster with my fungal genome (~46 MB), using the commands below. It has completed 2 rounds and is still running. How long should it take? Or do I need to modify my commands to make it run faster? Any guidance is appreciated.

BuildDatabase -name E2 -engine ncbi xxx.fa
RepeatModeler -engine ncbi -pa 3 -database E2 > E2-repeat.out

ChristophePatterson commented 2 years ago

Hi Rajiv,

I was wondering if you found a solution for this? I'm having a similar problem running RepeatModeler2 on a cluster with a 1.4 Gb genome. Specifically, round 5 has an estimated duration of >200 hours, which is longer than the maximum job time my cluster allows. A few other issues appear to have been raised by people with the same problem.

In #158, @jebrosen appeared to resolve the problem by setting export BLAST_USAGE_REPORT=false before running. Unfortunately, this hasn't resolved the problem for me.
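For anyone trying the workaround from #158, here is a minimal sketch of how the variable might be set in a job script before RepeatModeler starts. It assumes the database name E2 and the command-line options from the original post; DB_NAME is a hypothetical variable introduced only for this example. BLAST_USAGE_REPORT is an environment variable read by BLAST+, and setting it to false turns off usage reporting to NCBI, which may matter on compute nodes without outbound network access. Whether it helps in a given setup is not guaranteed, as noted above.

```shell
#!/usr/bin/env bash
# Turn off BLAST+ usage reporting for every process started from this script.
export BLAST_USAGE_REPORT=false

# Hypothetical database name, taken from the original post in this thread.
DB_NAME="E2"

# Run RepeatModeler only if it is available on PATH; the options mirror the
# original post. Both stdout and stderr are captured in a single log file.
if command -v RepeatModeler >/dev/null 2>&1; then
    RepeatModeler -engine ncbi -pa 3 -database "$DB_NAME" > "${DB_NAME}-repeat.out" 2>&1
else
    echo "RepeatModeler not found on PATH; skipping run"
fi
```

The export has to come before the RepeatModeler call so that the search processes it launches inherit the setting.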

Any advice would be greatly appreciated.

Kind regards,

Christophe

cinnamon259 commented 1 year ago

Were either of you able to fix this? I am running into the same issue on a cluster with a 1.1 Gb genome. The estimated time for round 6 is around 400 hours. I would also appreciate any advice or insight.

Thank you,

Cinnamon

JKing2000 commented 1 year ago

For me, using RepeatModeler installed with conda significantly reduces run time compared to a manually installed RepeatModeler. Both were run on the same university system, on the same genome FASTA file, and with the same random seed. I'm not sure what the reason is, but I would love to know.