Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool
Other
182 stars 23 forks source link

Local RepeatModeler and RepeatMasker installation slower than conda install #217

Closed drs closed 8 months ago

drs commented 10 months ago

Describe the issue

Local install of RepeatModeler and RepeatMasker appears to be much slower than expected. This includes every steps of the repeat library construction and masking (including BuildDatabase, RepeatModeler and RepeatMasker). Repeat library construction for a 20 Mb genome took over 3 days on a 192 Gb RAM, 56 cores (dual E5-2680V4) system with very little resource usage at every step. Using a conda installed RepeatModeler, the repeat library construction took 29m26s. System load average remained lower than 1 during the execution of the locally installed RepeatMasker (I did not monitor RepeatModeler, but suspect that the resource usage will be similar).

I replicated the issue using Yeast ChrVII (NC_001139.9). For example, the BuildDatabase command takes 0.123s with conda version and 3m21.891s with the local install. RepeatModeler takes 37s with conda install and 2h25 with the local install.

Reproduction steps

BuildDatabase -name NC_001139.replib.fasta -engine ncbi ../NC_001139.fasta
RepeatModeler -database NC_001139.replib.fasta -threads 20

I can reproduce the issue with Saccharomyces cerevisiae chromosome VII (https://www.ncbi.nlm.nih.gov/nuccore/NC_001139.9).

Log output

RepeatModeler-Local.log RepeatModeler-Conda.log

Environment:

Additional context

I have observed the same issue with a prior installation of RepeatModeler (v2.0.3) and RepeatMasker (v4.1.4) that was working fine until now.

The issue appears to be similar to what is described by @JKing2000 in https://github.com/Dfam-consortium/RepeatModeler/issues/163#issuecomment-1696654025 so I decided to open an issue. I am not sure if it's a bug or an issue with my install or Linux system.

rmhubley commented 10 months ago

Thank you for the wealth of details in this report. Since both Conda/Local runs appear to be using the same versions of the software packages there shouldn't be a significant difference in timing as you point out. One thing I noticed from your two logs is that there was a difference in the disk writing throughput between the slow and faster runs. I realize that you used the same filesystem in both tests however I wonder if this might have been a transient problem ( e.g NFS lag, or network traffic or system load etc ). Is /analysis/TE directory on a local filesystem? Another helpful way to test would be to give both runs the same random number generator seed ( e.g make up a number and perform both runs with -srand mynumber ). Finally, as you run each you might want to run "top" with a fast update time to monitor the cpu utilization of the program (and it's invocations) to make sure they are getting generating similar CPU loads.

JKing2000 commented 10 months ago

I think we have the same issue.

I suspect it may have something to do with rmblast. I tried changing my local repeatmodeler to use rmblast from conda, and run time improved significantly compared to using my local rmblast. Although repeatclassifier is still a bit slower compared to using conda. (see log files below, very similar run times from both local and conda) I simply changed the RMBLAST_DIR in the RepModelConfig.pm configuration file to the conda rmblast bin. Everything else is still pointing to the local bins (Repeatmasker, TRF etc)

Here's the log files. conda.log local-with-conda-rmblast.log

rmhubley commented 10 months ago

How was RMBlast installed locally on your systems? Was it compiled from source or did you use one of our provided binary distributions? If the latter could you provide the md5sum of the rmblastn binary ( e.g. "md5sum rmblast/bin/rmblastn" ).

JKing2000 commented 10 months ago

I tried both (pre-compiled version and compiling from source version), also tried 2.13.0 and 2.14.0. Repeatmodeler run using any of them were significantly slower (I did not complete these runs, they were terminated early as they were already taking atleast 5x longer just to generate the RM_workingdir folder).

The md5sum for rmblastn 2.14.0 I have is: 68c323efb2090f9a84f087e8972a068e

rmhubley commented 10 months ago

This is extremely helpful. Would you mind comparing run times between the local/conda rmblast packages using this script: http://www.repeatmasker.org/~rhubley/test.sh

You will need to edit the top of the script to set the directories to the local/conda installations and copy the example NC_001139.9 sequence to "NC_001139.9.fa" in the same directory.

If you run this several times (just to get average runtimes) hopefully we can eliminate/implicate rmblast as the culprit. Also, does the conda rmblastn have the same md5sum signature?

JKing2000 commented 10 months ago

I updated conda's installation so it's now rmblast 2.14.0 instead of 2.13.0, the md5sum for rmblastn in conda's installation is 62eb41652f25265dc7b77095c2eeb37d, not matching the one from local.

I tested your script 5 times and here are the outputs: test1.txt test2.txt test3.txt test4.txt test5.txt

Doesn't seem like there is much of a difference. However, I noticed that today, for whatever reason, the local Repeatmodeler (using local rmblast) decided to run faster all of a sudden (atleast started to generate the RM_workingdir in a reasonable amount of time) but the whole run is still slower than conda. Here are the logs I ran today for comparison, ~4min (local) compared to ~1min (conda) and ~1min (local configurated to conda rmblast). (yesterday it wasn't even generating the RM_workingdir after 5-10mins using local rmblast). conda.log local.log local-with-conda-rmblast.log

drs commented 10 months ago

My conda rmblastn has a different md5sum : 9f13eb95fa97d74b15f44d4ac65709de. Version 2.14.0 (build h4565617_2 from bioconda channel).

I ran the test script twice and got similar result both time : test.log

rmhubley commented 9 months ago

Unfortunately, I cannot reproduce this problem with either version of RepeatModeler/RMBlast. Do you share this server with other users? Perhaps there is an intermittent IO/CPU load that is coincidentally throwing off the comparisons? The example you are running also might be a little too small to get good average runtime values.

drs commented 8 months ago

Thank you for your help. The computer I use is not shared with other users at this time, and I replicated the issue multiple times with various datasets (including 20-50 Mb genomes) so I am fairly confident that there was some issue. But, since the RepeatModeler/RepeatMasker module was working perfectly properly for some time, then became very slow (likely after a system update) I think that this is something very specific.

It is solved after re-installed the latest version of the tools from scratch (RepeatModeler v2.0.5, RepeatMasker v4.1.5 and RMblast v2.14.1 [locally compiled]) so I will close this issue.