Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

RepeatModeler runs for a long time #143

Open munegowda opened 3 years ago

munegowda commented 3 years ago

I recently installed RepeatModeler2 from https://www.repeatmasker.org/RepeatModeler/RepeatModeler-2.0.1.tar.gz, and I do not get any errors when I run it. However, it does not finish on any genome assembly, even after running for more than a week.

For example, I'm running RepeatModeler2 on the raccoon dog assembly (renamed HLnycPro4) from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/146/905/GCA_905146905.1_NYPRO_anot_genome/GCA_905146905.1_NYPRO_anot_genome_genomic.fna.gz

It is run with the following command:

RepeatModeler -pa 8 -engine ncbi -database HLnycPro4 &>>log.model.txt

This has run for more than 6 days and has not yet finished. Here is the output log file: log.model.txt. I'm unable to figure out why it takes an unusually long amount of time to run. I would be grateful for any help in resolving this issue.

Thanks

jebrosen commented 3 years ago

I'm unable to figure out why it takes an unusually long amount of time to run.

Why do you say unusually? RepeatModeler can take quite a long time to run, although it varies between assemblies and machines.

I did notice that your log file contains two runs; the first one stopped pretty early on. The second run shows much worse disk throughput, though:

Using output directory = /beegfs/scratch/tmp/cmunegowda/TEMP_HLnycPro4_modelAndMask/RM_3134750.WedMay121145292021
Storage Throughput = fair ( 697.47 MB/s )
(...)
Using output directory = /beegfs/scratch/tmp/cmunegowda/TEMP_HLnycPro4_modelAndMask/RM_3309014.SatMay151153452021
Storage Throughput = poor ( 174.71 MB/s )
 - NOTE: Poor storage througput will have a large impact on RepeatModeler
         performance.  The low throughput observed above may be due to
         transient usage patterns on the system and may not reflect the
         actual system performance. Whenever possible run RepeatModeler
         in a directory stored on a fast local disk and not over a
         network filesytem.

RepeatModeler should indeed perform much better when run on a local filesystem instead of a network filesystem, if that is an option available to you.
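When a log file contains several runs appended together, as log.model.txt does here, it can help to pull out the reported throughput for each run side by side. The following is only a minimal sketch: it assumes the two line formats shown in the excerpt above ("Using output directory = ..." and "Storage Throughput = ..."), and the function name `throughput_by_run` is made up for illustration and is not part of RepeatModeler.

```python
import re

# Both patterns assume the log line formats shown in the excerpt above.
DIR_RE = re.compile(r"Using output directory = (\S+)")
TP_RE = re.compile(r"Storage Throughput = (\w+) \(\s*([\d.]+) MB/s\s*\)")


def throughput_by_run(log_text):
    """Return one (output_dir, rating, mb_per_s) tuple per run found in the log."""
    runs = []
    current_dir = None
    for line in log_text.splitlines():
        dir_match = DIR_RE.search(line)
        if dir_match:
            # Remember the most recent output directory; the throughput
            # line for that run follows it in the log.
            current_dir = dir_match.group(1)
            continue
        tp_match = TP_RE.search(line)
        if tp_match:
            runs.append((current_dir, tp_match.group(1), float(tp_match.group(2))))
    return runs


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python throughput_by_run.py log.model.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for out_dir, rating, rate in throughput_by_run(handle.read()):
                print(f"{rate:8.2f} MB/s  {rating:<6}  {out_dir}")
```

Since `&>>` appends to the log, every restart adds another pair of lines, so this gives a quick history of how the storage behaved at the start of each attempt.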

MichaelHiller commented 3 years ago

Dear Jeb, thanks for your help. We were running an older RepeatModeler version (1.0.8) on a high-performance file system (Lustre), and it typically finished within 1-2 days. Now we have the latest RepeatModeler version and run it on another high-performance file system (BeeGFS), which in principle should be even more performant than a local file system on a single disk. However, it doesn't finish even after a week or so.

We will test running it on the local disk, but I am not sure that is the issue.

Can you think of any other reason why it doesn't finish in a reasonable amount of time (a few days, not weeks)?

Thanks again, Michael

jebrosen commented 2 years ago

It comes as a bit of a surprise to me that a network filesystem could ever be faster than the local disk - but it seems I haven't noticed just how fast Ethernet speeds have gotten.

Even so, the observed performance of 174.71 MB/s in this particular run is pretty poor. On a few machines I tested, local disks were around 1700 MB/s and an NFS filesystem was around 100 MB/s.
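For a rough comparison of candidate working directories (for example a local scratch disk versus the BeeGFS mount), something like the sketch below could be used. It is not how RepeatModeler computes its "Storage Throughput" figure: it only times sequential writes of a temporary file, so the numbers will not match the log exactly. The function name and the default sizes are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=256, chunk_mb=4):
    """Time a sequential write of roughly size_mb into directory; return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    n_chunks = max(1, size_mb // chunk_mb)
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(n_chunks):
            handle.write(chunk)
        handle.flush()
        # Force the data out of the page cache so the timing includes
        # the actual transfer to the storage backend.
        os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    return (n_chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python write_throughput.py /local/scratch /beegfs/scratch
    for directory in sys.argv[1:]:
        print(f"{directory}: {write_throughput_mb_s(directory):.2f} MB/s")
```

Because throughput on shared filesystems can vary with other users' load, repeating the measurement a few times at different hours gives a fairer picture than a single run.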