Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to reduce the number of batches? #10

Closed HHHit closed 6 years ago

HHHit commented 6 years ago

I ran RepeatMasker on a single 4.2 Gb FASTA file, and while it was running, the progress output showed:

identifying most interspersed repeats in batch 1 of 35529
identifying long interspersed repeats in batch 1 of 35529
identifying ancient repeats in batch 1 of 35529
identifying retrovirus-like sequences in batch 1 of 35529

I have seen other users get only one batch. Why do I have so many batches? The run is really slow. Also, using a single core takes more than 50 GB of memory. How can I run it faster, and how can I reduce the number of batches? Thanks.

Search Engine: NCBI/RMBLAST [ 2.6.0+ ]
Master RepeatMasker Database: /usr/local/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )

HHHit commented 6 years ago

Using a larger -frag value decreases the number of batches. With -frag 60000000, the number of batches dropped to 36.

xiekunwhy commented 2 years ago

Hi @HHHit ,

Does using a large -frag value save running time, or does it just decrease the number of batches?

Best, Kun

rmhubley commented 2 years ago

The RepeatMasker batching system is designed for use with the multi-threaded option (-pa). If you have multiple cores, I would recommend using '-pa #', where the value represents the number of batches to run in parallel. Changing the batch size is fine but, depending on the search engine used, it may not have much of an impact on runtime.
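
For anyone landing here later, the two suggestions in this thread (-pa for parallel batches, -frag for larger fragments) can be combined on one command line. Below is a minimal Python sketch that only assembles and prints such a command. The helper name `build_repeatmasker_cmd`, the input file name `genome.fa`, and the job count of 8 are illustrative assumptions, not something from the thread; the -frag value is the one HHHit reported using.

```python
import os
import shlex


def build_repeatmasker_cmd(fasta, parallel_jobs=None, frag=None):
    """Assemble a RepeatMasker command line as a list of arguments.

    Hypothetical helper for illustration only; it builds the command
    but does not run RepeatMasker.
    """
    if parallel_jobs is None:
        # Assumption: default to one batch per available core.
        parallel_jobs = os.cpu_count() or 1
    # -pa: number of batches to process in parallel.
    cmd = ["RepeatMasker", "-pa", str(parallel_jobs)]
    if frag is not None:
        # -frag: larger fragments mean fewer batches.
        cmd += ["-frag", str(frag)]
    cmd.append(fasta)
    return cmd


if __name__ == "__main__":
    # "genome.fa" and 8 jobs are placeholder values.
    cmd = build_repeatmasker_cmd("genome.fa", parallel_jobs=8, frag=60000000)
    print(shlex.join(cmd))
```

Given the memory figure reported in the original question, it may be safer to start with a small -pa value and watch memory use before scaling up, since several batches running at once may need more memory than one.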