Closed sherlock0088 closed 1 month ago
Hi,
The runtime of RepeatModeler is wholly dependent on the repeat content of the genome, and it is the rate-limiting step of the whole pipeline. For example, the RepeatModeler2 manual states a runtime of 37 hours and 23 minutes for O. sativa, which has a 375 Mb genome (https://www.repeatmasker.org/RepeatModeler/). De novo repeat detection is computationally intensive, which is what leads to these long runtimes.
You can try running single chromosomes at a time; however, this will bias repeat detection. De novo repeat detection works by comparing the whole input sequence to itself and identifying repeated regions. If a single chromosome is used, you are likely to detect high copy-number repeats but miss low copy-number TEs, especially those that are not present in several copies on the same chromosome. If this suits your requirements, you can run the pipeline this way, but you will need to cluster your final libraries to cross-reference families between chromosomes, as RepeatModeler family names are randomly generated.
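If you do go chromosome by chromosome, the genome first has to be split into one FASTA file per sequence. A minimal sketch of that step, assuming a plain-text multi-FASTA whose header IDs are safe to use as file names (`split_fasta` is a hypothetical helper, not part of Earl Grey or RepeatModeler):

```python
from pathlib import Path


def split_fasta(genome_path, out_dir):
    """Write each sequence of a multi-FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        with open(genome_path) as genome:
            for line in genome:
                if line.startswith(">"):
                    if handle is not None:
                        handle.close()
                    # Assumption: the first whitespace-delimited token of the
                    # header is a usable file name (e.g. "chr1").
                    seq_id = line[1:].split()[0]
                    path = out_dir / f"{seq_id}.fasta"
                    handle = open(path, "w")
                    written.append(path)
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written
```

For a fragmented assembly this would create one file per scaffold, so you may prefer to restrict it to chromosome-scale sequences.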
One way to save time doing this would be to use earlGreyLibConstruct in the latest version of Earl Grey, which will generate a library but not run the annotation step. You could then do this for each chromosome, combine the libraries, then run RepeatMasker afterwards on the whole genome to generate your final annotation.
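Combining the per-chromosome libraries could look roughly like the sketch below. It prefixes every family name with a chromosome label so that identically named families from separate runs do not collide. `merge_libraries` and the labels are hypothetical, and the paths to each chromosome's library file are whatever your runs actually produced:

```python
def merge_libraries(libraries, combined_path):
    """Concatenate per-chromosome repeat libraries into one FASTA file.

    `libraries` maps a label (e.g. "chr1") to that chromosome's library FASTA.
    Each header gets the label as a prefix; anything after the family name,
    such as a "#Class/Family" suffix, is left untouched.
    Returns the number of families written.
    """
    n_families = 0
    with open(combined_path, "w") as out:
        for label, lib_path in libraries.items():
            with open(lib_path) as lib:
                for line in lib:
                    if line.startswith(">"):
                        line = f">{label}_{line[1:]}"
                        n_families += 1
                    # Guard against a library file without a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_families
```

The combined file will still contain redundant families found on more than one chromosome, so it should be clustered before being passed to RepeatMasker through its `-lib` option.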
Thanks, I will try this and get back to you.
Closing due to lack of activity. Feel free to reopen or open a new issue if more help is needed!
Hi,
I am currently running Earl Grey on a gymnosperm genome (>10 Gbp), and the process has been running for over 7 days on 48 cores. However, our HPC has a 7-day wall-time limit, so the job could not complete.
I would appreciate any suggestions you might have for reducing the running time. Specifically, would running Earl Grey on a chromosome-by-chromosome basis affect the reliability of the final output?
Yupeng