TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

RepeatModeler always runs with 1 thread, does not reuse earlier round results #131

Closed mjacksonhill closed 4 days ago

mjacksonhill commented 4 weeks ago

I'm running EarlGrey on a chromosome-level plant assembly of 890 Mbp. The repeat content estimated over the first few rounds of RepeatModeler is around 60%.

I am using the latest Singularity image, built with Dfam 3.7, and executing on a Slurm cluster.

My problem is twofold. First, my cluster has a job time limit of 5 days, and at five days RepeatModeler is only halfway through round 5. When I restart the run, RepeatModeler starts over at round 1, and I can see two separate RM_* working directories created inside the RepeatModeler folder. Is there a way to make RepeatModeler reuse the results from previous runs? That's the only way I can get the analysis to finish given the 5-day job time limit.

My other problem is that RepeatModeler only ever reports that it is running with a single thread (see below). Does this apply only to the very first stage, with the supplied cores used correctly for the rest? I'm not sure it makes sense for a highly contiguous genome of this size to take this long, so I want to make sure I'm not missing something.

Round 4, which sampled 130 Mbp of the 890 Mbp genome (14.6%), took 21 hours on 64 cores with 512 GB of memory.

```
RepeatModeler Version 2.0.5
===========================
Using output directory = /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_RepeatModeler/RM_352667.MonAug120602292024
Search Engine = rmblast 2.14.1+
Threads = 1
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1723456948
Database = /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_Database/m_canadense .
  - Sequences = 26
  - Bases = 890548923
Storage Throughput = fair ( 449.61 MB/s )
```

My cluster submission command is:

```
module load singularity
cd /work/wenglab/playground/matthew/genome

sbatch -p long -t 5-0 -n 64 --mem=512gb --wrap \
"singularity run -B $(pwd):/workdir /work/wenglab/testtube/matthew/singularity/earlgrey_latest.sif \
earlGrey -g /workdir/genome_cs.unmasked.fasta -s m_canadense -o /workdir/earlgrey -m -t 64"
```

Here's the output log from the run I did.

TobyBaril commented 3 weeks ago

Hi,

The rate-limiting step of Earl Grey is indeed RepeatModeler. Whilst the runtime can be improved with more cores and better storage throughput, the biggest factor is the repeat content of the genome being analysed. RepeatModeler is a pipeline that runs several different repeat detection tools, the first of which is RepeatScout to form a starting point. This first stage is single-threaded, while the later steps are multi-threaded, so RepeatModeler can appear to be using only a single thread during the initial stage.

There used to be a -recoverDir option in legacy RepeatModeler, but this never really worked to recover an interrupted run and has (as far as I know from the developers) been deprecated in the current version. Unfortunately, this means the only option after a failed run is to restart from the beginning, as RepeatModeler samples the genome randomly each time.

As a reference point: the human genome takes several days to run through RepeatModeler on 64 high-frequency cores (~4.5GHz); some large fungal genomes I have had running for ~4-5 days on 128 cores; and the tuatara genome took ~45 days on 64 cores in early testing.

One potential solution is to modify line 143 of the earlGrey script to add -genomeSampleSizeMax 27000000, which will limit RepeatModeler to 4 rounds. Be aware, however, that due to RepeatModeler's subsampling approach this could result in missing lower-copy-number TE families, although in theory it should still find most of the relatively abundant families in the input genome; a sketch of the edit is shown below.
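To illustrate, the edited call inside the script might look something like the sketch below. This is only a sketch under assumptions: the variable names (${ProcNum}, ${OUTDIR}, ${species}) are taken from the standalone command in the next paragraph, and the exact wording of line 143 may differ in your installed copy, so check the script itself before editing.

```
# Hypothetical sketch of the RepeatModeler call inside the earlGrey script,
# with the sample-size cap appended; verify the variable names against
# line 143 of your installed copy before editing.
RepeatModeler -engine ncbi -threads ${ProcNum} \
  -database ${OUTDIR}/${species}_Database/${species} \
  -genomeSampleSizeMax 27000000
```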

Another solution would be to run RepeatModeler on its own, i.e. `RepeatModeler -engine ncbi -threads ${ProcNum} -database ${OUTDIR}/${species}_Database/${species}`, using the directories and database created by the interrupted Earl Grey run, and then to resubmit the full Earl Grey command. As long as ${OUTDIR}/${species}_Database/${species}-families.fa exists in the specified Earl Grey output directory (indicating a successful RepeatModeler run), Earl Grey will automatically skip the RepeatModeler step of the pipeline and use the existing family file. A sketch of this workflow follows.
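Purely as an illustration, using the container, paths, and resources from your submission command and log above, the standalone run might look like the sketch below. I'm assuming the container's runscript passes an arbitrary command through (as it does for earlGrey in your submission); if not, singularity exec should do the same job.

```
# Sketch only: run RepeatModeler by itself inside the same container, reusing
# the database built by the interrupted Earl Grey run (paths from the log above).
sbatch -p long -t 5-0 -n 64 --mem=512gb --wrap \
"singularity run -B $(pwd):/workdir /work/wenglab/testtube/matthew/singularity/earlgrey_latest.sif \
RepeatModeler -engine ncbi -threads 64 \
-database /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_Database/m_canadense"
```

Once /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_Database/m_canadense-families.fa exists, resubmitting your original earlGrey command should skip straight past RepeatModeler.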

mjacksonhill commented 3 weeks ago

Thanks for the insights!

Got it; so hopefully that means everything downstream is running with the 64 cores I gave it. I may try your solution, though 5 days may be enough to complete round 5, which attempts to sample half the genome.

RepeatModeler also supports the -srand flag for supplying a random seed, so in theory the random sampling can be replicated and the run continued if the seed is known and supplied along with -recoverDir.

From a cursory check, it also seems the -recoverDir flag may still be functional, so specifying -recoverDir and -srand together may allow the run to continue. The random seed is noted in the earlGrey log, so I do still have it; a sketch of such a resume attempt is below.
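For what it's worth, a resume attempt using the seed and working directory from the log above might look like the sketch below (run inside the container, as before). This is a hedged, untested sketch: whether -recoverDir actually resumes correctly in RepeatModeler 2.0.5 is exactly the open question here.

```
# Untested sketch: replay the same sampling seed ("Random Number Seed" in the
# log above) and point -recoverDir at the existing RM_* working directory.
RepeatModeler -engine ncbi -threads 64 \
-database /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_Database/m_canadense \
-srand 1723456948 \
-recoverDir /workdir/earlgrey/m_canadense_EarlGrey/m_canadense_RepeatModeler/RM_352667.MonAug120602292024
```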

TobyBaril commented 4 days ago

Hi!

Any luck with recovering the RepeatModeler run? Do you require any more help or information?

mjacksonhill commented 4 days ago

I got the run to finish after some time; now I'm waiting for the TEstrainer step to finish. I didn't bother recovering the run, but I think that to incorporate run recovery into earlGrey you would need to implement a way to pass the -srand and -recoverDir flags through to RepeatModeler from the earlGrey command. Honestly, that seems like more work than just waiting for the run to finish!