TomHarrop opened this issue 4 years ago
I'm wondering if there's anything I can do with a run of RepeatModeler that is taking a long time with eleredef. Is there a way to skip this step? Or to tell how long it's going to take, or work out how far through it is?
Unfortunately not - RECON is not actively maintained, and it would take some knowledge of the underlying algorithm to know if skipping that step is possible or to add time estimates. But it is unusual for it to take this long.
But eleredef has been running on seqnames for about two weeks, using >100 GB of RAM
508548972 HSPs Collected
500 million HSPs seems very high to me, even in round 5 - for comparison, a run we have done on hg38 found 2 million HSPs in round 5. Is your genome particularly repeat-rich or otherwise interesting?
Did other rounds find similarly high numbers of HSPs, or only this one? It is also possible you were unlucky and got a "bad" (overly complex to analyze) portion of the genome during sampling and that a new run with a different seed would be just fine.
Thanks for the reply!
Yep, it's repeat-rich. One run that did complete on another assembly of the same genome found 80% repeats, and it's a 1.2 GB genome. For that run I used RepeatModeler / RepeatMasker that I had manually installed in a Singularity container, rather than using dfam/tetools.
I don't know how many HSPs there were in round 4, because it's a -recoverDir run and I've foolishly overwritten the round 1–4 log file with this log. Is there another way to find that information in the round-4 directory?
You can currently count the number of lines in the file round-4/msps.out to get that same number.
Right, it's much lower in round 4:
$ wc -l msps.out
85469 msps.out
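If it helps to compare across rounds, the same check can be looped over every completed round. A sketch, assuming the round-N/msps.out layout RepeatModeler writes in its working directory:

```shell
# Sketch: report the HSP count (one HSP per line of msps.out) for each
# completed round in a RepeatModeler working directory.
for f in round-*/msps.out; do
    [ -e "$f" ] || continue                      # skip if no rounds have finished yet
    printf '%s\t%s HSPs\n' "${f%/msps.out}" "$(wc -l < "$f")"
done
```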
Can you post the rest of the log file (whatever you do have)? And another thought - since you are running the TETools container, can you provide the command line you used to start the container as well?
It's launched via snakemake, which activates the singularity image. Ends up being a pretty complicated command:
singularity exec \
-B ${PWD} \
-H $(mktemp -d) \
--pwd ${PWD} \
--containall --cleanenv --writable-tmpfs \
.snakemake/singularity/c473ef0ae582f9008f27da05ffafc1cf.simg \
bash -c "cd /path/to/output/095_repeatmasker/flye_assemble || exit 1 ; \
RepeatModeler \
-database flye_assemble \
-engine ncbi \
-pa 144 \
-dir /path/to/output/095_repeatmasker/flye_assemble \
-recoverDir /path/to/output/095_repeatmasker/flye_assemble" \
&> /path/to/output/logs/rm_model.flye_assemble.log
The singularity container is just a local image pulled from dfam/tetools:1.1, with TRF added like so:
# RM is configured for trf to be in /opt/trf
wget \
-O /opt/trf \
http://tandem.bu.edu/trf/downloads/trf409.linux64
chmod +x /opt/trf
# allow writing to RM Library dir
chmod -R 777 /opt/RepeatMasker/Libraries
Here's the log (minus a few thousand lines of progress output):
RepeatModeler Version 2.0.1
===========================
Search Engine = rmblast 2.10.0+
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.5, RepeatMasker 4.1.0
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1583637913
Database = flye_assemble .......
- Sequences = 61446
- Bases = 1753653027
- N50 = 87563
- Contig Histogram:
Size(bp) Count
-----------------------------------------------------------------------
2008114-2151516 | [ 1 ]
1864713-2008114 | [ ]
1721312-1864713 | [ ]
1577911-1721312 | [ ]
1434510-1577911 | [ 1 ]
1291109-1434510 | [ 1 ]
1147708-1291109 | [ 1 ]
1004307-1147708 | [ 1 ]
860906-1004307 | [ 2 ]
717505-860906 | [ 10 ]
574104-717505 | [ 21 ]
430703-574104 | [ 63 ]
287302-430703 | [ 280 ]
143901-287302 |* [ 1851 ]
500-143901 |************************************************** [ 59214 ]
** RECOVERING /path/to/output/095_repeatmasker/flye_assemble **
- Attempting to rerun round 5
- Backing up previous /path/to/output/095_repeatmasker/flye_assemble/consensi.fa file
- Backing up previous /path/to/output/095_repeatmasker/flye_assemble/round-5 directory.
- Recalculating sample size...( please be patient )
Storage Throughput = fair ( 637.82 MB/s )
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
and the repetitive content of the sequences. It is not imperative
that RepeatModeler completes all rounds in order to obtain useful
results. At the completion of each round, the files ( consensi.fa, and
families.stk ) found in:
/path/to/output/095_repeatmasker/flye_assemble/
will contain all results produced thus far. These files may be
manually copied and run through RepeatClassifier should the program
be terminated early.
RepeatModeler Round # 5
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 243000000 bp
- Sequence extraction : 00:00:14 (hh:mm:ss) Elapsed Time
-- Running TRFMask on the sequence...
14369 Tandem Repeats Masked
- TRFMask time 00:10:15 (hh:mm:ss) Elapsed Time
-- Sample Stats:
Sample Size 243005492 bp
Num Contigs Represented = 11228
Non ambiguous bp:
Initial: 243003492 bp
After Masking: 240310928 bp
Masked: 1.11 %
-- Input Database Coverage: 322096552 bp out of 1753653027 bp ( 18.37 % )
Sampling Time: 00:10:57 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
0% completed, 5357:18:04 (hh:mm:ss) est. time remaining.
...
99% completed, 00:0:20 (hh:mm:ss) est. time remaining.
100% completed, 00:0:00 (hh:mm:ss) est. time remaining.
Comparison Time: 99:20:32 (hh:mm:ss) Elapsed Time, 508548972 HSPs Collected
- RECON: Running imagespread..
RECON Elapsed: 02:32:08 (hh:mm:ss) Elapsed Time
- RECON: Running initial definition of elements ( eledef )..
RECON Elapsed: 70:19:52 (hh:mm:ss) Elapsed Time
- RECON: Running re-definition of elements ( eleredef )..
I was worried about TRF, but it looks like you took care of it. @rmhubley might have a better idea, but all I can think of now is to try it again and see if you have better luck with different samples being chosen from the genome.
Hi @jebrosen, just an update - I've tried this 14 times now with variations of this genome and I always get > 500 M HSPs at round 5. In the latest batch of attempts I filtered out contigs < 50 kb, in case the large number of short contigs was causing the issue, but I'm still seeing the same thing.
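For anyone wanting to try the same pre-filtering, here is a minimal awk sketch that drops contigs below a length cutoff (toy filenames, and a toy FASTA generated inline; 50 kb cutoff as above):

```shell
# Make a toy assembly: one 60 kb contig and one 4 bp contig (hypothetical filenames).
{ printf '>long_contig\n'
  head -c 60000 /dev/zero | tr '\0' 'A'
  printf '\n>short_contig\nACGT\n'
} > assembly.fa

# Keep only records whose concatenated sequence is >= 50 kb
# (handles multi-line FASTA records).
awk -v min=50000 '
  /^>/ { if (hdr != "" && length(seq) >= min) printf "%s\n%s\n", hdr, seq
         hdr = $0; seq = ""; next }
  { seq = seq $0 }
  END { if (hdr != "" && length(seq) >= min) printf "%s\n%s\n", hdr, seq }
' assembly.fa > assembly.50kb.fa

grep -c '^>' assembly.50kb.fa   # prints 1: only the long contig survives
```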
For me this step has been running for five days and has not completed! This should be fixed in RepeatModeler.
Hi,
I'm wondering if there's anything I can do with a run of RepeatModeler that is taking a long time with eleredef. Is there a way to skip this step? Or to tell how long it's going to take, or work out how far through it is?
I'm running RepeatModeler from the Dfam TE Tools Container (dfam/tetools:1.1) on a 1.2 GB genome like this:
But eleredef has been running on seqnames for about two weeks, using >100 GB of RAM, e.g.
(That's 17.8 thousand minutes, i.e. 1.8 weeks.)
Here's the output from RepeatModeler:
I don't really know what eleredef is doing, but seqnames is 12,315 lines and there are 6622 batch-*.fa files in the round-5 folder. There are currently 143,068 files in round-5/ele_redef_res.
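For reference, rough progress counts like these can be pulled from inside the RepeatModeler working directory with something like the following sketch (paths assume the round-5 layout described above; adjust if seqnames lives elsewhere):

```shell
# Sketch: rough progress indicators for RECON's eleredef step, run from
# the RepeatModeler working directory. Guards let this run cleanly even
# if a path is missing.
[ -e round-5/seqnames ] && wc -l < round-5/seqnames          # sequences RECON is redefining
ls round-5/batch-*.fa 2>/dev/null | wc -l                    # sampled batch files
[ -d round-5/ele_redef_res ] && find round-5/ele_redef_res -type f | wc -l   # intermediate result files
```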
Thanks!