Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

eleredef - long runtime #72

Open TomHarrop opened 4 years ago

TomHarrop commented 4 years ago

Hi,

I'm wondering if there's anything I can do with a run of RepeatModeler that is taking a long time with eleredef. Is there a way to skip this step? Or to tell how long it's going to take, or work out how far through it is?

I'm running RepeatModeler from the Dfam TE Tools Container (dfam/tetools:1.1) on a 1.2 GB genome like this:

RepeatModeler \
    -database flye_assemble \
    -engine ncbi \
    -pa 144 \
    -dir /path/to/output/095_repeatmasker/flye_assemble \
    -recoverDir /path/to/output/095_repeatmasker/flye_assemble

But eleredef has been running on seqnames for about two weeks, using >100 GB of RAM, e.g.

209093 tomharr+  20   0  137.6g 137.6g    340 R  97.6 13.7  17864:40 eleredef     

(That's 17.8 thousand minutes, i.e. 1.8 weeks.)
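(For reference, top's TIME+ column is cumulative CPU time in minutes:seconds, so the conversion works out like this:)

```shell
# 17864 CPU-minutes -> days (top's TIME+ column is mm:ss of CPU time).
echo "17864 60 24" | awk '{printf "%.1f days\n", $1/($2*$3)}'
```

That's about 12.4 days of CPU time on a single core, i.e. roughly 1.8 weeks.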

Here's the output from RepeatModeler:

Comparison Time: 99:20:32 (hh:mm:ss) Elapsed Time, 508548972 HSPs Collected
  - RECON: Running imagespread..
RECON Elapsed: 02:32:08 (hh:mm:ss) Elapsed Time
  - RECON: Running initial definition of elements ( eledef )..
RECON Elapsed: 70:19:52 (hh:mm:ss) Elapsed Time
  - RECON: Running re-definition of elements ( eleredef )..

I don't really know what eleredef is doing, but seqnames is 12,315 lines and there are 6622 batch-*.fa files in the round-5 folder. There are currently 143,068 files in round-5/ele_redef_res.
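One crude way I can think of to watch progress is to compare counts in the working directory. This sketch only assumes (unverified) that files accumulate in round-5/ele_redef_res as eleredef works; it mocks the directory layout so it runs anywhere — in a real run, point the paths at the actual round-5/ directory instead:

```shell
# Mock a RECON round-5 layout (real runs: use the actual round-5/ path).
workdir=$(mktemp -d)
mkdir -p "$workdir/round-5/ele_redef_res"
touch "$workdir/round-5/ele_redef_res/ele-1" \
      "$workdir/round-5/ele_redef_res/ele-2" \
      "$workdir/round-5/ele_redef_res/ele-3"
printf 'seq%d\n' 1 2 3 4 5 > "$workdir/round-5/seqnames"

# Snapshot the counts; re-running this over time at least shows whether
# the result directory is still growing, i.e. eleredef is still working.
done_count=$(ls "$workdir/round-5/ele_redef_res" | wc -l)
total=$(wc -l < "$workdir/round-5/seqnames")
echo "results so far: $done_count (seqnames lines: $total)"
rm -rf "$workdir"
```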

Thanks!

jebrosen commented 4 years ago

> I'm wondering if there's anything I can do with a run of RepeatModeler that is taking a long time with eleredef. Is there a way to skip this step? Or to tell how long it's going to take, or work out how far through it is?

Unfortunately not - RECON is not actively maintained, and it would take some knowledge of the underlying algorithm to know if skipping that step is possible or to add time estimates. But it is unusual for it to take this long.

> But eleredef has been running on seqnames for about two weeks, using >100 GB of RAM

> 508548972 HSPs Collected

500 million HSPs seems very high to me, even in round 5 - for comparison, a run we have done on hg38 found 2 million HSPs in round 5. Is your genome particularly repeat-rich or otherwise interesting?

Did other rounds find similarly high numbers of HSPs, or only this one? It is also possible you were unlucky and got a "bad" (overly complex to analyze) portion of the genome during sampling and that a new run with a different seed would be just fine.
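A rerun with a fresh seed might look like the sketch below. The `-srand` flag is my assumption for this RepeatModeler version (check `RepeatModeler -help` on your install), and the output directory name is hypothetical:

```shell
# Hypothetical restart with a new seed so different samples are drawn.
# `-srand` and the _rerun directory name are assumptions, not from the
# original command; verify the flag with `RepeatModeler -help`.
RepeatModeler \
    -database flye_assemble \
    -engine ncbi \
    -pa 144 \
    -srand 42 \
    -dir /path/to/output/095_repeatmasker/flye_assemble_rerun
```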

TomHarrop commented 4 years ago

Thanks for the reply!

Yep, it's repeat rich. One run that did complete on another assembly of the same genome found 80% repeats, and it's a 1.2 GB genome. For that run I used RepeatModeler / RepeatMasker that I had manually installed in a Singularity container, rather than using dfam/tetools.

I don't know how many HSPs there were in round 4, because it's a -recoverDir run and I've foolishly overwritten the round 1–4 log file with this log. Is there another way to find that information in the round-4 directory?

jebrosen commented 4 years ago

You can currently count the number of lines in the file round-4/msps.out to get that same number.
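Extended to all rounds, that count looks like the sketch below (mocked directory layout so it runs anywhere; in a real run, drop the mocking and run the loop from the RepeatModeler output directory):

```shell
# Mocked layout for illustration; each line of a round's msps.out is one HSP.
workdir=$(mktemp -d)
mkdir -p "$workdir/round-4" "$workdir/round-5"
seq 1 4 > "$workdir/round-4/msps.out"
seq 1 9 > "$workdir/round-5/msps.out"

# Per-round HSP counts -- a sudden jump flags the expensive round.
out=$(cd "$workdir" && for f in round-*/msps.out; do
  printf '%s: %s HSPs\n' "${f%/msps.out}" "$(wc -l < "$f")"
done)
echo "$out"
rm -rf "$workdir"
```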

TomHarrop commented 4 years ago

Right, it's much less in round 4.

$ wc -l msps.out
85469 msps.out

jebrosen commented 4 years ago

Can you post the rest of the log file (whatever you do have)? And another thought - since you are running the TETools container, can you provide the command line you used to start the container as well?

TomHarrop commented 4 years ago

It's launched via snakemake, which activates the singularity image. Ends up being a pretty complicated command:

singularity exec \
  -B ${PWD} \
  -H $(mktemp -d) \
  --pwd ${PWD} \
  --containall --cleanenv --writable-tmpfs \
  .snakemake/singularity/c473ef0ae582f9008f27da05ffafc1cf.simg \
    cd /path/to/output/095_repeatmasker/flye_assemble || exit 1 ; \
    RepeatModeler \
    -database flye_assemble \
    -engine ncbi \
    -pa 144 \
    -dir /path/to/output/095_repeatmasker/flye_assemble \
    -recoverDir /path/to/output/095_repeatmasker/flye_assemble \
&> /path/to/output/logs/rm_model.flye_assemble.log

The singularity container is just a local image pulled from dfam/tetools:1.1, with TRF added like so:

    # RM is configured for trf to be in /opt/trf
    wget \
        -O /opt/trf \
        http://tandem.bu.edu/trf/downloads/trf409.linux64
    chmod +x /opt/trf

    # allow writing to RM Library dir
    chmod -R 777 /opt/RepeatMasker/Libraries

Here's the log (minus a few thousand lines showing the progress)

RepeatModeler Version 2.0.1
===========================
Search Engine = rmblast 2.10.0+
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.5, RepeatMasker 4.1.0
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1583637913
Database = flye_assemble .......
  - Sequences = 61446
  - Bases = 1753653027
  - N50 = 87563
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  2008114-2151516 |                                                   [ 1 ]
  1864713-2008114 |                                                   [  ]
  1721312-1864713 |                                                   [  ]
  1577911-1721312 |                                                   [  ]
  1434510-1577911 |                                                   [ 1 ]
  1291109-1434510 |                                                   [ 1 ]
  1147708-1291109 |                                                   [ 1 ]
  1004307-1147708 |                                                   [ 1 ]
  860906-1004307  |                                                   [ 2 ]
  717505-860906   |                                                   [ 10 ]
  574104-717505   |                                                   [ 21 ]
  430703-574104   |                                                   [ 63 ]
  287302-430703   |                                                   [ 280 ]
  143901-287302   |*                                                  [ 1851 ]
  500-143901      |************************************************** [ 59214 ]

** RECOVERING /path/to/output/095_repeatmasker/flye_assemble **

  - Attempting to rerun round 5
  - Backing up previous /path/to/output/095_repeatmasker/flye_assemble/consensi.fa file
  - Backing up previous /path/to/output/095_repeatmasker/flye_assemble/round-5 directory.
  - Recalculating sample size...( please be patient )
Storage Throughput = fair ( 637.82 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
      and the repetitive content of the sequences.  It is not imperative
      that RepeatModeler completes all rounds in order to obtain useful
      results.  At the completion of each round, the files ( consensi.fa, and
      families.stk ) found in:
      /path/to/output/095_repeatmasker/flye_assemble/ 
      will contain all results produced thus far. These files may be 
      manually copied and run through RepeatClassifier should the program
      be terminated early.

RepeatModeler Round # 5
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 243000000 bp
   - Sequence extraction : 00:00:14 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
       14369 Tandem Repeats Masked
   - TRFMask time 00:10:15 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
       Sample Size 243005492 bp
       Num Contigs Represented = 11228
       Non ambiguous bp:
             Initial: 243003492 bp
             After Masking: 240310928 bp
             Masked: 1.11 % 
 -- Input Database Coverage: 322096552 bp out of 1753653027 bp ( 18.37 % )
Sampling Time: 00:10:57 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
        0% completed,  5357:18:04 (hh:mm:ss) est. time remaining.

...

       99% completed,  00:00:20 (hh:mm:ss) est. time remaining.
      100% completed,  00:00:00 (hh:mm:ss) est. time remaining.
Comparison Time: 99:20:32 (hh:mm:ss) Elapsed Time, 508548972 HSPs Collected
  - RECON: Running imagespread..
RECON Elapsed: 02:32:08 (hh:mm:ss) Elapsed Time
  - RECON: Running initial definition of elements ( eledef )..
RECON Elapsed: 70:19:52 (hh:mm:ss) Elapsed Time
  - RECON: Running re-definition of elements ( eleredef )..

jebrosen commented 4 years ago

I was worried about TRF, but it looks like you took care of it. @rmhubley might have a better idea, but all I can think of now is to try it again and see if you have better luck with different samples being chosen from the genome.

TomHarrop commented 4 years ago

Hi @jebrosen, just an update - I've tried this 14 times now with variations of this genome and I always get > 500 M HSPs at round 5. In the latest batch of attempts I filtered out contigs < 50 kb, in case the large number of short contigs was causing the issue, but I'm still seeing the same thing.
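For reference, that length filter can be done with a few lines of awk (a sketch on a tiny mocked FASTA; dedicated tools like `seqkit seq -m 50000` do the same job):

```shell
# Tiny mocked FASTA: one 60 bp sequence and one 4 bp sequence.
fasta=$(mktemp)
{
  echo ">long"
  printf 'A%.0s' $(seq 1 60); echo
  echo ">short"
  echo "ACGT"
} > "$fasta"

# Keep records whose total sequence length is >= min (50 here; use
# 50000 for the 50 kb cutoff mentioned above).
filtered=$(awk -v min=50 '
  /^>/ { if (seq != "" && length(seq) >= min) printf "%s\n%s\n", hdr, seq
         hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "" && length(seq) >= min) printf "%s\n%s\n", hdr, seq }
' "$fasta")
echo "$filtered"
rm -f "$fasta"
```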

liufuyan2016 commented 3 years ago

This step has been running for five days on my data and still hasn't completed! RepeatModeler should be updated to address this.