Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Restarting LTRHarvest #170

Open CIWa opened 2 years ago

CIWa commented 2 years ago

Hi,

I have reported this bug at www.repeatmasker.org, but was hoping to get some input from here as well. Sorry if this is therefore a duplicate.

My issue: LTRHarvest was running on a single core for me when using the binaries downloaded from the software's homepage. I therefore stopped my RepeatModeler run after 7 rounds, while LTRHarvest was running, reinstalled LTRHarvest from scratch with make threads=yes, and restarted my original RepeatModeler run. RepeatModeler then finished within a few seconds, skipping LTRHarvest and all subsequent steps, and reported:

This directory ( RM_601117.SatMay211407242022 )
appears to contain a successful run of RepeatModeler.  If this
is not the case, please report this as a bug at the RepeatMasker
website ( www.repeatmasker.org )

My steps for reproducing the issue:

  1. My original call: ~/programs/RepeatModeler-2.0.3/RepeatModeler -database genome -pa 40 -LTRStruct -genomeSampleSizeMax 729000000 &> run2.out
  2. Stop the run after 7 rounds, while LTRHarvest is running
  3. Restart RepeatModeler via ~/programs/RepeatModeler-2.0.3/RepeatModeler -database genome -recoverDir RM_601117.SatMay211407242022 -pa 40 -LTRStruct -genomeSampleSizeMax 729000000 -genometools_dir ~/programs/genometools-1.6.2/bin/ &> restartrun2.out

Full log of the re-start:

RepeatModeler Version 2.0.3
===========================
Using output directory = RM_601117.SatMay211407242022
Search Engine = rmblast 2.11.0+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.2
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever v2.9.0,
                                   Ninja 0.95-cluster_only, MAFFT 7.490,
                                   CD-HIT 4.8.1 )
Random Number Seed: 1653420641
Database = genome .
  - Sequences = 5672
  - Bases = 3713099737
  - N50 = 10757908
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  84576323-90617416 |                                                   [ 2 ]
  78535231-84576323 |                                                   [ 1 ]
  72494139-78535231 |                                                   [  ]
  66453046-72494138 |                                                   [ 2 ]
  60411954-66453046 |                                                   [  ]
  54370862-60411954 |                                                   [ 2 ]
  48329769-54370861 |                                                   [ 1 ]
  42288677-48329769 |                                                   [ 2 ]
  36247585-42288677 |                                                   [ 4 ]
  30206492-36247584 |                                                   [ 4 ]
  24165400-30206492 |                                                   [ 9 ]
  18124308-24165400 |                                                   [ 17 ]
  12083215-18124307 |                                                   [ 13 ]
  6042123-12083215  |                                                   [ 51 ]
  1031-6042123      |************************************************** [ 5564 ]

This directory ( RM_601117.SatMay211407242022 )
appears to contain a successful run of RepeatModeler.  If this
is not the case, please report this as a bug at the RepeatMasker
website ( www.repeatmasker.org )

The content of RM_601117.SatMay211407242022 (dates are not original, as this is a back-up):

  4.0K May 24 20:23 LTR_697138.TueMay240229512022
  300K May 24 20:23 round-7
  136K May 24 20:23 round-6
  64K May 24 20:23 round-5
  20K May 24 20:23 round-4
  4.0K May 24 20:23 round-3
  4.0K May 24 20:23 round-2
  248K May 24 20:23 round-1
  2.5M May 24 20:23 tmpConsensi.fa.masked
  2.5M May 24 20:23 tmpConsensi.fa
  149M May 24 20:23 families.stk
  2.5M May 24 20:23 consensi.fa
  7.8K May 24 20:23 rmod.log

Content of subdirectory of LTRHarvest:

  503 May 24 20:24 esa_index.prj
  0 May 24 20:24 ltrharvest.log
  0 May 24 20:24 ltrharvest.out
  28G May 24 20:24 esa_index.suf
  3.1G May 24 20:24 esa_index.llv
  3.5G May 24 20:24 esa_index.lcp
  23K May 24 20:23 esa_index.ssp
  886M May 24 20:23 esa_index.esq
  49K May 24 20:23 esa_index.des
  183K May 24 20:23 esa_index.md5
  45K May 24 20:23 esa_index.sds
  0 May 24 20:23 suffixerator.log
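
In the listing above, ltrharvest.log, ltrharvest.out and suffixerator.log are all 0 bytes, which may indicate that LTRHarvest had not written any results when the run was stopped. As a rough sketch, a check like the following lists zero-byte files in such a directory. The function name list_empty_files is hypothetical and not part of RepeatModeler or GenomeTools.

```shell
# Hypothetical helper (not part of RepeatModeler): print the zero-byte
# regular files that sit directly inside the given directory, sorted by name.
list_empty_files() {
    dir="$1"
    # -size 0 matches files that occupy zero blocks, i.e. empty files.
    find "$dir" -maxdepth 1 -type f -size 0 | sort
}

# Example call (the directory name is taken from the listing above):
#   list_empty_files RM_601117.SatMay211407242022/LTR_697138.TueMay240229512022
```

An empty ltrharvest.out is only a hint. It does not by itself explain why -recoverDir treats the run as complete.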

fanhuan commented 5 months ago

I have the same issue. I was running RepeatModeler with -LTRStruct, and it failed during the LTR Structural Analysis step. When I tried to resume the run with -recoverDir, it also reported that my RepeatModeler run was successful, but I do not have <database_name>-families.fa. I was able to locate the log for LTR_retriever (for me it was: RM_14.FriJan50358422024/LTR346857.TueJan90142312024/LRET***/LTR_retriever.log), and the log contains the command that was run.

Parameters: -repeatmasker /opt/RepeatMasker -blastplus /opt/rmblast/bin -cdhit_path /opt/cd-hit -trf_path /opt/trf -genome seq.fa -inharvest /opt/RM_14.FriJan50358422024/LTR_346857.TueJan90142312024/raw-struct-results.txt -noanno -threads 20

You just need to locate your LTR_retriever in order to restart it. For me it was:

/opt/LTR_retriever/LTR_retriever -repeatmasker /opt/RepeatMasker -blastplus /opt/rmblast/bin -cdhit_path /opt/cd-hit -trf_path /opt/trf -genome seq.fa -inharvest /opt/RM_14.FriJan50358422024/LTR_346857.TueJan90142312024/raw-struct-results.txt -noanno -threads 20
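
Reading the arguments back out of the log, as described above, could be sketched roughly as follows. This assumes the log contains a line starting with "Parameters:" as in the excerpt above. The function name extract_ltr_params is hypothetical, and the LTR_retriever path in the usage comment is simply the one from this comment and will differ between installations.

```shell
# Hypothetical helper: print the arguments recorded on the first
# "Parameters:" line of an LTR_retriever log, without the label itself.
extract_ltr_params() {
    log="$1"
    # Strip the "Parameters:" prefix and keep only the first matching line.
    sed -n 's/^Parameters:[[:space:]]*//p' "$log" | head -n 1
}

# Example use (paths are assumptions based on the comment above):
#   params=$(extract_ltr_params path/to/LTR_retriever.log)
#   /opt/LTR_retriever/LTR_retriever $params
# $params is left unquoted on purpose so that it splits into separate
# arguments; this breaks if any recorded path contains spaces.
```

It is probably best to run the command from the same working directory as the original run, since -genome seq.fa is a relative path.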

I think it was able to recognize what had already been done. The previous run stopped at Module 1 (after running for 12h), but this time it moved on to modules 2-5 in less than an hour. It finished, and I got results like:

  ##############################
  ####### Result files #########
  ##############################

  Table output for intact LTR-RTs (detailed info)
          seq.fa.pass.list (All LTR-RTs)
          seq.fa.nmtf.pass.list (Non-TGCA LTR-RTs)
          seq.fa.pass.list.gff3 (GFF3 format for intact LTR-RTs)

  LTR-RT library
          seq.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
          seq.fa.LTRlib.fa (All non-redundant LTR-RTs)
          seq.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

According to the RepeatModeler website, I believe a successful RepeatModeler run with the -LTRStruct option should instead produce the following:

At the successful completion of a run, three files are generated:

  <database_name>-families.fa  : Consensus sequences
  <database_name>-families.stk : Seed alignments
  <database_name>-rmod.log     : A summarized log of the run

However, after LTR_retriever finished, I still don't have <database_name>-families.fa. I do have families.stk, but not <database_name>-families.stk. The same goes for rmod.log.

Please kindly let me know how I can obtain those files.
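
For anyone comparing their own run against the three documented output files, a minimal sketch of such a check is shown below. The function name check_rm_outputs is hypothetical, and the directory in which RepeatModeler writes the final files is an assumption: pass whichever directory you expect them in.

```shell
# Hypothetical helper: report which of the three documented final outputs
# (<database_name>-families.fa, -families.stk, -rmod.log) exist and are
# non-empty in the given directory. Returns 1 if any of them is missing.
check_rm_outputs() {
    db="$1"
    dir="${2:-.}"   # directory to look in; defaults to the current one
    missing=0
    for suffix in families.fa families.stk rmod.log; do
        f="$dir/$db-$suffix"
        if [ -s "$f" ]; then
            echo "present: $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, with "genome" as the database name:
#   check_rm_outputs genome .
```

The unprefixed families.stk and rmod.log inside the RM_* directory are not counted here, since the check only looks for the prefixed names quoted from the documentation.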

My environment:

How did you install RepeatModeler? docker from TE-tools (https://github.com/Dfam-consortium/TETools)

Which version of RepeatModeler do you have? RepeatModeler-2.0.5

Which version of RepeatMasker is this RepeatModeler installation using? 4.1.6

Operating system and version: Ubuntu 22.04