boryanakis commented 6 years ago

Hi. I have a large reptilian genome that is fairly fragmented (stats below). It has been put through RepeatModeler once before successfully (Dec 2016). Now, I am trying to replicate a lot of the work done on the genome in preparation for annotation, and I am hitting a wall with RM.

It took over a month to run, the output is empty, and the following keeps showing up in the log: WARNING: Refiner did not return a consensus.

Tail of output:

Discovery complete: 0 families found
Program Time: 809:56:23 (hh:mm:ss) Elapsed Time
Working directory:  /MY_PATH/RM_133989.MonJun181600572018
may be deleted unless there were problems with the run.
No families identified.  Perhaps the database is too small
or contains overly fragmented sequences.

I am using a conda environment with the following specs:

$ conda list
# packages in environment at /MY_PATH/conda_envs/RepeatModeler:
#

blast                     2.2.31                        1    bioconda
boost                     1.57.0                        4
bzip2                     1.0.6                         1    conda-forge
hmmer                     3.2                           0    bioconda
icu                       58.2                          0    conda-forge
libgcc                    5.2.0                         0
perl                      5.22.0.1                      0    conda-forge
perl-text-soundex         3.05               pl5.22.0.1_0    conda-forge
recon                     1.08                          0    bioconda
repeatmasker              4.0.7               pl5.22.0_11    bioconda
repeatmodeler             1.0.11               pl5.22.0_0    bioconda
repeatscout               1.0.5                         0    bioconda
rmblast                   2.2.28                        3    bioconda
trf                       4.09                          0    bioconda
zlib                      1.2.11                        0    conda-forge

Here are the summary stats for the fasta file:

Total scaffs:   878866
Total seq:  3184760166 bp
Avg. seq:   3623.72 bp
Median seq: 238.00 bp
N 50:       1875091 bp
Min seq:    133 bp
Max seq:    18780145 bp
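
For anyone who wants comparable numbers for their own assembly, here is a minimal sketch using awk and sort. The function name `fasta_stats` is made up for illustration, and N50 uses the usual definition (the length at which sequences that long or longer hold at least half of all bases). It is not necessarily the tool that produced the table above.

```shell
#!/bin/sh
# fasta_stats FILE: print basic length statistics for a FASTA file.
# Hypothetical helper for illustration; not part of RepeatModeler.
fasta_stats() {
    # Step 1: emit one sequence length per line (multi-line records are summed).
    awk '
        /^>/ { if (seen) print len; seen = 1; len = 0; next }
        { len += length($0) }
        END  { if (seen) print len }
    ' "$1" | sort -n | awk '
        # Step 2: lengths arrive sorted ascending; collect them and the total.
        { l[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no sequences found"; exit }
            # N50: walk down from the longest sequence until half the bases are covered.
            run = 0
            for (i = NR; i >= 1; i--) { run += l[i]; if (run * 2 >= total) { n50 = l[i]; break } }
            median = (NR % 2) ? l[(NR + 1) / 2] : (l[NR / 2] + l[NR / 2 + 1]) / 2
            printf "Total scaffs: %d\n", NR
            printf "Total seq:    %.0f bp\n", total
            printf "Avg. seq:     %.2f bp\n", total / NR
            printf "Median seq:   %.2f bp\n", median
            printf "N50:          %.0f bp\n", n50
            printf "Min seq:      %.0f bp\n", l[1]
            printf "Max seq:      %.0f bp\n", l[NR]
        }
    '
}
```

Calling `fasta_stats genome.fa` (the file name is a placeholder) prints the seven lines. A median far below the mean, as in the table above, points to a long tail of very short scaffolds.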

I have run RepeatModeler many times on many different genome assemblies, and this is the first time I have seen this behavior. Any suggestions or advice?

ayala-usma commented 6 years ago

I am having the same exact issue with Refiner, but with a much more contiguous genome. What is happening here?

coreywischmeyer commented 5 years ago

I installed RepeatModeler with Bioconda, and I'm having a similar issue. Perhaps it's the install? I tried to use the diatom genome from GenBank as a test, and no repeats are returned.

This is surprising because it appears to have quite a few family-*.fa files, so I would assume the program is finding something and having trouble building a consensus sequence from the families?

I'm rerunning now to capture the output, but I'm getting the same error as @boryanakis.

In the meantime is there a workaround?

rmhubley commented 5 years ago

Sorry for the absence, folks. We are working like crazy on getting a new Dfam resource ready for the community. Let me try to tackle these. Package tools like Bioconda are both a blessing and a curse. Ideally we should release the Bioconda/Docker packages/wrappers ourselves so we can ensure that they operate correctly. I am not sure how this Bioconda package was put together, and I would recommend installing these packages using the instructions on our site ( www.repeatmasker.org ); however, I will take a stab at what might be going wrong.

Did the program output ( log messages on the screen ) indicate for each round how many families were found?

RepeatModeler Round # 1

... Program duration is 607.0 sec = 10.1 min = 0.2 hr

Collecting repeat instances...
 -- Refining Family R=75 / 0 ( RS Elements: 9111, Using 100 ):

There will be a single "-- Refining Family" statement at the end of the round for each family discovered by the method ( Round 1 is RepeatScout; rounds 2 and above are RECON ).
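
If the run was redirected to a file, the per-round count can be pulled out with a few lines of awk. This is only a sketch: the function name `families_per_round` is hypothetical, and the two patterns it matches ("RepeatModeler Round #" and "-- Refining Family") are taken from the log excerpts in this thread, so other versions may word these lines differently.

```shell
#!/bin/sh
# families_per_round LOG: count "-- Refining Family" lines under each
# "RepeatModeler Round # N" heading of a saved RepeatModeler log.
# Hypothetical helper; the patterns are assumptions based on this thread.
families_per_round() {
    awk '
        # A new round starts: remember its number and start its counter at 0.
        /RepeatModeler Round #/ { round = $NF; order[++n] = round; count[round] = 0 }
        # One refinement statement per family found in the current round.
        /-- Refining Family/    { if (round != "") count[round]++ }
        END { for (i = 1; i <= n; i++) printf "round %s: %d\n", order[i], count[order[i]] }
    ' "$1"
}
```

For example, `families_per_round run.log` prints one line per round. Rounds that report 0 found no family that reached refinement.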

Next, what does the temporary output directory contain? The directory is generated automatically for each run and has a name like RM_1623.ThuMay311504482018. It should contain the files consensi.fa and families.stk, and the directories round-1/, round-2/, round-3/, etc. Check that the files are not empty.
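
A quick way to run that check is sketched below. The function name `check_rm_workdir` is made up for illustration; the file and directory names are the ones listed above, and the RM_ directory path is whatever your own run created.

```shell
#!/bin/sh
# check_rm_workdir DIR: report whether the expected RepeatModeler result
# files exist and are non-empty, and how many round-N directories exist.
# Hypothetical helper; returns non-zero if a result file is missing or empty.
check_rm_workdir() {
    dir=$1
    status=0
    for f in consensi.fa families.stk; do
        # -s is true only for files that exist and have a size greater than zero.
        if [ -s "$dir/$f" ]; then
            echo "ok: $f is present and non-empty"
        else
            echo "PROBLEM: $f is missing or empty"
            status=1
        fi
    done
    rounds=0
    for d in "$dir"/round-*/; do
        # An unmatched glob stays literal, so test each entry before counting it.
        if [ -d "$d" ]; then rounds=$((rounds + 1)); fi
    done
    echo "round directories found: $rounds"
    return $status
}
```

Usage would be something like `check_rm_workdir RM_1623.ThuMay311504482018`.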

Let me know what you find and I can suggest further things to try.

coreywischmeyer commented 5 years ago

I've included my version of the run here: diatom.err.txt

The temporary directory seemed to contain everything except for the consensi.fa and the index.html.

I think you are right though, as I've tested an unofficial Docker container and the Bioconda install and had this issue with both. This led me to believe it was my genome. So today I decided to run the diatom genome and still had this issue. Honestly, I may have had this issue after installing it myself, but with my environment as it is I'm worried there was a conflict somewhere.

I'll try installing from instructions again and test diatom to see if it completes tonight/tomorrow.

Thanks!

rmhubley commented 5 years ago

Ok...this is going to be quite long, but I hope it will help you and others. I looked at your log file, and the first thing I noticed is that the version number at the top of the file is "DEV" ( a huge warning flag ). Whoever built the Bioconda package used a non-release version of RepeatModeler! To demonstrate what a correct run should look like, I pulled down a fresh copy of RepeatModeler and ran it on the diatom genome. Since it's such a small genome, the run only took a couple of hours. Here are my notes:

# Getting the latest release 
% wget https://github.com/rmhubley/RepeatModeler/archive/open-1.0.11.tar.gz
# or you can get it directly from www.repeatmasker.org

# Unpack
% tar zxvf open-1.0.11.tar.gz
% cd RepeatModeler-open-1.0.11/

# Configure
% ./configure

# Check version
% ./RepeatModeler -v
RepeatModeler version open-1.0.11

# Testing on:
#    ASM14940v2
#    Organism name: Thalassiosira pseudonana CCMP1335 (diatoms)
#   Infraspecific name: Strain: CCMP1335
#    BioSample: SAMN02744045
#    BioProject: PRJNA191
#    Submitter: Diatom Consortium
#    Date: 2009/01/16
#
% wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/149/405/GCA_000149405.2_ASM14940v2/GCA_000149405.2_ASM14940v2_genomic.fna.gz
% gunzip GCA_000149405.2_ASM14940v2_genomic.fna.gz

% ./BuildDatabase -name diatom -engine ncbi GCA_000149405.2_ASM14940v2_genomic.fna
Building database diatom:
  Adding GCA_000149405.2_ASM14940v2_genomic.fna to database
Number of sequences (bp) added to database: 64 ( 32437365 bp )

% ./RepeatModeler -database diatom >& run.log

# run.log:
RepeatModeler Version open-1.0.11
================================
Search Engine = ncbi
Random Number Seed: 1547011952
Database = diatom .
  - Sequences = 64
  - Bases = 32437365
Using output directory = /home/rhubley/RepeatModeler-open-1.0.11/RM_83332.TueJan82132322019

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 32437365 bp ( 32272623 non ambiguous )
   - Num Contigs Represented = 64
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 14 )..
Program duration is 467.0 sec = 7.8 min = 0.1 hr
   - Collecting repeat instances...
 -- Refining Family R=9 / 0 ( RS Elements: 786, Using 100 ):
  - numRounds = 8
  - Consensus Length = 7270 ( orig = 7591 )
  - Avg Kimura Divergence = 0.00
  - Unaligned sequences = 0 ( orig = 0 )
  Build Consensus: 0:3:36 Elapsed Time
Refinement: 00:14:37 (hh:mm:ss) Elapsed Time
 -- Refining Family R=0 / 1 ( RS Elements: 700, Using 100 ):
  - numRounds = 11
  - Consensus Length = 5585 ( orig = 5681 )
  - Avg Kimura Divergence = 0.00
  - Unaligned sequences = 0 ( orig = 0 )
  Build Consensus: 0:13:47 Elapsed Time
 ...........................
 ...  176 other families ...
 ...........................
Family Refinement: 00:29:48 (hh:mm:ss) Elapsed Time

#
# This is a really good RepeatScout run.  Over 177 families were found
# in round-1 alone.  This is due to two factors.  The diatom genome is
# rather small so a larger proportion of the genome is sampled in this
# step.  RepeatScout (current version) is really good at finding 
# youngish (well conserved) repeats.  Evidently Diatom has an abundant
# supply of these.
#

RepeatModeler Round # 2
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 3000000 bp
 -- Running TRFMask on the sequence...
       63 Tandem Repeats Masked
 -- Masking repeats from the previous rounds...
  - Masking 1 - 5 of 79
  - Masking 16 - 30 of 79
  - Masking 41 - 65 of 79
  - Masking 76 - 79 of 79
 -- Sample Stats:
       Sample Size 3055377 bp
       Num Contigs Represented = 30
       Non ambiguous bp:
             Initial: 3035303 bp
             After Masking: 2968032 bp
             Masked: 2.22 % 
 -- Input Database Coverage: 3055377 bp out of 32437365 bp ( 9.42 % )
Sampling Time: 00:00:10 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
        2% completed,  00:1:17 (hh:mm:ss) est. time remaining.
         ..........................
      100% completed,  00:0:00 (hh:mm:ss) est. time remaining.
...
Number of families returned by RECON: 130
Processing families with greater than 15 elements
Family Refinement: 00:00:00 (hh:mm:ss) Elapsed Time
Round Time: 00:01:32 (hh:mm:ss) Elapsed Time

#
# RECON found an additional 130 new families.  Unfortunately
# the sample size in round-2 (3mb) was not sufficient to find
# any family with over 15 copies.  So no new families were 
# generated in this round.
#

RepeatModeler Round # 3
========================
....
Number of families returned by RECON: 893
Processing families with greater than 15 elements
Family Refinement: 00:00:00 (hh:mm:ss) Elapsed Time
Round Time: 00:11:52 (hh:mm:ss) Elapsed Time

#
# RECON found an additional 893 new families.  Again, 
# the sample size in round-3 (9mb) was not sufficient to find
# any family with over 15 copies.  So no new families were 
# generated in this round.
#

RepeatModeler Round # 4
========================
....
Number of families returned by RECON: 3014
Processing families with greater than 15 elements
Processing RECON family: 172
Processing RECON family: 7
Processing RECON family: 161
Processing RECON family: 15
Processing RECON family: 58
Processing RECON family: 507
Processing RECON family: 211
Round Time: 00:57:40 (hh:mm:ss) Elapsed Time
#
# RECON found a whopping 3014 new families relative to
# round-1 (round-2 and round-3 didn't produce any new
# families).  In this round the sample was 20mb and out
# of those 3014 families only 7 had >= 15 copies.  In
# a small genome like this, with relatively conserved 
# families it would make more sense to lower this cutoff
# to around 4.  Making this a user-defined parameter is
# something I will add to our TODO list. 
#

Discovery complete: 184 families found
Classifying Repeats...
RepeatClassifier Version open-1.0.11
===============================
Search Engine = ncbi
  - Looking for Simple and Low Complexity sequences..
  - Looking for similarity to known repeat proteins..
  - Looking for similarity to known repeat consensi..
Classification Time: 00:07:15 (hh:mm:ss) Elapsed Time
Program Time: 01:59:08 (hh:mm:ss) Elapsed Time
...

# Double check the output files
% ls -al diatom-*
-rw-r--r--. 1 rhubley repeat  168486 Jan  8 23:31 diatom-families.fa
-rw-r--r--. 1 rhubley repeat 9478410 Jan  8 23:31 diatom-families.stk

# How many FASTA sequences and Stockholm multiple alignments do we have
% fgrep -c ">" diatom-families.fa
184
% fgrep -c "# STOCKHOLM" diatom-families.stk
184

# What you should see in the temporary results directory
% ls -al RM_83332.TueJan82132322019
-rw-r--r--. 1 rhubley repeat  164076 Jan  8 23:24 consensi.fa
-rw-r--r--. 1 rhubley repeat  168486 Jan  8 23:31 consensi.fa.classified
-rw-r--r--. 1 rhubley repeat  166636 Jan  8 23:24 consensi.fa.masked
-rw-r--r--. 1 rhubley repeat 9478410 Jan  8 23:31 families-classified.stk
-rw-r--r--. 1 rhubley repeat 9465236 Jan  8 23:24 families.stk
drwxr-xr-x. 2 rhubley repeat   57344 Jan  8 22:13 round-1/
drwxr-xr-x. 7 rhubley repeat    8192 Jan  8 22:14 round-2/
drwxr-xr-x. 7 rhubley repeat   24576 Jan  8 22:26 round-3/
drwxr-xr-x. 7 rhubley repeat   53248 Jan  8 23:24 round-4/

The library I produced is available for download here:

http://www.repeatmasker.org/thalassiosira-pseudonana-RMod-1.0.11.tar.gz

rmhubley commented 5 years ago

I will follow up with the Bioconda folks to get the package pulled or fixed ( preferably the latter ).

boryanakis commented 5 years ago

This definitely makes sense. I looked for my log files, but it has been six months since I worked on this, so I have deleted them. However, I do remember noticing the "DEV" in the version. I understand why we should install RepeatModeler the way it is described on the webpage, but the reason we are looking for an installation through conda or Docker is that software installation is a challenge for many of us. I guess this is my way of saying "please please please try fixing the conda package instead of pulling it". Thank you for looking into it!

Neato-Nick commented 5 years ago

The conda package still installs the DEV version. @rmhubley I think it should be pulled ASAP until it's fixed.

rmhubley commented 5 years ago

I agree. But we don't manage that distribution. The explosion of package managers hasn't made the world a better place.

astulaaa commented 5 years ago

After getting the same issue as described above, I installed and configured RepeatModeler manually, and after 20 minutes of running I still got: "NOTE: RepeatScout did not return any models." Since my sequence is human, this is very odd. What could be the problem? Could it be some issue with the configuration (I am using the ncbi engine)?

Neato-Nick commented 5 years ago

In your log file, what version does it say you're running?
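
One way to check this from a saved log is sketched below. The function name `report_rm_version` is hypothetical. It assumes the log contains a line with "RepeatModeler Version" followed by the version string, as in the release logs pasted above, and that a non-release build shows "DEV" on that same line; the exact wording of the DEV header is an assumption, since no DEV log is quoted in this thread.

```shell
#!/bin/sh
# report_rm_version LOG: print the first "RepeatModeler Version" line of a
# log and flag non-release ("DEV") builds.
# Hypothetical helper; the DEV header format is an assumption.
# Returns 0 for a release build, 1 for a DEV build, 2 if no version line exists.
report_rm_version() {
    line=$(grep -i 'RepeatModeler version' "$1" | head -n 1) || true
    if [ -z "$line" ]; then
        echo "no version line found in $1"
        return 2
    fi
    case "$line" in
        *DEV*) echo "non-release build: $line"; return 1 ;;
        *)     echo "release build: $line" ;;
    esac
}
```

For example: `report_rm_version run.log`. On an installed copy, `RepeatModeler -v` (shown earlier in this thread) prints the version directly.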

rmhubley commented 5 years ago

@astulaaa, how did you get/install RepeatModeler and which version of RepeatModeler and RepeatScout are you running? Also would you mind running the following two commands to test if your RepeatScout installation is configured correctly? The first command creates a short fasta file with a tandem sequence in it. The second command runs the RepeatScout filter-stage-1.prl script to screen this file for tandem sequences. The output should look like the one I pasted below the command:

% printf ">test\nCGCCACAACGACGCGACACACACACACCACACCACACACCACGACGCATACTACACACACACACACACACACACCACA\n" > test.fa

% /_the_location_of_RepeatScout_files_/filter-stage-1.prl test.fa

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >test: 0.91025641025641 / 0.807692307692308
1 deleted.  2 saved. 1 skipped for length.
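
Whichever command you use to create test.fa, it is worth confirming that the file really has a header line followed by a sequence line before running the filter. Bash's builtin echo, for example, leaves \n as two literal characters unless called with -e, while some other shells expand it. The check below is a minimal sketch, and the function name `looks_like_test_fasta` is made up for illustration.

```shell
#!/bin/sh
# looks_like_test_fasta FILE: succeed only if FILE has exactly two lines,
# a ">" header followed by a sequence made of A, C, G, T (or N).
# Hypothetical helper for illustration; not part of RepeatScout.
looks_like_test_fasta() {
    # Exactly two lines: header plus one sequence line.
    [ "$(awk 'END { print NR }' "$1")" -eq 2 ] || return 1
    # Line 1 must be a FASTA header.
    sed -n '1p' "$1" | grep -q '^>' || return 1
    # Line 2 must contain only nucleotide letters.
    sed -n '2p' "$1" | grep -q '^[ACGTNacgtn][ACGTNacgtn]*$'
}
```

Example: `looks_like_test_fasta test.fa && echo "test.fa looks fine"`. If this fails, the header and sequence probably ended up on a single line.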

astulaaa commented 5 years ago

Hello Robert,

I installed RepeatModeler by downloading open-1.0.11.tar.gz, uncompressing it, and configuring it with the included configure file (RepeatModeler Version open-1.0.11). RepeatScout (v1.0.5) was installed using conda and linked to the bin directory in the RepeatModeler configuration file: /YYY/xxx/anaconda3/envs/RepeatScout/bin. To my shock, it seems that the RepeatScout installed with conda has no file "filter-stage-1.prl". In response, I installed RepeatScout from https://bix.ucsd.edu/repeatscout/, and running the second command gave me the error "sh: 1: trf: not found". The error message pointed to "filter-stage-1.prl line 110". So it seems there is a problem with RepeatScout.

rmhubley commented 5 years ago

Ah...much as I would like to love Bioconda, it has been the source of many configuration problems for us. I recommend pulling down RepeatScout from here: http://www.repeatmasker.org/RepeatScout-1.0.6.tar.gz. After you have it compiled and installed, simply rerun the RepeatModeler configure program to point to the newly installed RepeatScout. Also, I should point out that in order to run that test, "trf" needs to be in your path. For instance, if you use the bash shell and you have named the TRF program "trf", then:

% printf ">test\nCGCCACAACGACGCGACACACACACACCACACCACACACCACGACGCATACTACACACACACACACACACACACCACA\n" > test.fa

% export PATH=${PATH}:_the_location_of_TRF/

% /_the_location_of_RepeatScout_files_/filter-stage-1.prl test.fa
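
Before rerunning the test, it can help to confirm that the shell actually finds the program. The sketch below uses the standard `command -v` lookup; the function name `check_tools_on_path` is hypothetical, and the tool name passed to it should match whatever you named your TRF binary.

```shell
#!/bin/sh
# check_tools_on_path TOOL...: report where each tool resolves on PATH.
# Hypothetical helper; returns non-zero if any tool is missing.
check_tools_on_path() {
    missing=0
    for tool in "$@"; do
        # command -v prints the resolved path and fails if nothing is found.
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found: $tool -> $(command -v "$tool")"
        else
            echo "missing from PATH: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For the test above, `check_tools_on_path trf` should print a "found" line. A "missing from PATH" line matches the "sh: 1: trf: not found" error reported earlier.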

astulaaa commented 5 years ago

I added trf to the path in .bashrc and got the same output as you described. I will test RepeatModeler again.

astulaaa commented 5 years ago

Update: this time I did not get "NOTE: RepeatScout did not return any models." after the first run; however, the end result is the same. RepeatModeler ran 5 rounds and returned:

Discovery complete: 0 families found
Program Time: 00:34:47 (hh:mm:ss) Elapsed Time
No families identified. Perhaps the database is too small
or contains overly fragmented sequences.

I don't know if it is useful, but at the end of every round except the first one I saw "0 HSPs Collected".

rmhubley commented 5 years ago

It's hard to say without knowing more about the sequences you are feeding to RepeatModeler. Let's just make sure there aren't any other lurking problems with your installation. Why don't you do a test run on this example sequence: http://www.repeatmasker.org/~rhubley/chr7-1mb.fa.gz

Here I run it using the -srand option so that the sequence sampling reproduces exactly the run I did:

% BuildDatabase -name chr7-1mb chr7-1mb.fa

% RepeatModeler -database chr7-1mb -srand 1574962475

RepeatModeler Version open-1.0.11
================================
Search Engine = ncbi
Random Number Seed: 1574962475
Database = chr7-1mb
...
RepeatModeler Round # 1
========================
...
 -- Refining Family R=1 / 0 ( RS Elements: 1314, Using 100 ):
...
 -- Refining Family R=8 / 1 ( RS Elements: 22, Using 22 ):
...
 -- Refining Family R=2 / 2 ( RS Elements: 19, Using 19 ):
...
 -- Refining Family R=10 / 3 ( RS Elements: 19, Using 19 ):
...
RepeatModeler Round # 2
========================
...
Comparison Time: 00:00:59 (hh:mm:ss) Elapsed Time, 110110 HSPs Collected
...
Number of families returned by RECON: 297
...
Discovery complete: 11 families found

chenzhaozhu commented 1 year ago

I used the latest versions of everything to run the chr7-1mb test ( chr7-1mb.fa ).

RepeatModeler Round # 1

Searching for Repeats
 -- Sampling from the database...

Gathering up to 40000000 bp
Final Sample Size = 1226160 bp ( 1226160 non ambiguous )
Num Contigs Represented = 1
Sequence extraction : 00:00:01 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
RepeatScout: Running build_lmer_table ( l = 12 )..
RepeatScout: Running RepeatScout.. : 13 raw families identified
RepeatScout: Running filtering stage.. 12 families remaining
RepeatScout: 00:00:17 (hh:mm:ss) Elapsed Time

ERROR from search engine (3) : 0 found in 00:00:01 (hh:mm:ss) Elapsed Time

Collecting repeat instances... ERROR from search engine (3)
Round Time: 00:00:20 (hh:mm:ss) Elapsed Time : 0 families discovered.

RepeatModeler Round # 2

Searching for Repeats
 -- Sampling from the database...

Gathering up to 10000000 bp
Sequence extraction : 00:00:01 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
141 Tandem Repeats Masked
TRFMask time 00:00:02 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
Sample Size 1226160 bp
Num Contigs Represented = 1
Non ambiguous bp:
Initial: 1226160 bp
After Masking: 1216538 bp
Masked: 0.78 %
 -- Input Database Coverage: 1226160 bp out of 1226250 bp ( 99.99 % )
Sampling Time: 00:00:03 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 465

ERROR from search engine (3)
WARNING: Retrying batch ( 1 ) [ 3 ]...

ERROR from search engine (3)

FATAL ERROR: RepeatModeler giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. Please submit the following details to the feedback page at the repeatmasker website:

   http://www.repeatmasker.org

RepeatModeler Version: 2.0.4
Search Engine: rmblast [ 2.13.0+ ]
Command Line: /home/zhuchenzhao/software/RepeatModeler-2.0.4/RepeatModeler -database chr7-1mb -srand 1574962475
Batch Number: 1
Disk Space:
Filesystem 1K-blocks Used Available Use% Mounted on
10.0.0.3:/home 9374631936 1783097344 7591534592 20% /home

System Memory:
Further details about this problem may be found in the directory: /home/zhuchenzhao/dyy/hifi/D01/flye.clean.fq/RepeatMasker/RM_2351160.FriMar171006582023

bitahu commented 1 year ago

I used the command RepeatModeler-2.0.4/RepeatModeler -database chr7-1mb -srand 1574962475, and it returned fewer repeat families than yours ( 9 instead of 11 ). What is the possible reason?

`RepeatModeler Version 2.0.4

Using output directory = /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023
Search Engine = rmblast 2.13.0+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.5, RepeatMasker 4.1.4
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1574962475
Database = /home/hang/work/data1/fit2/repeat/chr7-1mb

Sequences = 1
Bases = 1226250
Storage Throughput = poor ( 115.29 MB/s )

NOTE: Poor storage througput will have a large impact on RepeatModeler performance. The low throughput observed above may be due to transient usage patterns on the system and may not reflect the actual system performance. Whenever possible run RepeatModeler in a directory stored on a fast local disk and not over a network filesytem.

Ready to start the sampling process.

INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.

RepeatModeler Round # 1

Searching for Repeats
 -- Sampling from the database...

Gathering up to 40000000 bp
Final Sample Size = 1226160 bp ( 1226160 non ambiguous )
Num Contigs Represented = 1
Sequence extraction : 00:00:02 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
RepeatScout: Running build_lmer_table ( l = 12 )..
RepeatScout: Running RepeatScout.. : 13 raw families identified
RepeatScout: Running filtering stage.. 12 families remaining
RepeatScout: 00:02:10 (hh:mm:ss) Elapsed Time
Large Satellite Filtering.. : 0 found in 00:00:04 (hh:mm:ss) Elapsed Time
Collecting repeat instances...: 00:00:14 (hh:mm:ss) Elapsed Time
Refinement: 00:03:03 (hh:mm:ss) Elapsed Time
Family Refinement: 00:03:03 (hh:mm:ss) Elapsed Time
Round Time: 00:05:37 (hh:mm:ss) Elapsed Time : 4 families discovered.

RepeatModeler Round # 2

Searching for Repeats
 -- Sampling from the database...

Gathering up to 10000000 bp
Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
 -- Running TRFMask on the sequence...
141 Tandem Repeats Masked
TRFMask time 00:00:18 (hh:mm:ss) Elapsed Time
 -- Masking repeats from the previous rounds...
 -- Collecting 1261 ranges...
1239 repeats masked totaling 213962 bp(s).
TE Masking time 00:00:16 (hh:mm:ss) Elapsed Time
 -- Sample Stats:
Sample Size 1226160 bp
Num Contigs Represented = 1
Non ambiguous bp:
Initial: 1226160 bp
After Masking: 1002881 bp
Masked: 18.21 %
 -- Input Database Coverage: 1226160 bp out of 1226250 bp ( 99.99 % )
Sampling Time: 00:00:41 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
- Total Comparisons = 465
6% completed, 00:6:02 (hh:mm:ss) est. time remaining.
12% completed, 00:5:57 (hh:mm:ss) est. time remaining.
18% completed, 00:5:43 (hh:mm:ss) est. time remaining.
24% completed, 00:4:58 (hh:mm:ss) est. time remaining.
30% completed, 00:4:40 (hh:mm:ss) est. time remaining.
35% completed, 00:4:21 (hh:mm:ss) est. time remaining.
40% completed, 00:4:06 (hh:mm:ss) est. time remaining.
45% completed, 00:3:30 (hh:mm:ss) est. time remaining.
50% completed, 00:3:13 (hh:mm:ss) est. time remaining.
54% completed, 00:2:51 (hh:mm:ss) est. time remaining.
59% completed, 00:2:34 (hh:mm:ss) est. time remaining.
63% completed, 00:2:21 (hh:mm:ss) est. time remaining.
67% completed, 00:2:07 (hh:mm:ss) est. time remaining.
70% completed, 00:1:56 (hh:mm:ss) est. time remaining.
74% completed, 00:1:39 (hh:mm:ss) est. time remaining.
77% completed, 00:1:25 (hh:mm:ss) est. time remaining.
80% completed, 00:1:14 (hh:mm:ss) est. time remaining.
83% completed, 00:1:02 (hh:mm:ss) est. time remaining.
85% completed, 00:0:52 (hh:mm:ss) est. time remaining.
88% completed, 00:0:44 (hh:mm:ss) est. time remaining.
90% completed, 00:0:36 (hh:mm:ss) est. time remaining.
92% completed, 00:0:29 (hh:mm:ss) est. time remaining.
93% completed, 00:0:22 (hh:mm:ss) est. time remaining.
95% completed, 00:0:17 (hh:mm:ss) est. time remaining.
96% completed, 00:0:12 (hh:mm:ss) est. time remaining.
97% completed, 00:0:08 (hh:mm:ss) est. time remaining.
98% completed, 00:0:05 (hh:mm:ss) est. time remaining.
99% completed, 00:0:02 (hh:mm:ss) est. time remaining.
99% completed, 00:0:00 (hh:mm:ss) est. time remaining.
100% completed, 00:0:00 (hh:mm:ss) est. time remaining.
Comparison Time: 00:06:44 (hh:mm:ss) Elapsed Time, 1102 HSPs Collected
- RECON: Running imagespread.. RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
- RECON: Running initial definition of elements ( eledef ).. RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
- RECON: Running re-definition of elements ( eleredef ).. RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
- RECON: Running re-definition of edges ( edgeredef ).. RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
- RECON: Running family definition ( famdef ).. RECON Elapsed: 00:00:00 (hh:mm:ss) Elapsed Time
- Obtaining element sequences
Number of families returned by RECON: 145
Processing families with greater than 15 elements
Instance Gathering: 00:00:00 (hh:mm:ss) Elapsed Time
About to run 6 refinement jobs
Refinement: 00:03:33 (hh:mm:ss) Elapsed Time
Family Refinement: 00:03:33 (hh:mm:ss) Elapsed Time
Round Time: 00:11:01 (hh:mm:ss) Elapsed Time : 5 families discovered.

RepeatScout/RECON discovery complete: 9 families found

RepeatClassifier Version 2.0.4

Looking for Simple and Low Complexity sequences..
Looking for similarity to known repeat proteins..
Looking for similarity to known repeat consensi..
Classification Time: 00:03:49 (hh:mm:ss) Elapsed Time

Program Time: 00:20:27 (hh:mm:ss) Elapsed Time
Working directory: /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023
may be deleted unless there were problems with the run.

The results have been saved to:
/home/hang/work/data1/fit2/repeat/chr7-1mb-families.fa - Consensus sequences for each family identified.
/home/hang/work/data1/fit2/repeat/chr7-1mb-families.stk - Seed alignments for each family identified.
/home/hang/work/data1/fit2/repeat/chr7-1mb-rmod.log - Execution log. Useful for reproducing results.

The RepeatModeler stockholm file is formatted so that it can easily be submitted to the Dfam database. Please consider contributing curated families to this open database and be a part of this growing community resource. For more information contact help@dfam.org.`

Dfam-consortium / RepeatModeler

Zero families found #15
