Closed boryanakis closed 5 years ago
I am having the same exact issue with Refiner, but with a much more contiguous genome. What is happening here?
I installed RepeatModeler with Bioconda, and I'm having a similar issue. Perhaps it's the install? I tried to use the diatom genome from genbank as a test and no repeats are returned.
This is surprising because it appears to have quite a few family-*.fa files, so I would assume the program is finding something and having trouble building a consensus sequence from the families?
I'm rerunning now to capture the output, but I'm getting the same error as @boryanakis.
In the meantime is there a workaround?
Sorry for the absence, folks. We are working like crazy on getting a new Dfam resource ready for the community. Let me try to tackle these. Packaging tools like Bioconda are both a blessing and a curse. Ideally we should release Bioconda/Docker packages/wrappers ourselves so we can ensure that they operate correctly. I am not sure how this Bioconda package was put together, and I would recommend installing these packages using the instructions on our site ( www.repeatmasker.org ); however, I will take a stab at what might be going wrong.
RepeatModeler Round # 1
... Program duration is 607.0 sec = 10.1 min = 0.2 hr
- Collecting repeat instances...
-- Refining Family R=75 / 0 ( RS Elements: 9111, Using 100 ):
There will be a single "-- Refining Family" statement at the end of the round for each family discovered by the method ( Round #1 is RepeatScout, Round 2 and above is RECON ).
Let me know what you find and I can suggest further things to try.
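One quick way to check this in your own log is to count the "-- Refining Family" lines. This is only a minimal sketch: it assumes the output of the run was captured to a file, and the function name count_refined and the file name run.log are illustrative, not part of RepeatModeler.

```shell
#!/bin/sh
# Count the "-- Refining Family" statements in a captured RepeatModeler log.
# Assumption: stdout/stderr of the run were redirected to the file given as $1.
count_refined() {
    # -e is needed because the pattern itself starts with dashes.
    # grep -c prints the number of matching lines and exits non-zero when
    # that number is 0, so "|| true" keeps the function usable under "set -e".
    grep -c -e '-- Refining Family' "$1" || true
}

# Example (hypothetical file name):
#   count_refined run.log
```

A count of 0 after Round 1 would suggest that RepeatScout did not hand any families to the refinement step.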
I've included my version of the run here: diatom.err.txt
The temporary directory seemed to contain everything except for the consensi.fa and the index.html.
I think you are right though, as I've tested an unofficial Docker container and the Bioconda install and had this issue with both. This led me to believe it was my genome. So today I decided to try diatom and still had this issue. Honestly, I may have had this issue after installing it myself, but with my environment as it is I'm worried there was a conflict somewhere.
I'll try installing from instructions again and test diatom to see if it completes tonight/tomorrow.
Thanks!
Ok...this is going to be quite long but I hope it will help you and others. I looked at your log file and the first thing I noticed is that the version number at the top of the file is "DEV" ( a huge warning flag ). Whoever built the Bioconda package used a non-release version of RepeatModeler! To demonstrate what a correct run should look like, I pulled down a fresh copy of RepeatModeler and ran Diatom. Since it's such a small genome the run only took a couple of hours. Here are my notes:
# Getting the latest release
% wget https://github.com/rmhubley/RepeatModeler/archive/open-1.0.11.tar.gz
# or you can get it directly from www.repeatmasker.org
# Unpack
% tar zxvf open-1.0.11.tar.gz
% cd RepeatModeler-open-1.0.11/
# Configure
% ./configure
# Check version
% ./RepeatModeler -v
RepeatModeler version open-1.0.11
# Testing on:
# ASM14940v2
# Organism name: Thalassiosira pseudonana CCMP1335 (diatoms)
# Infraspecific name: Strain: CCMP1335
# BioSample: SAMN02744045
# BioProject: PRJNA191
# Submitter: Diatom Consortium
# Date: 2009/01/16
#
% wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/149/405/GCA_000149405.2_ASM14940v2/GCA_000149405.2_ASM14940v2_genomic.fna.gz
% gunzip GCA_000149405.2_ASM14940v2_genomic.fna.gz
% ./BuildDatabase -name diatom -engine ncbi GCA_000149405.2_ASM14940v2_genomic.fna
Building database diatom:
Adding GCA_000149405.2_ASM14940v2_genomic.fna to database
Number of sequences (bp) added to database: 64 ( 32437365 bp )
% ./RepeatModeler -database diatom >& run.log
# run.log:
RepeatModeler Version open-1.0.11
================================
Search Engine = ncbi
Random Number Seed: 1547011952
Database = diatom .
- Sequences = 64
- Bases = 32437365
Using output directory = /home/rhubley/RepeatModeler-open-1.0.11/RM_83332.TueJan82132322019
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 32437365 bp ( 32272623 non ambiguous )
- Num Contigs Represented = 64
-- Running RepeatScout on the sequences...
- RepeatScout: Running build_lmer_table ( l = 14 )..
Program duration is 467.0 sec = 7.8 min = 0.1 hr
- Collecting repeat instances...
-- Refining Family R=9 / 0 ( RS Elements: 786, Using 100 ):
- numRounds = 8
- Consensus Length = 7270 ( orig = 7591 )
- Avg Kimura Divergence = 0.00
- Unaligned sequences = 0 ( orig = 0 )
Build Consensus: 0:3:36 Elapsed Time
Refinement: 00:14:37 (hh:mm:ss) Elapsed Time
-- Refining Family R=0 / 1 ( RS Elements: 700, Using 100 ):
- numRounds = 11
- Consensus Length = 5585 ( orig = 5681 )
- Avg Kimura Divergence = 0.00
- Unaligned sequences = 0 ( orig = 0 )
Build Consensus: 0:13:47 Elapsed Time
...........................
... 176 other families ...
...........................
Family Refinement: 00:29:48 (hh:mm:ss) Elapsed Time
#
# This is a really good RepeatScout run. Over 177 families were found
# in round-1 alone. This is due to two factors. The diatom genome is
# rather small so a larger proportion of the genome is sampled in this
# step. RepeatScout (current version) is really good at finding
# youngish (well conserved) repeats. Evidently Diatom has an abundant
# supply of these.
#
RepeatModeler Round # 2
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 3000000 bp
-- Running TRFMask on the sequence...
63 Tandem Repeats Masked
-- Masking repeats from the previous rounds...
- Masking 1 - 5 of 79
- Masking 16 - 30 of 79
- Masking 41 - 65 of 79
- Masking 76 - 79 of 79
-- Sample Stats:
Sample Size 3055377 bp
Num Contigs Represented = 30
Non ambiguous bp:
Initial: 3035303 bp
After Masking: 2968032 bp
Masked: 2.22 %
-- Input Database Coverage: 3055377 bp out of 32437365 bp ( 9.42 % )
Sampling Time: 00:00:10 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
2% completed, 00:1:17 (hh:mm:ss) est. time remaining.
..........................
100% completed, 00:0:00 (hh:mm:ss) est. time remaining.
...
Number of families returned by RECON: 130
Processing families with greater than 15 elements
Family Refinement: 00:00:00 (hh:mm:ss) Elapsed Time
Round Time: 00:01:32 (hh:mm:ss) Elapsed Time
#
# RECON found an additional 130 new families. Unfortunately
# the sample size in round-2 (3mb) was not sufficient to find
# any family with over 15 copies. So no new families were
# generated in this round.
#
RepeatModeler Round # 3
========================
....
Number of families returned by RECON: 893
Processing families with greater than 15 elements
Family Refinement: 00:00:00 (hh:mm:ss) Elapsed Time
Round Time: 00:11:52 (hh:mm:ss) Elapsed Time
#
# RECON found an additional 893 new families. Again,
# the sample size in round-3 (9mb) was not sufficient to find
# any family with over 15 copies. So no new families were
# generated in this round.
#
RepeatModeler Round # 4
========================
....
Number of families returned by RECON: 3014
Processing families with greater than 15 elements
Processing RECON family: 172
Processing RECON family: 7
Processing RECON family: 161
Processing RECON family: 15
Processing RECON family: 58
Processing RECON family: 507
Processing RECON family: 211
Round Time: 00:57:40 (hh:mm:ss) Elapsed Time
#
# RECON found a whopping 3014 new families relative to
# round-1 (round-2 and round-3 didn't produce any new
# families). In this round the sample was 20mb and out
# of those 3014 families only 7 had >= 15 copies. In
# a small genome like this, with relatively conserved
# families it would make more sense to lower this cutoff
# to around 4. Making this a user-defined parameter is
# something I will add to our TODO list.
#
Discovery complete: 184 families found
Classifying Repeats...
RepeatClassifier Version open-1.0.11
===============================
Search Engine = ncbi
- Looking for Simple and Low Complexity sequences..
- Looking for similarity to known repeat proteins..
- Looking for similarity to known repeat consensi..
Classification Time: 00:07:15 (hh:mm:ss) Elapsed Time
Program Time: 01:59:08 (hh:mm:ss) Elapsed Time
...
# Double check the output files
% ls -al diatom-*
-rw-r--r--. 1 rhubley repeat 168486 Jan 8 23:31 diatom-families.fa
-rw-r--r--. 1 rhubley repeat 9478410 Jan 8 23:31 diatom-families.stk
# How many FASTA sequences and Stockholm multiple alignments do we have
% fgrep -c ">" diatom-families.fa
184
% fgrep -c "# STOCKHOLM" diatom-families.stk
184
# What you should see in the temporary results directory
% ls -al RM_83332.TueJan82132322019
-rw-r--r--. 1 rhubley repeat 164076 Jan 8 23:24 consensi.fa
-rw-r--r--. 1 rhubley repeat 168486 Jan 8 23:31 consensi.fa.classified
-rw-r--r--. 1 rhubley repeat 166636 Jan 8 23:24 consensi.fa.masked
-rw-r--r--. 1 rhubley repeat 9478410 Jan 8 23:31 families-classified.stk
-rw-r--r--. 1 rhubley repeat 9465236 Jan 8 23:24 families.stk
drwxr-xr-x. 2 rhubley repeat 57344 Jan 8 22:13 round-1/
drwxr-xr-x. 7 rhubley repeat 8192 Jan 8 22:14 round-2/
drwxr-xr-x. 7 rhubley repeat 24576 Jan 8 22:26 round-3/
drwxr-xr-x. 7 rhubley repeat 53248 Jan 8 23:24 round-4/
http://www.repeatmasker.org/thalassiosira-pseudonana-RMod-1.0.11.tar.gz
I will follow up with the Bioconda folks to get the package pulled or fixed ( preferably the latter ).
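The two checks from these notes (a release version at the top of the log, and matching FASTA and Stockholm counts) can be scripted. The following is a sketch under assumptions: the function names are made up for illustration, and it assumes a non-release build shows "DEV" on the version line of the log, as in the log discussed above.

```shell
#!/bin/sh
# Sanity checks on a finished run. Function names are illustrative only.

# Succeeds (0) if the first "RepeatModeler version" line does not mention DEV.
# Returns 1 for a DEV build and 2 if no version line is found at all.
is_release_version() {
    line=$(grep -i 'RepeatModeler version' "$1" | head -n 1)
    [ -n "$line" ] || return 2
    case "$line" in
        *DEV*) return 1 ;;
        *)     return 0 ;;
    esac
}

# Succeeds if the number of FASTA records in $1 is non-zero and equals the
# number of Stockholm alignments in $2 (same counts the fgrep -c commands give).
family_counts_match() {
    fa=$(grep -c '>' "$1" || true)
    stk=$(grep -c '# STOCKHOLM' "$2" || true)
    [ "$fa" -gt 0 ] && [ "$fa" -eq "$stk" ]
}

# Example (file names from the notes above):
#   is_release_version run.log && family_counts_match diatom-families.fa diatom-families.stk
```

Both functions only read files, so they can be run on the output of a packaged install without touching the installation itself.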
This definitely makes sense. I looked for my log files, but it has been six months since I worked on this so I have deleted them. However, I do remember noticing the "DEV" in the version. I understand why we should install RM the way it is described on the webpage, but the reason we are looking for an installation through conda or Docker is that software installation is a challenge for many of us. I guess this is my way of saying "please please please try fixing the conda package instead of pulling it". Thank you for looking into it!
The conda package still installs the DEV version. @rmhubley I think it should be pulled ASAP until it's fixed.
I agree. But we don't manage that distribution. The explosion of package managers hasn't made the world a better place.
After getting the same issue as described above, I installed and configured RepeatModeler manually, and after a 20 min run I still got: "NOTE: RepeatScout did not return any models." Since my sequence is human this is very odd. What could be the problem? Could it be an issue with the configuration (I am using engine ncbi)?
In your log file, what version does it say you're running?
@astulaaa, how did you get/install RepeatModeler and which version of RepeatModeler and RepeatScout are you running? Also would you mind running the following two commands to test if your RepeatScout installation is configured correctly? The first command creates a short fasta file with a tandem sequence in it. The second command runs the RepeatScout filter-stage-1.prl script to screen this file for tandem sequences. The output should look like the one I pasted below the command:
% printf '>test\nCGCCACAACGACGCGACACACACACACCACACCACACACCACGACGCATACTACACACACACACACACACACACCACA\n' > test.fa
% /_the_location_of_RepeatScout_files_/filter-stage-1.prl test.fa
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >test: 0.91025641025641 / 0.807692307692308
1 deleted. 2 saved. 1 skipped for length.
Hello Robert,
I installed RepeatModeler by downloading open-1.0.11.tar.gz, uncompressing it, and running the included configure script. The version reported is: RepeatModeler Version open-1.0.11
RepeatScout (v1.0.5) was installed using conda and linked to the bin directory in the RepeatModeler configuration: /YYY/xxx/anaconda3/envs/RepeatScout/bin
To my shock, the RepeatScout installed with conda has no file named "filter-stage-1.prl". In response I installed RepeatScout from https://bix.ucsd.edu/repeatscout/, and running the second command gave me the error "sh: 1: trf: not found". The error message pointed to "filter-stage-1.prl line 110". So it seems there is a problem with RepeatScout.
Ah...much as I would like to love Bioconda, it has been the source of many configuration problems for us. I recommend pulling down RepeatScout from here: http://www.repeatmasker.org/RepeatScout-1.0.6.tar.gz. After you have it compiled and installed, simply rerun the RepeatModeler configure program to point to the newly installed RepeatScout. Also, I should point out that in order to run that test "trf" needs to be in your path. For instance, if you use the bash shell and you have named the TRF program "trf", then:
% printf '>test\nCGCCACAACGACGCGACACACACACACCACACCACACACCACGACGCATACTACACACACACACACACACACACCACA\n' > test.fa
% export PATH=${PATH}:_the_location_of_TRF/
% /_the_location_of_RepeatScout_files_/filter-stage-1.prl test.fa
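Before running the test it may help to confirm both prerequisites in one go. This is a rough preflight sketch, not something shipped with RepeatScout: the function names are hypothetical, and it assumes the TRF binary is named "trf" and that filter-stage-1.prl lives directly in the RepeatScout directory you pass in.

```shell
#!/bin/sh
# Preflight for the filter-stage-1.prl test. Function names are illustrative.

# $1 = directory holding the RepeatScout files.
# Reports every problem found and returns non-zero if there is at least one.
check_repeatscout_setup() {
    rs_dir=$1
    status=0
    # filter-stage-1.prl calls "trf" through the shell, so it must be on PATH.
    if ! command -v trf >/dev/null 2>&1; then
        echo "trf not found on PATH" >&2
        status=1
    fi
    if [ ! -x "$rs_dir/filter-stage-1.prl" ]; then
        echo "filter-stage-1.prl missing or not executable in $rs_dir" >&2
        status=1
    fi
    return $status
}

# Write the two-line test FASTA to $1. printf is used because it expands \n
# in every POSIX shell, whereas echo handles backslashes differently per shell.
write_test_fasta() {
    printf '>test\n%s\n' \
        "CGCCACAACGACGCGACACACACACACCACACCACACACCACGACGCATACTACACACACACACACACACACACCACA" > "$1"
}

# Example (placeholder path as in the commands above):
#   check_repeatscout_setup /_the_location_of_RepeatScout_files_ && write_test_fasta test.fa
```

If the check passes, run filter-stage-1.prl on test.fa as shown above and compare against the expected output.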
I added trf to the PATH in .bashrc and got the same output you described. I will test RepeatModeler again.
Update: this time I did not get "NOTE: RepeatScout did not return any models." after the first round; however, the end result is the same. RepeatModeler ran 5 rounds and returned:
Discovery complete: 0 families found
Program Time: 00:34:47 (hh:mm:ss) Elapsed Time
No families identified. Perhaps the database is too small or contains overly fragmented sequences.
I don't know if it is useful, but after each round except the first one I saw "0 HSPs Collected" at the end of the round.
It's hard to say without knowing more about the sequences you are feeding to RepeatModeler. Let's just make sure there aren't any other lurking problems with your installation. Why don't you do a test run on this example sequence: http://www.repeatmasker.org/~rhubley/chr7-1mb.fa.gz
Here I run it using the -srand option so that the sequence sampling reproduces exactly the run I did:
% BuildDatabase -name chr7-1mb chr7-1mb.fa
% RepeatModeler -database chr7-1mb -srand 1574962475
RepeatModeler Version open-1.0.11
================================
Search Engine = ncbi
Random Number Seed: 1574962475
Database = chr7-1mb
...
RepeatModeler Round # 1
========================
...
-- Refining Family R=1 / 0 ( RS Elements: 1314, Using 100 ):
...
-- Refining Family R=8 / 1 ( RS Elements: 22, Using 22 ):
...
-- Refining Family R=2 / 2 ( RS Elements: 19, Using 19 ):
...
-- Refining Family R=10 / 3 ( RS Elements: 19, Using 19 ):
...
RepeatModeler Round # 2
========================
...
Comparison Time: 00:00:59 (hh:mm:ss) Elapsed Time, 110110 HSPs Collected
...
Number of families returned by RECON: 297
...
Discovery complete: 11 families found
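To compare your run against these numbers, you can pull the key figures out of your own log. The sketch below assumes the log was saved to a file and that the lines are worded as in the excerpts in this thread; the function names are illustrative and not part of RepeatModeler.

```shell
#!/bin/sh
# Extract key figures from a captured RepeatModeler log (illustrative helpers).

# Print the family count from the last "discovery complete" line.
# Accepts either capitalization of "discovery", with or without a prefix.
families_found() {
    sed -n 's/.*[Dd]iscovery complete: \([0-9][0-9]*\) families found.*/\1/p' "$1" | tail -n 1
}

# Print the HSP count of every round that reports one, one number per line.
# A column of zeros means the all-vs-all comparison found nothing to cluster.
hsps_per_round() {
    sed -n 's/.* \([0-9][0-9]*\) HSPs Collected.*/\1/p' "$1"
}

# Example (hypothetical file name):
#   families_found run.log
#   hsps_per_round run.log
```

Since the run above uses a fixed -srand seed, a large difference in these figures on the same input is worth reporting along with the version line of your log.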
I used the latest versions of everything to run the chr7-1mb test (chr7-1mb.fa).
Searching for Repeats -- Sampling from the database...
ERROR from search engine (3) : 0 found in 00:00:01 (hh:mm:ss) Elapsed Time
Searching for Repeats -- Sampling from the database...
ERROR from search engine (3) WARNING: Retrying batch ( 1 ) [ 3 ]...
ERROR from search engine (3) WARNING: Retrying batch ( 1 ) [ 3 ]...
ERROR from search engine (3)
FATAL ERROR: RepeatModeler giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. Please submit the following details to the feedback page at the repeatmasker website:
http://www.repeatmasker.org
RepeatModeler Version: 2.0.4
Search Engine: rmblast [ 2.13.0+ ]
Command Line: /home/zhuchenzhao/software/RepeatModeler-2.0.4/RepeatModeler-database chr7-1mb -srand 1574962475
Batch Number: 1
Disk Space:
Filesystem 1K-blocks Used Available Use% Mounted on
10.0.0.3:/home 9374631936 1783097344 7591534592 20% /home
System Memory:
Further details about this problem may be found in the directory: /home/zhuchenzhao/dyy/hifi/D01/flye.clean.fq/RepeatMasker/RM_2351160.FriMar171006582023
I used the command RepeatModeler-2.0.4/RepeatModeler -database chr7-1mb -srand 1574962475 and it returned fewer repeat families than yours (9 versus your 11). What could be the reason?
Using output directory = /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023
Search Engine = rmblast 2.13.0+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.5, RepeatMasker 4.1.4
LTR Structural Analysis: Disabled [use -LTRStruct to enable]
Random Number Seed: 1574962475
Database = /home/hang/work/data1/fit2/repeat/chr7-1mb
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.
Searching for Repeats -- Sampling from the database...
Searching for Repeats -- Sampling from the database...
RepeatScout/RECON discovery complete: 9 families found
Program Time: 00:20:27 (hh:mm:ss) Elapsed Time
Working directory: /home/hang/work/data1/fit2/repeat/RM_958894.TueApr182017492023 may be deleted unless there were problems with the run.
The results have been saved to:
/home/hang/work/data1/fit2/repeat/chr7-1mb-families.fa - Consensus sequences for each family identified.
/home/hang/work/data1/fit2/repeat/chr7-1mb-families.stk - Seed alignments for each family identified.
/home/hang/work/data1/fit2/repeat/chr7-1mb-rmod.log - Execution log. Useful for reproducing results.
The RepeatModeler stockholm file is formatted so that it can easily be submitted to the Dfam database. Please consider contributing curated families to this open database and be a part of this growing community resource. For more information contact help@dfam.org.
Hi. I have a large reptilian genome that is fairly fragmented (stats below). It has been put through RepeatModeler once before successfully (Dec 2016). Now, I am trying to replicate a lot of the work done on the genome in preparation for annotation, and I am hitting a wall with RM.
It took over a month to run, the output is empty, and the following keeps showing up in the log:
WARNING: Refiner did not return a consensus.
Tail of output:
I am using a conda environment with the following specs:
$ conda list
# packages in environment at /MY_PATH/conda_envs/RepeatModeler:
#
Here are the summary stats for the fasta file:
I have run RM many times on many different genome assemblies, and this is the first time I have seen this behavior. Any suggestions or advice?