Open Hai1983 opened 2 years ago
Hi @Hai1983,
I'm sorry to hear you had this problem. eleredef is a part of the RECON program, and we have occasionally seen it fail in similar ways before. In this case, it looks like round 6 started but did not finish.
RepeatModeler uses methods that do not analyze an entire genome at once; instead, it analyzes random samples of sequence from the genome. This could have been an unlucky run; another run of RepeatModeler may choose different samples and succeed. RepeatModeler reports a "random seed number" near the beginning of the output, which anyone can use to reproduce the same samples in a new RepeatModeler run. So, if the genome is publicly available and you can share the random seed number, we may be able to use that to find or fix this particular error in RECON.
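To illustrate why the seed alone is enough to reproduce the samples, here is a minimal sketch. It is not RepeatModeler's actual sampling code: the function name sample_windows and its parameters are hypothetical, and it simply draws window positions from a seeded pseudo-random generator in awk. Note that different awk implementations may produce different numbers for the same seed; the point is only that the same seed with the same program gives the same windows.

```shell
# Hypothetical illustration only (not RepeatModeler's implementation):
# pick N window positions from a genome of GENOME_LEN bp using a seeded
# pseudo-random generator.
# Usage: sample_windows SEED GENOME_LEN WINDOW_LEN N
sample_windows() {
  awk -v seed="$1" -v glen="$2" -v wlen="$3" -v n="$4" 'BEGIN {
    srand(seed)                      # same seed -> same sequence of rand() values
    for (i = 0; i < n; i++) {
      start = int(rand() * (glen - wlen + 1))   # 0-based window start
      printf "%d\t%d\n", start, start + wlen    # start and end of the window
    }
  }'
}
```

Calling sample_windows twice with the same seed prints the same windows, which is the property that makes a shared seed useful for debugging. Recent RepeatModeler versions also list a -srand option in their help output for supplying the seed; check RepeatModeler -help for your version.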
I have a few other suggestions which could be more immediately helpful to you.
First, you could continue without round 6. The files families.stk and consensi.fa in your list already include the results through round 5. You can run the last step of RepeatModeler, RepeatClassifier, manually:
RepeatClassifier -consensi consensi.fa -stockholm families.stk
You could also try the -recoverDir option for RepeatModeler to reuse the previous results and resume the program at a new round 6, hopefully avoiding the problem with your first run.
I hope this information helps you resolve your issue! Please let us know if you have further questions or problems.
Best, Jeb
Dear Jeb.
Thanks so much for your kind help.
I followed your suggestions.
(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatClassifier -consensi consensi.fa -stockholm families.stk)
The results:
-rw-r--r-- 1 daomhai daomhai 47189194 Jan 5 12:05 families-classified.stk
-rw-r--r-- 1 daomhai daomhai 816195 Jan 5 12:05 consensi.fa.classified
-rw-r--r-- 1 daomhai daomhai 790697 Jan 5 11:42 tmpConsensi.fa
+Now I am running RepeatMasker with these data.
(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatMasker -lib consensi.fa.classified Catfish22a.scaffolds.fa -pa 8)
+I also tried the -recoverDir option for RepeatModeler:
(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatModeler -database Catfish22a -engine ncbi -LTRStruct -pa 16 -recoverDir RM_3803372.TueJan40119342022)
I will let you know when these results come in.
+Besides that, I also tried the homology approach with RepeatMasker. First, I ran it with the default settings, because I thought the information needed to search for repetitive DNA elements in my genome was available by default in the tetools_latest.sif image (FamDB: CONS-Dfam_3.3). But it failed (results below):
(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatMasker Catfish22a.scaffolds.fa -pa 16)

file name: Catfish22a.scaffolds.fa
sequences: 471
total length: 785467626 bp (784572949 bp excl N/X-runs)
GC level: 38.96 %
bases masked: 39065869 bp ( 4.97 %)

                      number of        length   percentage
                      elements*      occupied  of sequence
SINEs:                        0          0 bp    0.00 %
  ALUs                        0          0 bp    0.00 %
  MIRs                        0          0 bp    0.00 %
LINEs:                        0          0 bp    0.00 %
  LINE1                       0          0 bp    0.00 %
  LINE2                       0          0 bp    0.00 %
  L3/CR1                      0          0 bp    0.00 %
LTR elements:                 0          0 bp    0.00 %
  ERVL                        0          0 bp    0.00 %
  ERVL-MaLRs                  0          0 bp    0.00 %
  ERV_classI                  0          0 bp    0.00 %
  ERV_classII                 0          0 bp    0.00 %
DNA elements:                 0          0 bp    0.00 %
  hAT-Charlie                 0          0 bp    0.00 %
  TcMar-Tigger                0          0 bp    0.00 %
Unclassified:                 0          0 bp    0.00 %
Total interspersed repeats:              0 bp    0.00 %
Small RNA:                 2393     180033 bp    0.02 %
Satellites:                   2        610 bp    0.00 %
Simple repeats:          703481   34328501 bp    4.37 %
Low complexity:           60307    4556725 bp    0.58 %

* most repeats fragmented by insertions or deletions have been counted as one element

The query species was assumed to be homo sapiens
RepeatMasker version 4.1.2-p1 , default mode
run with rmblastn version 2.11.0+
FamDB: CONS-Dfam_3.3
FATAL ERROR: RepeatMasker giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. ……..
Could you help me to figure out what I am doing wrong??
Sincerely, Hai
It looks like you had several different problems now! Hopefully this can help:
First, RepeatMasker uses the library of human repeats by default. To search other species, you will need the -species option, e.g. -species siluroidei.
Dfam does not currently include many curated families for fish, except in the model organism Danio rerio, but it does include uncurated models from some fish species. With the full Dfam download which includes uncurated families, you could perform a search using an ancestral clade such as Siluroidei as the 'species'. The Siluroidei suborder includes ~7000 putative repeat families from this list, from various catfish species: https://www.dfam.org/browse?clade=1489793&clade_descendants=true&include_raw=true. You could also use a more or less specific clade as appropriate.
Finally, the -lib option is used for libraries in the FASTA and HMM formats, but not Dfam.h5. Normally to install Dfam.h5 you would install it to the Libraries/ directory and re-run RepeatMasker's configure script; however, with a container this is usually more difficult.
For one possible way to use the complete Dfam.h5 and the -species option with a container, I have adapted these steps. This is similar to setting up a container installation to use RepBase RepeatMasker edition (https://github.com/Dfam-consortium/TETools#using-repbase-repeatmasker-edition).
# Run the container interactively for the following commands
# Navigate to a directory that is mapped in the container to the host system, so that these files can be reused later
$ cd /work
# Make a copy of RepeatMasker's original Libraries directory here
$ cp -r /opt/RepeatMasker/Libraries/ ./
# Replace the default Dfam.h5 and RepeatMaskerLib.h5 with the downloaded files
$ cp Dfam.h5 Libraries/
$ ln -sf Dfam.h5 Libraries/RepeatMaskerLib.h5
# Set the LIBDIR environment variable before running RepeatMasker
$ export LIBDIR=/path/to/Libraries
$ RepeatMasker -species ... genome.fa
After completing these steps the custom Libraries/ directory can be reused for future searches, by setting LIBDIR again before running RepeatMasker.
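Because a wrong path or a dangling RepeatMaskerLib.h5 link is easy to miss, a small sanity check before each run may help. This is only a sketch: check_libdir is a hypothetical helper, not part of RepeatMasker or TETools, and it only confirms that the two files named in the steps above exist in the directory.

```shell
# Hypothetical helper (not part of RepeatMasker): return 0 if DIR looks like
# a usable custom Libraries/ directory, i.e. it contains Dfam.h5 and a
# RepeatMaskerLib.h5 that resolves to a real file.
check_libdir() {
  dir=$1
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  for f in Dfam.h5 RepeatMaskerLib.h5; do
    # -e follows symlinks, so a dangling RepeatMaskerLib.h5 link fails here
    [ -e "$dir/$f" ] || { echo "missing or broken: $dir/$f" >&2; return 1; }
  done
  return 0
}
```

For example, running check_libdir "$LIBDIR" right after the export line reports a problem before a long RepeatMasker job is started. It does not verify that the files are valid libraries.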
Since these repeat families are uncurated and might not be from your particular species, the output could be lower quality and/or very different from the results of running RepeatModeler on the same species or from a hand-curated library. This would depend on how closely related your species is to the ones in Dfam, and what kinds of repeats are present.
Regards, -Jeb
Dear Jeb.
Thanks for your kind help.
daomhai@nic5-login1 ~/.RepeatMaskerCache $ ls -lt
total 0
drwxr-xr-x 2 daomhai daomhai 207 Jan 4 23:30 general
drwxr-xr-x 4 daomhai daomhai 53 Dec 31 13:17 CONS-Dfam_3.3
daomhai@nic5-login1 ~/.RepeatMaskerCache/CONS-Dfam_3.3/general $ ls -lt
total 80
-rw-r--r-- 1 daomhai daomhai 369 Dec 31 13:17 rmblastdb.log
-rw-r--r-- 1 daomhai daomhai 16384 Dec 31 13:17 is.lib.ntf
-rw-r--r-- 1 daomhai daomhai 40 Dec 31 13:17 is.lib.nto
-rw-r--r-- 1 daomhai daomhai 116 Dec 31 13:17 is.lib.not
-rw-r--r-- 1 daomhai daomhai 20480 Dec 31 13:17 is.lib.ndb
-rw-r--r-- 1 daomhai daomhai 801 Dec 31 13:17 is.lib.nhr
-rw-r--r-- 1 daomhai daomhai 3972 Dec 31 13:17 is.lib.nsq
-rw-r--r-- 1 daomhai daomhai 264 Dec 31 13:17 is.lib.nin
-rw-r--r-- 1 daomhai daomhai 16385 Dec 31 13:17 is.lib
The tetools_latest.sif image is located at: "/scratch/users/d/a/daomhai/Docker/tetools_latest.sif"
+I also tried with RepBaseRepeatMaskerEdition-20181026.tar.gz.
tar -x -f RepBaseRepeatMaskerEdition-20181026.tar.gz
And, in Libraries, I got two files:
/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries $ ls
README.RMRBSeqs RMRBSeqs.embl
singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif addRepBase.pl -libdir Libraries/
Rebuilding RepeatMaskerLib.h5 master library
Thank you in advance for your help.
Sincerely, Hai
The original libraries directory is located inside the container, at /opt/RepeatMasker/Libraries.
It seems like you are using singularity run ... for each individual command instead of an interactive session. This should still be doable! But you will also need to adjust the paths in most commands to use absolute paths. For example:
singularity run -B ... cp -r /opt/RepeatMasker/Libraries/ /scratch/ulg/bbasv/daomhai/Docker/
singularity run -B ... addRepBase.pl -libdir /scratch/ulg/bbasv/daomhai/Docker/Libraries/
and so on.
Dear Jeb.
Thanks for your kind help. I followed your advice and it works now.
export LIBDIR=/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries (**for using the custom Libraries created in the previous step)
singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatProteinMask Catfish22a.scaffolds.fa
Masking Simple Repeats...
Tandem Repeats: 772469
Masking Repeat Proteins...
WUBlastXSearchEngine::setPathToEngine( /usr/local/abblast/blastx ): Program does not exist! at /opt/RepeatMasker/RepeatProteinMask line 344.
-rw-r--r-- 1 daomhai daomhai 70230272 Jan 14 02:32 Catfish22a.scaffolds.fa.rmsimple.cat.gz
-rw-r--r-- 1 daomhai daomhai 801183214 Jan 14 02:32 Catfish22a.scaffolds.fa.rmsimple
Could you help me to figure out what I am doing wrong??
Sincerely Hai
Hi, I am glad that the first part was successful!
I think I recognize this problem: the default search engine in the RepeatProteinMask program is AB-BLAST. To instead use RMBlast for searching, you can add the arguments -e ncbi after RepeatProteinMask.
-Jeb
I also wanted to say thank you for sharing your experience so far using the singularity container. Your problems and feedback in this issue are a huge help to us to find improvements and additions to the programs and/or READMEs, for everyone to benefit from in the future.
Dear Jeb
Thanks for your kind advice (it is very useful for me).
To identify repetitive elements de novo, some researchers merged the results of two methods (RepeatModeler and Tandem Repeat Finder) to create combined TE data. Could you help me to do that?
Here is what I did:
-rw-r--r-- 1 daomhai daomhai 792038 Jan 10 07:26 consensi.fa.backup_3
drwxr-xr-x 4 daomhai daomhai 25225 Jan 9 16:14 round-6.backup_1
-rw-r--r-- 1 daomhai daomhai 792038 Jan 9 09:47 consensi.fa.backup_2
-rw-r--r-- 1 daomhai daomhai 792038 Jan 7 09:05 consensi.fa.backup_1
drwxr-xr-x 4 daomhai daomhai 2046 Jan 4 07:59 round-5
-rw-r--r-- 1 daomhai daomhai 792038 Jan 4 07:59 consensi.fa
-rw-r--r-- 1 daomhai daomhai 47093414 Jan 4 07:59 families.stk
drwxr-xr-x 4 daomhai daomhai 726 Jan 4 03:48 round-4
drwxr-xr-x 4 daomhai daomhai 182 Jan 4 02:55 round-3
drwxr-xr-x 4 daomhai daomhai 50 Jan 4 02:44 round-2
drwxr-xr-x 2 daomhai daomhai 5898 Jan 4 02:42 round-1
Then running : /tetools_latest.sif RepeatClassifier -consensi consensi.fa.backup_3 -stockholm families.stk
And running: /RepeatMasker Catfish22a.scaffolds.fa -lib consensi.fa.classified -pa 4
-rw-r--r-- 1 daomhai daomhai 801183214 Jan 8 19:23 Catfish22a.scaffolds.fa.masked
-rw-r--r-- 1 daomhai daomhai 516811427 Jan 8 19:23 Catfish22a.scaffolds.fa.cat.gz
-rw-r--r-- 1 daomhai daomhai 258514137 Jan 8 19:23 Catfish22a.scaffolds.fa.out
-rw-r--r-- 1 daomhai daomhai 2455 Jan 8 19:23 Catfish22a.scaffolds.fa.tbl

Retroelements                          164384    47570954 bp    6.06 %
   SINEs:                               19093     2952304 bp    0.38 %
   Penelope                              1732      824005 bp    0.10 %
   LINEs:                               75904    25395677 bp    3.23 %
    CRE/SLACS                               0           0 bp    0.00 %
     L2/CR1/Rex                         64237    22143970 bp    2.82 %
     R1/LOA/Jockey                        705      217557 bp    0.03 %
     R2/R4/NeSL                           256      178319 bp    0.02 %
     RTE/Bov-B                           5745     1240629 bp    0.16 %
     L1/CIN4                             1289      398085 bp    0.05 %
   LTR elements:                        69387    19222973 bp    2.45 %
     BEL/Pao                             1548      320826 bp    0.04 %
     Ty1/Copia                              0           0 bp    0.00 %
     Gypsy/DIRS1                        26721    10369213 bp    1.32 %
       Retroviral                         462      404495 bp    0.05 %
DNA transposons                        371749    90203158 bp   11.48 %
   hobo-Activator                       71468    12741978 bp    1.62 %
   Tc1-IS630-Pogo                      177225    50700216 bp    6.45 %
   En-Spm                                   0           0 bp    0.00 %
   MuDR-IS905                               0           0 bp    0.00 %
   PiggyBac                               711      180746 bp    0.02 %
   Tourist/Harbinger                    14737     2808285 bp    0.36 %
   Other (Mirage, P-element, Transib)       0           0 bp    0.00 %
Rolling-circles                         10595     2240926 bp    0.29 %
Unclassified:                          543297    85429128 bp   10.88 %
Total interspersed repeats:                     223203240 bp   28.42 %
Small RNA:                               2537     1695192 bp    0.22 %
Satellites:                             16579     2675209 bp    0.34 %
Simple repeats:                        614266    33212013 bp    4.23 %
Low complexity:                         47425     3262263 bp    0.42 %
Running Tandem Repeat Finder: /tetools_latest.sif trf Catfish22.scaffolds.fa 2 7 7 80 10 50 2000 -d -h -m
daomhai 146778113 Dec 31 20:24 Catfish22.scaffolds.fa.2.7.7.80.10.50.2000.dat
daomhai 798574228 Dec 31 20:24 Catfish22.scaffolds.fa.2.7.7.80.10.50.2000.mask
Please show me how I can merge the results from RepeatModeler and Tandem Repeat Finder to create the combined TEs data.
Sincerely Hai
That is interesting because RepeatMasker already uses TRF for identifying Simple repeats, although with different parameters. I wonder how much of a difference you will see.
If the end goal is only the masking, one option could be to do the masking from the different programs one after another on the previous output.
Combining the annotations would be more difficult. Unfortunately, the TRF output is not a popular/well-known format, and I don't have a tool to interpret or convert it handy. But to combine the two sets of results, I would first find a way to convert the TRF *.dat output into a format such as BED or GFF3 and convert the RepeatMasker output with one of the tools such as RepeatMasker/util/rmOutToGFF3.pl. Once you have both sets of results in the same format, it should be easier to then use programs such as bedtools to combine them.
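As a starting point for the TRF conversion, here is a minimal sketch in awk. It assumes the plain *.dat layout written by trf with the -d flag (a "Sequence: <name>" line for each sequence, followed by whitespace-separated rows whose first four fields are start, end, period size and copy number); it does not handle the -ngs variant. The function name trf_dat_to_bed and the name column it writes are my own choices, not an existing tool or a standard.

```shell
# Sketch, assuming the plain TRF *.dat layout produced with -d.
# Reads one or more .dat files (or stdin) and writes 4-column BED:
#   sequence, 0-based start, end, name
trf_dat_to_bed() {
  awk '
    /^Sequence:/ { seq = $2; next }          # remember the current sequence name
    # data rows have 15 fields and begin with integer start/end coordinates
    seq != "" && NF >= 15 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
      # TRF is 1-based inclusive; BED is 0-based half-open, so only start shifts
      printf "%s\t%d\t%d\tTRF_period%s_copies%s\n", seq, $1 - 1, $2, $3, $4
    }' "$@"
}

# Example call with the file from this thread:
#   trf_dat_to_bed Catfish22.scaffolds.fa.2.7.7.80.10.50.2000.dat > trf.bed
```

Once the RepeatMasker annotations are also in BED, something like cat trf.bed repeatmasker.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin should collapse overlapping intervals; bedtools merge expects sorted input. Please spot-check a few rows against the .dat file before relying on the result.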
Dear Jeb
+I ran RepeatProteinMask as:
export LIBDIR=/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries (**for using the custom Libraries created in the previous step)
singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatProteinMask -engine ncbi Catfish22a.scaffolds.fa
+After 4 days, Slurm informed me that my job was done:
Masking Simple Repeats...
Tandem Repeats: 772469
Masking Repeat Proteins...
Protein Hits = 230977
Done!
+But I only got these files:
-rw-r--r-- 1 daomhai daomhai 122420540 Jan 21 13:00 Catfish22a.scaffolds.fa.annot
-rw-r--r-- 1 daomhai daomhai 801183696 Jan 21 12:59 Catfish22a.scaffolds.fa.masked
-rw-r--r-- 1 daomhai daomhai 70230272 Jan 19 01:04 Catfish22a.scaffolds.fa.rmsimple.cat.gz
+I thought it would create the … .tbl file summarizing the results (SINEs, LINEs, ...) as normal. Please help me to figure out what I am doing wrong.
Sincerely, Hai
Hi again!
Did you mean to run RepeatMasker instead of RepeatProteinMask this time, or were you planning to combine the results somehow? The RepeatProteinMask program makes an .annot file, similar to RepeatMasker's .out file, of the known TE proteins it found, but it does not make a summary table.
But, I remembered something else: RepeatProteinMask, unlike RepeatMasker, uses a library of proteins which is part of the RepeatMasker download and is not part of Dfam or in RepBase RepeatMasker Edition. So, the custom library directory actually does not make a difference to the RepeatProteinMask program. My apologies for that confusion!
Dear Jeb.
Thanks for your information
Actually, when I read some papers, I found that the data was classified as follows:
Data from other fish (Table 1):
Type      Repbase TEs   TE Proteins   De novo     Combined TEs
DNA       78517890      25173484      154655800   172830141
LINE      20805206      13034681      49750580    63175319
SINE      4204313       0             1564922     5467949
LTR       15611749      7825615       38165345    48847192
Other     20723         0             0           20723
Unknown   0             0             15677460    15677460
Total     107682999     46004241      227613848   242468273
Data from Monkey (Table 2):
Type           Repeat Size (bp)   % of genome
Trf            188,281,472        6.20
Repeatmasker   1,343,796,331      32.46
Proteinmask    349,457,816        11.50
De novo        1,338,423,791      44.05
Total          1,380,391,966      50.81
Therefore, I would like to classify my repetitive elements like these.
+For de novo, I did as follows:
1) RepeatModeler -database Catfish22a -engine ncbi -LTRStruct
2) RepeatClassifier -consensi consensi.fa.backup_3 -stockholm families.stk
3) RepeatMasker Catfish22a.scaffolds.fa -lib consensi.fa.classified
+For homology, I did as follows:
1) export LIBDIR=PATH...../RepeatMakerHomology/Libraries (custom libraries with the downloaded Dfam.h5)
2) RepeatMasker Catfish22a.scaffolds.fa -species siluroidei
So, I think I have the Repbase TEs and De novo data.
Now, I want to know how to get the TE Proteins information (this is why I ran RepeatProteinMask) and the Combined TEs (as in Table 1), and also the Trf and Total information (as in Table 2).
Could you help me to clarify this information and how to get it?
Thank you in advance.
Sincerely, Hai
Hi again, and sorry for the delay in this reply!
> Now, I want to know the information of TE Proteins ?? (so I run RepeatProteinMask ) and Combined TEs ?? (as a same with in Table 1). And, information about Trf ?? and Total ?? ( as a same in Table 2).
> Could you help me to clarify this information and how to get them??
Regretfully, I don't think I will be able to effectively help you with this aspect of the question. Combining the results of the different analyses should be done somewhat carefully, since the same regions of genomic sequence could be labeled by more than one method - and simply adding the totals together might double-count those sequences and overestimate the total. In this situation, I would usually refer to the Methods, Supplementary Materials, or source code for the papers in question to find the particular strategy they used to combine their results.
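To make the double-counting concern concrete, here is a minimal sketch (not the method used by any of the papers in question) that counts each base only once when several annotation sets are combined. It assumes the annotations are already available as tab-separated, BED-like lines (sequence name, 0-based start, end); the function name covered_bases is hypothetical.

```shell
# Sketch: number of bases covered by at least one interval.
# Input: tab-separated lines "seq<TAB>start<TAB>end" from files or stdin.
covered_bases() {
  cat "$@" |
    LC_ALL=C sort -k1,1 -k2,2n |
    awk -F'\t' '
      $1 != seq || ($2 + 0) > stop {       # new sequence, or a gap after the open run
        if (seq != "") total += stop - start
        seq = $1; start = $2 + 0; stop = $3 + 0
        next
      }
      ($3 + 0) > stop { stop = $3 + 0 }    # overlapping or adjacent: extend the run
      END {
        if (seq != "") total += stop - start
        printf "%.0f\n", total
      }'
}
```

For instance, the intervals s1:0-100, s1:150-200, s1:50-120 and s2:0-10 sum to 230 bp when their lengths are simply added, but cover only 180 distinct bases. Whether a published "Combined" or "Total" figure was computed this way is something only that paper's methods can confirm.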
Dear Jeb.
Thank you very much for your information. In my case:
+For de novo, I did as follows: 1) RepeatModeler 2) RepeatClassifier
The results:
file name: Catfish22a.scaffolds.fa
sequences: 471
total length: 785467626 bp (784572949 bp excl N/X-runs)
GC level: 38.96 %
bases masked: 307322087 bp ( 39.13 %)
+For homology, I did as follows:
1) export LIBDIR=PATH...../RepeatMakerHomology/Libraries (custom libraries with the downloaded Dfam.h5)
2) RepeatMasker Catfish22a.scaffolds.fa -species siluroidei
The results:
file name: Catfish22a.scaffolds.fa
sequences: 471
total length: 785467626 bp (784572949 bp excl N/X-runs)
GC level: 38.96 %
bases masked: 329962716 bp ( 42.01 %)
There is a difference between homology (42.01 %) and de novo (39.13 %). Which result (homology or de novo) should I use to represent the repeat content of my genome?
Sincerely Hai
Dear everyone.
I tried to run RepeatModeler with Singularity.
#!/bin/bash
#SBATCH --time=48:00:00 # hh:mm:ss
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem-per-cpu=4000 # megabytes
singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatModeler -database Catfish22 -engine ncbi -pa 16
And I faced this issue:
Comparison Time: 05:27:20 (hh:mm:ss) Elapsed Time, 928802 HSPs Collected
Please help me figure out what I am doing wrong. Because this is round 6, may I use this result for the other steps?
Sincerely, Hai