Describe the issue
When I run RepeatModeler for de novo repeat discovery, it reports that it could not open a *.translation file for reading. This file was generated in the BuildDatabase step.
I tried the Arabidopsis thaliana genome (TAIR10.1 from NCBI) and had no issues.
The genome of the species I used is about 10 Gb, and I suspect that its size is the problem.
Reproduction steps
The genome assembly I used for the program is Larix kaempferi.
Log output
RepeatModeler Version 2.0.5
===========================
Using output directory = /mnt/annot/repeatm/RM_40.ThuJul41128262024
Search Engine = rmblast 2.14.1+
Threads = 40
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
Ninja , MAFFT 7.471,
CD-HIT 4.8.1 )
Random Number Seed: 1720092502
Database = lka .
- Sequences = 4655
- Bases = 13492429495
- N50 = 15986365
- Contig Histogram:
Size(bp) Count
-----------------------------------------------------------------------
78119697-83699528 | [ 3 ]
72539866-78119696 | [ 1 ]
66960035-72539865 | [ 2 ]
61380204-66960034 | [ 1 ]
55800373-61380203 | [ 6 ]
50220542-55800372 | [ 6 ]
44640711-50220541 | [ 5 ]
39060881-44640711 | [ 14 ]
33481050-39060880 | [ 14 ]
27901219-33481049 | [ 28 ]
22321388-27901218 | [ 52 ]
16741557-22321387 |* [ 99 ]
11161726-16741556 |* [ 151 ]
5581895-11161725 |*** [ 304 ]
2065-5581895 |************************************************** [ 3969 ]
Storage Throughput = excellent ( 1483.92 MB/s )
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
and the repetitive content of the sequences. It is not imperative
that RepeatModeler completes all rounds in order to obtain useful
results. At the completion of each round, the files ( consensi.fa, and
families.stk ) found in:
/mnt/annot/repeatm/RM_40.ThuJul41128262024/
will contain all results produced thus far. These files may be
manually copied and run through RepeatClassifier should the program
be terminated early.
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 40007056 bp ( 40007056 non ambiguous )
- Num Contigs Represented = 595
- Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
-- Running RepeatScout on the sequences...
- RepeatScout: Running build_lmer_table ( l = 14 )..
- RepeatScout: Running RepeatScout.. : 2119 raw families identified
- RepeatScout: Running filtering stage.. 1982 families remaining
- RepeatScout: 00:03:40 (hh:mm:ss) Elapsed Time
- Large Satellite Filtering.. : 12 found in 00:00:08 (hh:mm:ss) Elapsed Time
- Collecting repeat instances...: 00:02:08 (hh:mm:ss) Elapsed Time
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Environment (please include as much of the following information as you can find out):
docker
How did you install RepeatModeler? e.g. manual installation from repeatmasker.org, bioconda, the Dfam TE Tools container, or as part of another bioinformatics tool?
I used a Docker image of RepeatModeler called TEtools, which is maintained by Dfam-consortium. I downloaded the image with the docker pull command, using the latest tag.
Which version of RepeatModeler do you have? The output of RepeatModeler without any options will be a help page with the version of the program displayed at the top.
No database indicated
/opt/RepeatModeler/RepeatModeler - 2.0.5
NAME
RepeatModeler - Model repetitive DNA
SYNOPSIS
RepeatModeler [-options] -database <XDF Database>
Operating system and version. The output of uname -a and lsb_release -a can be used to find this.
Linux cell-lab 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
The genome I used for de novo repeat discovery was too large (about 12 GB at the scaffold level). When I split the .fa file into 3 parts of about 4 GB each, the issue did not show up again.
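For anyone who needs the same workaround, here is a minimal Python sketch of one way to split a FASTA file into a few parts of roughly equal size without cutting a sequence in two. It is not necessarily how the file was split here; the function name `split_fasta` and the `.partN.fa` output naming are made up for illustration, and `n_parts` is assumed to be at least 1.

```python
import os
from pathlib import Path


def split_fasta(fasta_path, n_parts, out_dir):
    """Split a FASTA file into at most n_parts files, keeping each record whole."""
    fasta_path = Path(fasta_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Target size per part in bytes, rounded up so the parts cover the whole file.
    target = -(-os.path.getsize(fasta_path) // n_parts)

    paths = []
    out = None
    written = 0
    with open(fasta_path, "rb") as src:
        for line in src:
            # Only switch to a new part at a record header (">"), so that
            # no sequence is cut in two, and never open more than n_parts files.
            starts_record = line.startswith(b">")
            if out is None or (
                starts_record and written >= target and len(paths) < n_parts
            ):
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: <input stem>.part1.fa, .part2.fa, ...
                part = out_dir / f"{fasta_path.stem}.part{len(paths) + 1}.fa"
                out = open(part, "wb")
                paths.append(part)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

Calling `split_fasta("genome.fa", 3, "parts")` (with `genome.fa` standing in for the real assembly) returns the list of part files. Because parts only change at record boundaries, the sizes are approximate, and each part would then go through BuildDatabase and RepeatModeler separately.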