Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Could not open *.translation file for reading! #248

Open FeelLiao opened 6 days ago

FeelLiao commented 6 days ago

Describe the issue

When I use RepeatModeler for de novo repeat discovery, the program reports that it could not open a *.translation file for reading. This file was generated in the BuildDatabase step.

I tried the Arabidopsis thaliana genome (TAIR10.1 from NCBI) and got no issues.

The genome of the species I used is about 10Gb, and I think this may be the problem.
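One way to check the size of the input is to count sequences and bases directly from the FASTA file. The following is a minimal Python sketch, not part of RepeatModeler; the function name `fasta_stats` is hypothetical, and it assumes a plain (uncompressed) FASTA file in which every non-header line is sequence.

```python
from typing import TextIO, Tuple


def fasta_stats(handle: TextIO) -> Tuple[int, int]:
    """Return (number_of_sequences, total_bases) for a FASTA stream.

    Assumes every line that does not start with '>' is sequence data.
    """
    n_seqs = 0
    n_bases = 0
    for line in handle:
        if line.startswith(">"):
            # Each header line marks the start of a new record.
            n_seqs += 1
        else:
            # Count residues only, ignoring the newline and stray whitespace.
            n_bases += len(line.strip())
    return n_seqs, n_bases
```

Called as `fasta_stats(open("sample.fa"))`, the two numbers can be compared with the "Sequences" and "Bases" lines that RepeatModeler prints at startup.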

Reproduction steps

The commands I used for the discovery are:

BuildDatabase -name lka sample.fa
nohup RepeatModeler --threads 30 -database lka &

The genome assembly I used is Larix kaempferi.

Log output

RepeatModeler Version 2.0.5
===========================
Using output directory = /mnt/annot/repeatm/RM_40.ThuJul41128262024
Search Engine = rmblast 2.14.1+
Threads = 40
Dependencies: TRF 4.09, RECON 1.08, RepeatScout 1.0.6, RepeatMasker 4.1.6
LTR Structural Analysis: Enabled ( GenomeTools 1.6.4, LTR_Retriever v2.9.0,
                                   Ninja , MAFFT 7.471,
                                   CD-HIT 4.8.1 )
Random Number Seed: 1720092502
Database = lka .
  - Sequences = 4655
  - Bases = 13492429495
  - N50 = 15986365
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  78119697-83699528 |                                                   [ 3 ]
  72539866-78119696 |                                                   [ 1 ]
  66960035-72539865 |                                                   [ 2 ]
  61380204-66960034 |                                                   [ 1 ]
  55800373-61380203 |                                                   [ 6 ]
  50220542-55800372 |                                                   [ 6 ]
  44640711-50220541 |                                                   [ 5 ]
  39060881-44640711 |                                                   [ 14 ]
  33481050-39060880 |                                                   [ 14 ]
  27901219-33481049 |                                                   [ 28 ]
  22321388-27901218 |                                                   [ 52 ]
  16741557-22321387 |*                                                  [ 99 ]
  11161726-16741556 |*                                                  [ 151 ]
  5581895-11161725  |***                                                [ 304 ]
  2065-5581895      |************************************************** [ 3969 ]

Storage Throughput = excellent ( 1483.92 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
      and the repetitive content of the sequences.  It is not imperative
      that RepeatModeler completes all rounds in order to obtain useful
      results.  At the completion of each round, the files ( consensi.fa, and
      families.stk ) found in:
      /mnt/annot/repeatm/RM_40.ThuJul41128262024/ 
      will contain all results produced thus far. These files may be 
      manually copied and run through RepeatClassifier should the program
      be terminated early.

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40007056 bp ( 40007056 non ambiguous )
   - Num Contigs Represented = 595
   - Sequence extraction : 00:00:03 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 14 )..
   - RepeatScout: Running RepeatScout.. : 2119 raw families identified
   - RepeatScout: Running filtering stage.. 1982 families remaining
   - RepeatScout: 00:03:40 (hh:mm:ss) Elapsed Time
   - Large Satellite Filtering.. : 12 found in 00:00:08 (hh:mm:ss) Elapsed Time
   - Collecting repeat instances...: 00:02:08 (hh:mm:ss) Elapsed Time
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!
Could not open lka.translation file for reading!

Environment (please include as much of the following information as you can find out):

docker

I used the Docker image of RepeatModeler called TEtools, which is maintained by the Dfam consortium. I downloaded the image with docker pull, using the latest tag.

No database indicated

/opt/RepeatModeler/RepeatModeler - 2.0.5
NAME
    RepeatModeler - Model repetitive DNA

SYNOPSIS
      RepeatModeler [-options] -database <XDF Database>
Linux cell-lab 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

FeelLiao commented 1 day ago

I found the cause.

The genome I used for de novo repeat discovery was too large (about 12GB at scaffold level). When I split the .fa file into 3 parts of about 4GB each, the issue did not show up again.
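For anyone hitting the same error, splitting a large FASTA file can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the method actually used in this thread: the function names `read_fasta` and `split_fasta`, the `.partN.fa` output naming, and the `max_bases` threshold are all hypothetical. Records (scaffolds) are kept whole, and a new part is started whenever adding the next record would push the current part past the threshold.

```python
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header_line, sequence_lines) for each record in a FASTA stream."""
    header = None
    lines: List[str] = []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None:
            lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(in_path: str, out_prefix: str, max_bases: int) -> List[str]:
    """Split a FASTA file into parts of at most roughly max_bases bases each.

    Records are never split; a record larger than max_bases gets its own part.
    Returns the list of output paths (hypothetical naming: <prefix>.partN.fa).
    """
    out_paths: List[str] = []
    out = None
    bases_in_part = 0
    with open(in_path) as handle:
        for header, seq_lines in read_fasta(handle):
            n = sum(len(s.strip()) for s in seq_lines)
            # Open the first part, or roll over when the current one would overflow.
            if out is None or (bases_in_part > 0 and bases_in_part + n > max_bases):
                if out is not None:
                    out.close()
                path = f"{out_prefix}.part{len(out_paths) + 1}.fa"
                out = open(path, "w")
                out_paths.append(path)
                bases_in_part = 0
            out.write(header)
            out.writelines(seq_lines)
            bases_in_part += n
    if out is not None:
        out.close()
    return out_paths
```

For example, `split_fasta("sample.fa", "sample", 4_500_000_000)` would write `sample.part1.fa`, `sample.part2.fa`, and so on; the exact number of parts depends on how the scaffold sizes pack under the chosen threshold. Each part could then be given to BuildDatabase in the same way as in the reproduction steps.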