Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Unexpected filename for NINJA #213

Closed liamfriar closed 10 months ago

liamfriar commented 11 months ago

Describe the issue

NINJA installation has Ninja_new and Ninja_old programs, but no Ninja

Reproduction steps

I installed RepeatModeler v2.0.2a with mamba:

mamba install -c bioconda repeatmodeler

RepeatModeler ran fine immediately, which is awesome:

RepeatModeler -pa 8 -engine ncbi -database $prefix 2>&1 | tee ${prefix}_repeatmodeler.log

When I added the -LTRStruct flag, I got the following error:

LTRPipeline dependency missing or incorrectly set for NINJA_DIR!
   Rerun ./configure or check your command line to ensure that RepeatModeler
  has access to and the correct version of this dependency.

I ran mamba list in my environment and discovered that no NINJA package is installed (so that's a problem for the conda/mamba people, I think).

#113 seemed to have the same issue.

I installed the appropriate version of NINJA:

wget https://github.com/TravisWheelerLab/NINJA/archive/refs/tags/0.95-cluster_only.tar.gz
tar -xf 0.95-cluster_only.tar.gz
rm 0.95-cluster_only.tar.gz
cd NINJA-0.95-cluster_only/NINJA
chmod +x * #Not sure if this was necessary

I then ran:

RepeatModeler -pa 8 -LTRStruct -ninja_dir $ninja_dir -engine ncbi -database $prefix 2>&1 | tee ${prefix}_repeatmodeler.log

and got the same error.

I then realized there is no file called simply Ninja, so I renamed one:

mv Ninja_new Ninja

Now RepeatModeler runs fine, including what looks like a successful run of the LTR pipeline. Nothing was found (which is neither expected nor unexpected), so it's not totally clear whether the run was successful:

LTR Structural Analysis
=======================
Running LtrHarvest...     : 00:00:02 (hh:mm:ss) Elapsed Time
Running Ltr_retriever...LTRPipeline: No results after LTR_Retriever filtering.   
LTRPipeline Time: 00:00:05 (hh:mm:ss) Elapsed Time

I am not sure whether Ninja_new or Ninja_old is the proper Ninja to be running.

rmhubley commented 11 months ago

I'm afraid we do not use mamba/conda -- my guess is there is a problem with how someone set up the conda recipe. I would recommend installing Ninja using the source available from here: https://github.com/TravisWheelerLab/NINJA/releases/tag/0.98-cluster_only

Then re-run the RepeatModeler "configure" tool to set the location where you installed Ninja. If you don't want to install from source, you can use the pre-built TETools docker/singularity image to get a complete installation of RepeatModeler with all dependencies already installed and configured (https://github.com/Dfam-consortium/TETools).

liamfriar commented 11 months ago

Hi @rmhubley, sorry if I did not describe that well. The NINJA installation was not via conda. It was directly from the source:

wget https://github.com/TravisWheelerLab/NINJA/archive/refs/tags/0.95-cluster_only.tar.gz

I have been unsuccessful trying to get RepeatModeler to run. I got BuildDatabase to run. I have done a mix of conda and direct installs, so I don't expect you all to be able to figure this out, but the error I have been receiving is:

RepeatModeler -LTRStruct -ninja_dir $ninja_dir -rmblast_dir $rmblast_dir -repeatmasker_dir $repeatmasker_dir -database $prefix 2>&1 | tee ${prefix}_repeatmodeler.log

RepeatModeler Version 2.0.2
===========================
Search Engine = rmblast 2.14.0+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Enabled ( GenomeTools 1.5.10, LTR_Retriever ,
                                   Ninja , MAFFT 7.520,
                                   CD-HIT 4.8.1 )
Random Number Seed: 1691208703
Database = caroliniana1 .
  - Sequences = 358
  - Bases = 4721499
  - N50 = 19071
  - Contig Histogram:
  Size(bp)                                                        Count
  -----------------------------------------------------------------------
  62019-66271 |                                                   [ 1 ]
  57768-62019 |                                                   [  ]
  53517-57768 |                                                   [  ]
  49265-53516 |*                                                  [ 4 ]
  45014-49265 |*                                                  [ 3 ]
  40763-45014 |**                                                 [ 6 ]
  36511-40762 |***                                                [ 8 ]
  32260-36511 |**                                                 [ 6 ]
  28009-32260 |***                                                [ 9 ]
  23757-28008 |******                                             [ 15 ]
  19506-23757 |*******                                            [ 17 ]
  15255-19506 |**********                                         [ 25 ]
  11003-15254 |***************************                        [ 65 ]
  6752-11003  |**********************************                 [ 81 ]
  2501-6752   |************************************************** [ 118 ]

Using output directory = ~/data/RepeatModeler_out/caroliniana1/RM_1085902.SatAug50411442023
Storage Throughput = good ( 826.18 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly
      and the repetitive content of the sequences.  It is not imperative
      that RepeatModeler completes all rounds in order to obtain useful
      results.  At the completion of each round, the files ( consensi.fa, and
      families.stk ) found in:
      ~/data/RepeatModeler_out/caroliniana1/RM_1085902.SatAug50411442023/ 
      will contain all results produced thus far. These files may be 
      manually copied and run through RepeatClassifier should the program
      be terminated early.

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 4721422 bp ( 4720079 non ambiguous )
   - Num Contigs Represented = 358
   - Sequence extraction : 00:00:01 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 13 )..
   - RepeatScout: Running RepeatScout.. : 112 raw families identified
   - RepeatScout: Running filtering stage.. 111 families remaining
   - RepeatScout: 00:01:11 (hh:mm:ss) Elapsed Time
   - Large Satellite Filtering.. : 0 found in 00:00:01 (hh:mm:ss) Elapsed Time
   - Collecting repeat instances...
 -- Refining Family R=5 / 0 ( RS Elements: 126, Using 100 )

ERROR from search engine (0) 
Can't call method "getNumAlignedSeqs" on an undefined value at ~/miniconda3/envs/repeatmodeler/share/RepeatModeler/Refiner line 776.
RepeatModeler: Could not open refined model ~/data/RepeatModeler_out/caroliniana1/RM_1085902.SatAug50411442023/round-1/family-0.fa.refiner_cons!

Unless that is a familiar error, I think I will try to find a server with Docker to run this on. Thanks.

rmhubley commented 11 months ago

Using your link above for the Ninja source I did the following:

% wget https://github.com/TravisWheelerLab/NINJA/archive/refs/tags/0.95-cluster_only.tar.gz
Resolving github.com (github.com)
...
2023-08-07 10:51:51 (8.01 MB/s) - ‘0.95-cluster_only.tar.gz’ saved [222127]

% tar zxvf 0.95-cluster_only.tar.gz 
NINJA-0.95-cluster_only/
NINJA-0.95-cluster_only/.gitignore
...
NINJA-0.95-cluster_only/README.md

% cd NINJA-0.95-cluster_only/

% ls
.gitignore  LICENSE  NINJA/  README.md

% cd NINJA/

% make
...
g++  -std=gnu++11 -Wall -mssse3 -fopenmp -O3 ArgumentHandler.o ArrayHeapExtMem.o BinaryHeap_FourInts.o BinaryHeap_IntKey_TwoInts.o BinaryHeap_TwoInts.o BinaryHeap.o CandidateHeap.o DistanceCalculator.o DistanceReader.o DistanceReaderExtMem.o ExceptionHandler.o Ninja.o SequenceFileReader.o Stack.o TreeBuilder.o TreeBuilderBinHeap.o TreeBuilderExtMem.o TreeBuilderManager.o ClusterManager.o TreeNode.o -o Ninja

% ls Ninja*
Ninja*  Ninja.cpp  Ninja_new*  Ninja.o  Ninja_old*

# NOTE: There are three executables in this release version, and "Ninja" is the one you want.  Originally you said it only generated Ninja_new and Ninja_old.  The correct version is the one that has the "m" option for --corr_type.  You can check this by running:

% ./Ninja -h
Ninja - Version 0.95-cluster_only

  ./Ninja --in file.fa --out file.out

Arguments: 
--help (or -h) to display this help
--in (or -i) filename
--out (or -o) filename
--in_type type [a | d] (default a)
--out_type type [d | c] (default c)
--corr_type type [n | j | k | s | m]
--cluster_cutoff dist_cutoff (default 0.03)
--threads (or -T) num_threads
--version (or -v) print the software version
For more information, check the README file.
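
The --corr_type check described above can also be scripted. The following is a minimal sketch, not part of NINJA or RepeatModeler: has_corr_type_m is a hypothetical helper name, and it assumes the help text lists the correction types in square brackets on the --corr_type line, as in the output quoted above.

```shell
# has_corr_type_m (hypothetical helper): reads Ninja help text on stdin and
# succeeds only if the --corr_type line ends its bracketed option list with "m".
# Assumption: the help layout matches the 0.95-cluster_only output shown above.
has_corr_type_m() {
    # First grep keeps only the --corr_type line; second grep looks for "| m]".
    grep -e '--corr_type' | grep -q -E '[|] *m *]'
}
```

A possible invocation from inside the NINJA build directory, merging stderr into stdout in case the help text is printed there:

./Ninja -h 2>&1 | has_corr_type_m && echo "this build lists --corr_type m"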

# Then, when you run RepeatModeler with the -LTRStruct option, Ninja should have a version number reported in the screen output, like:

LTR Structural Analysis: Enabled ( GenomeTools 1.5.10, LTR_Retriever v2.9.0,
                                   Ninja 0.95-cluster_only, MAFFT 7.471,
                                   CD-HIT 4.8.1 )
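
The banner check described above can be automated. This is a sketch under stated assumptions: ninja_version_reported is a hypothetical helper, and it relies only on the banner layout seen in this thread, where a missing version appears as "Ninja ," and a detected one as "Ninja 0.95-cluster_only,".

```shell
# ninja_version_reported (hypothetical helper): reads RepeatModeler screen
# output on stdin and succeeds only if "Ninja" is followed by a version token
# rather than a bare comma or closing parenthesis.
# Assumption: the banner format matches the examples quoted in this thread.
ninja_version_reported() {
    grep -q -E 'Ninja +[^ ,)]+'
}
```

Usage, assuming the screen output was saved to a log file (the file name is a placeholder):

ninja_version_reported < repeatmodeler.log || echo "Ninja version not detected; check the Ninja directory"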

The more recent log seems to indicate a different error. I would recommend upgrading to RepeatModeler 2.0.4 first to see if that fixes your installation problem; 2.0.2 is quite old.

liamfriar commented 10 months ago

I did not know I had to run the make command. That appears to have fixed the NINJA problem. I will see about getting RepeatModeler to run. Thank you!
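
A pre-flight check along the following lines would catch an unbuilt NINJA directory before a run starts. It is a minimal sketch: check_ninja_dir is a hypothetical helper, and the assumption that RepeatModeler looks for an executable named exactly Ninja inside the directory is inferred from this thread, not taken from the RepeatModeler documentation.

```shell
# check_ninja_dir (hypothetical helper): succeeds if the given directory
# contains a regular, executable file named exactly "Ninja".
# Assumption (from this thread): Ninja_new and Ninja_old do not count, and
# "Ninja" only exists after running make in the NINJA source directory.
check_ninja_dir() {
    _dir="$1"
    if [ -f "$_dir/Ninja" ] && [ -x "$_dir/Ninja" ]; then
        echo "ok: $_dir/Ninja is executable"
        return 0
    fi
    echo "missing: no executable named Ninja in $_dir (was make run there?)" >&2
    return 1
}
```

For example, using the options that appear earlier in the thread:

check_ninja_dir "$ninja_dir" && RepeatModeler -LTRStruct -ninja_dir "$ninja_dir" -database "$prefix"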