Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Dfam 3.0 compatibility #34

Closed isgilman closed 5 years ago

isgilman commented 5 years ago

Hello, I'm trying to run the RepeatModeler + RepeatMasker process after updating Dfam from 2.0 to 3.0. I had no issues running with Dfam 2.0, but after downloading the new Dfam.hmm.gz and unzipping it in the RepeatMasker/Libraries directory, I'm having trouble generating a consensus file. BuildDatabase ran fine, but when I reran my old batch file that had previously worked (RepeatModeler -pa 36 -engine ncbi -database Portulaca_amilis), the output looked like the following:

RepeatModeler Round # 4
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 27000000 bp
 -- Running TRFMask on the sequence...
 -- Sample Stats:
       Sample Size 27267751 bp
       Num Contigs Represented = 278
       Non ambiguous bp:
             Initial: 27017669 bp
             After Masking: 27017669 bp
             Masked: 0.00 %
 -- Input Database Coverage: 39381039 bp out of 403885173 bp ( 9.75 % )
Sampling Time: 00:01:43 (hh:mm:ss) Elapsed Time
Running all-by-other comparisons...
        0% completed,  00:39:20 (hh:mm:ss) est. time remaining.
        0% completed,  00:23:50 (hh:mm:ss) est. time remaining.
        0% completed,  00:15:52 (hh:mm:ss) est. time remaining.
        0% completed,  00:33:35 (hh:mm:ss) est. time remaining.
        1% completed,  00:26:48 (hh:mm:ss) est. time remaining.
        1% completed,  00:24:54 (hh:mm:ss) est. time remaining.
        ...
       99% completed,  00:0:00 (hh:mm:ss) est. time remaining.
      100% completed,  00:0:00 (hh:mm:ss) est. time remaining.
Comparison Time: 00:04:12 (hh:mm:ss) Elapsed Time, 358471 HSPs Collected
  - RECON: Running imagespread..
RECON Elapsed: 00:00:01 (hh:mm:ss) Elapsed Time
  - RECON: Running initial definition of elements ( eledef )..
RECON Elapsed: 00:00:23 (hh:mm:ss) Elapsed Time
  - RECON: Running re-definition of elements ( eleredef )..
RECON Elapsed: 00:20:36 (hh:mm:ss) Elapsed Time
  - RECON: Running re-definition of edges ( edgeredef )..
RECON Elapsed: 00:01:43 (hh:mm:ss) Elapsed Time
  - RECON: Running family definition ( famdef )..
RECON Elapsed: 00:00:06 (hh:mm:ss) Elapsed Time
  - Obtaining element sequences
Number of families returned by RECON: 3931
Processing families with greater than 15 elements

Processing RECON family: 209
  - Saving elements to a file...
    - 207 elements found.
Element Gathering: 00:00:01 (hh:mm:ss) Elapsed Time
Refining family-209 model...
  WARNING: Refiner did not return a consensus.
Refinement: 00:00:00 (hh:mm:ss) Elapsed Time

Processing RECON family: 489
  - Saving elements to a file...
    - 87 elements found.
Element Gathering: 00:00:00 (hh:mm:ss) Elapsed Time
Refining family-489 model...
  WARNING: Refiner did not return a consensus.
Refinement: 00:00:00 (hh:mm:ss) Elapsed Time

Refiner fails to return a consensus for every family found by RECON. In the Dfam 3.0 documentation, the first noted change from 2.0 to 3.0 is:

Dfam and Dfam_consensus have been merged into one comprehensive database for transposable element family consensus sequences and profile Hidden Markov Models. Consensus sequences were generated using the seed alignment data for each family. These sequences may differ from RepBase and in many cases reflect an improvement owing to deeper seed alignments that they are called from. The consensus sequences appear in the model tab of each family and an EMBL file download link is provided. The complete set of consensi for Dfam is also included in the families directory of this release.

This seems relevant because it appears we're finding sequences but failing to assign them consensus sequences for output. I noticed some commits related to the updated database on the RepeatMasker repo, but I'm not sure what the status is for the pipeline overall. Am I interpreting this correctly, or is this another issue with RepeatMasker or Refiner?

Thanks, Ian

rmhubley commented 5 years ago

This shouldn't have anything to do with your upgrading from Dfam 2.0 to 3.0 -- although I will address that further at the end. Refiner is the component of RepeatModeler that is handed pre-clustered instances of a single TE family and is responsible for aligning them and refining that alignment. At the end it should call a consensus from the final alignment. It does not use RepeatMasker or any of its libraries to do this task. I think the problem precedes this step. Something that is really glaring is that in the log you posted you reached Round #4 of RepeatModeler and haven't accumulated any consensi that are capable of masking the sample for this round ( "Masked: 0.00 %" ). I would like to take a look at this further. Could you send me ( rhubley@systemsbiology.org ) your full log output, and ideally the "RM_###*" directory, tar'd and gzipped?
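For anyone wanting to check a full log for the same symptom, here is a minimal Python sketch. It assumes the log follows the format of the excerpt posted above (a "RepeatModeler Round # N" header, a "Masked: X %" line, and one "Processing RECON family" line per family). The function name `summarize_log` and the dictionary keys are hypothetical and not part of RepeatModeler.

```python
import re

# Patterns taken from the log excerpt in this thread; other RepeatModeler
# versions may format these lines differently (assumption).
ROUND_RE = re.compile(r"RepeatModeler Round #\s*(\d+)")
MASKED_RE = re.compile(r"Masked:\s*([\d.]+)\s*%")
FAMILY_RE = re.compile(r"Processing RECON family:\s*(\d+)")
NO_CONSENSUS = "Refiner did not return a consensus"


def summarize_log(lines):
    """Return {round: {"masked_pct", "families", "failed"}} for a log."""
    rounds = {}
    current = None
    for line in lines:
        match = ROUND_RE.search(line)
        if match:
            # A new round header starts a fresh set of counters.
            current = int(match.group(1))
            rounds[current] = {"masked_pct": None, "families": 0, "failed": 0}
            continue
        if current is None:
            continue  # ignore anything before the first round header
        match = MASKED_RE.search(line)
        if match:
            rounds[current]["masked_pct"] = float(match.group(1))
        elif FAMILY_RE.search(line):
            rounds[current]["families"] += 1
        elif NO_CONSENSUS in line:
            rounds[current]["failed"] += 1
    return rounds


# Trimmed copy of the excerpt posted above, used as sample input.
SAMPLE = """\
RepeatModeler Round # 4
 -- Sample Stats:
             After Masking: 27017669 bp
             Masked: 0.00 %
Processing RECON family: 209
Refining family-209 model...
  WARNING: Refiner did not return a consensus.
Processing RECON family: 489
Refining family-489 model...
  WARNING: Refiner did not return a consensus.
"""

if __name__ == "__main__":
    for number, stats in sorted(summarize_log(SAMPLE.splitlines()).items()):
        print(number, stats)
```

A later round reporting 0.00 % masked, together with every processed family failing in Refiner, is the pattern described above. To run it on a real log, pass the lines of the log file to `summarize_log` instead of `SAMPLE`.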

As for upgrading from Dfam 2.0 to 3.0, I would recommend not simply replacing the Dfam.hmm file but rather downloading the latest RepeatMasker package, 4.0.9-p2 ( which includes it ). Dfam 3.0 is not compatible with previous versions of RepeatMasker. FYI, RepeatMasker is only used by RepeatModeler for a few steps. RepeatModeler reuses some of its utility modules for masking tandem repeats in genome samples and for running RMBlast/ABBlast all-vs-all searches, and it uses the RepeatMasker libraries in the RepeatClassifier step at the very end of a RepeatModeler run.

isgilman commented 5 years ago

Thanks for the quick reply! It's great to know how the programs call each other. I'll get a tarball over to you and work on installing 4.0.9-p2. I've been using 4.0.8 as part of a conda install of FUNannotate, which I know is a double-edged sword, especially when the developers aren't building the module.
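For reference, one way to bundle the run directories into a gzipped tarball is sketched below in Python. It assumes the RepeatModeler working directories sit directly under the directory you pass in and have names starting with "RM_". The function name `archive_run_dirs` and the default output file name are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_run_dirs(workdir, out_name="repeatmodeler_run.tar.gz"):
    """Pack every RM_* directory under workdir into one .tar.gz file."""
    workdir = Path(workdir)
    # Only directories are collected; stray files named RM_* are skipped.
    run_dirs = sorted(p for p in workdir.glob("RM_*") if p.is_dir())
    out_path = workdir / out_name
    with tarfile.open(out_path, "w:gz") as tar:
        for run_dir in run_dirs:
            # arcname keeps paths in the archive relative to workdir.
            tar.add(run_dir, arcname=run_dir.name)
    return out_path, [d.name for d in run_dirs]
```

Calling `archive_run_dirs(".")` from the directory where RepeatModeler was launched writes the archive next to the run directories and returns its path along with the names of the directories it packed.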