Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Problem with Dfam 3.7 famdb.py version #191

Closed JWDebler closed 1 year ago

JWDebler commented 1 year ago

I am trying to configure RepeatMasker 4.1.4 with Dfam 3.7 (h5).

Reproduction steps

Download Dfam3.7 (curated)

cd RepeatMasker/Libraries
wget https://www.dfam.org/releases/Dfam_3.7/families/Dfam_curatedonly.h5.gz
gunzip Dfam_curatedonly.h5.gz
mv Dfam_curatedonly.h5 Dfam.h5
cd ..
perl ./configure 

This throws the following error:

Checking for libraries...
Rebuilding RepeatMaskerLib.h5 master library
  Merging Dfam + RepBase into RepeatMaskerLib.h5 library.....ERROR:__main__:Error reading file: This file cannot be read by this version of famdb.py.

I re-downloaded Dfam 3.6, repeated the process above, and everything worked fine.

I then downloaded the latest version of famdb.py (https://github.com/Dfam-consortium/FamDB), and configuration seems to work. It only throws a lot of warnings about taxa, like this: WARNING:__main__:Could not find taxon for 'melampsora_larici-populina'

Finishing the configuration threw a few additional errors:

Building FASTA version of RepeatMasker.lib .......Traceback (most recent call last):
  File "/opt/RepeatMasker/famdb.py", line 1875, in <module>
    main()
  File "/opt/RepeatMasker/famdb.py", line 1868, in main
    args.func(args)
  File "/opt/RepeatMasker/famdb.py", line 1663, in command_families
    print_families(args, families, True, target_id)
  File "/opt/RepeatMasker/famdb.py", line 1601, in print_families
    entry += family.to_fasta(
  File "/opt/RepeatMasker/famdb.py", line 391, in to_fasta
    for clade_id in self.clades:
TypeError: 'NoneType' object is not iterable
.
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5   
FamDB Generator: famdb.py v0.4.2
FamDB Format Version: 0.5
FamDB Creation Date: 2023-01-08 10:42:05.645898 
Database: Dfam withRBRM
Version: 3.7
Date: 2023-01-11
Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026
Total consensus sequences: 64595
Total HMMs: 19730
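
The TypeError in the traceback above comes from iterating over `self.clades` while it is `None`. A minimal sketch of that failure mode and one defensive guard follows. The `Family` class and its method names here are hypothetical stand-ins, not the real famdb.py code, and this is not necessarily how the issue was addressed upstream.

```python
class Family:
    """Hypothetical stand-in for a FamDB family record (not the real famdb.py class)."""

    def __init__(self, name, clades=None):
        self.name = name
        self.clades = clades  # None when the record carries no clade list

    def clade_ids_unsafe(self):
        # Mirrors the failing pattern: iterating over None raises TypeError.
        return [clade_id for clade_id in self.clades]

    def clade_ids_safe(self):
        # Treat a missing clade list as empty instead of failing.
        return [clade_id for clade_id in (self.clades or [])]


family = Family("example_family")  # hypothetical record with no clades set

try:
    family.clade_ids_unsafe()
    unsafe_error = None
except TypeError as err:
    unsafe_error = str(err)

safe_result = family.clade_ids_safe()
print(unsafe_error)
print(safe_result)
```

Whether an empty clade list is the right fallback depends on what the FASTA header is expected to contain, which this thread does not state.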

Note that it mentions FamDB Generator: famdb.py v0.4.2, but the new famdb.py I used is version 0.4.3.
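
To compare the versions reported in that summary against the script actually installed, the "Key: value" lines can be parsed mechanically. This is only a sketch: the helper names are made up for illustration, the excerpt is copied from the output above, and what the "FamDB Generator" field records (and whether a mismatch matters) is not stated in this thread.

```python
import re

# Excerpt of the library summary printed at the end of configuration.
SUMMARY = """\
File: /opt/RepeatMasker/Libraries/RepeatMaskerLib.h5
FamDB Generator: famdb.py v0.4.2
FamDB Format Version: 0.5
Database: Dfam withRBRM
Version: 3.7
"""


def parse_library_summary(text):
    """Collect 'Key: value' lines into a dict, splitting at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def generator_version(fields):
    """Extract the 'vX.Y.Z' number from the generator field, or None if absent."""
    match = re.search(r"v(\d+(?:\.\d+)*)", fields.get("FamDB Generator", ""))
    return match.group(1) if match else None


fields = parse_library_summary(SUMMARY)
installed = "0.4.3"  # version of the replacement famdb.py reported in this thread

print("generator:", generator_version(fields))
print("format:", fields.get("FamDB Format Version"))
print("matches installed script:", generator_version(fields) == installed)
```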

RepeatMasker ran with the newer famdb.py and 3.7, but the results are not useful. After reverting back to famdb.py 0.4.2 and Dfam 3.6 everything is OK.

Hope this helps. Cheers

rmhubley commented 1 year ago

Sorry for the confusion. The new release of Dfam uses the FamDB v0.5 format, which is not backward compatible with the current release of RepeatMasker (4.1.4). We will be releasing a maintenance update to RepeatMasker to fix this. As you discovered, you can also update an existing RepeatMasker installation by replacing the famdb.py script bundled with it and then rerunning the RepeatMasker configure script. Could you please elaborate on the "but the results are not useful" conclusion from doing this?
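
The workaround described above (replace the bundled famdb.py, then rerun configure) could be scripted roughly as follows. This is a sketch under assumptions: the function name, the `.bak` backup suffix, and the example paths are illustrative, and since configure may prompt for input, running it by hand may be preferable to the optional `run_configure` step.

```python
import shutil
import subprocess
from pathlib import Path


def replace_famdb(repeatmasker_dir, new_famdb, run_configure=False):
    """Back up the bundled famdb.py, copy the newer one in, optionally rerun configure.

    Returns the path of the backup copy.
    """
    repeatmasker_dir = Path(repeatmasker_dir)
    bundled = repeatmasker_dir / "famdb.py"
    backup = repeatmasker_dir / "famdb.py.bak"  # assumed backup name

    # Keep the old script so the change can be reverted.
    shutil.copy2(bundled, backup)
    shutil.copy2(new_famdb, bundled)

    if run_configure:
        # Same command as in the reproduction steps, run from the install directory.
        subprocess.run(["perl", "./configure"], cwd=repeatmasker_dir, check=True)

    return backup


# Example call (paths are assumptions, adjust to the local installation):
# replace_famdb("/opt/RepeatMasker", "/tmp/FamDB/famdb.py", run_configure=True)
```

Reverting is then a matter of copying `famdb.py.bak` back over `famdb.py` and rerunning configure.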

rmhubley commented 1 year ago

Please see the https://www.repeatmasker.org/RepeatMasker site for details on how to use Dfam 3.7 with RepeatMasker 4.1.4.

JWDebler commented 1 year ago

Hmm, not sure what happened. When I ran it with 3.7 yesterday, the run didn't finish, so I thought it had something to do with all the warnings printed during configuration. I just tried again and it worked fine.

rmhubley commented 1 year ago

Let me know if you have any further problems. I am going to close this for now.