Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

RepeatMaskerLib.embl not built (DateRepeats) #150

Open EricDeveaud opened 2 years ago

EricDeveaud commented 2 years ago

Describe the issue

RepeatMaskerLib.embl is not built while configuring RepeatMasker-4.1.2-p1, but it is required by DateRepeats.

rpm_maker:RepeatMasker/RepeatMasker-4.1.2-p1 > DateRepeats 
Indicate directory with the RepeatMasker repeat libraries near line 136 of /opt/gensoft/exe/RepeatMasker/4.1.2-p1/bin/DateRepeats

Reproduction steps

wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.2-p1.tar.gz
tar xf RepeatMasker-4.1.2-p1.tar.gz
mv RepeatMasker RepeatMasker-4.1.2-p1 && cd RepeatMasker-4.1.2-p1
tar xf ${HOME}/RepBaseRepeatMaskerEdition-20181026.tar.gz
wget https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
gunzip  -c Dfam.embl.gz > Libraries/Dfam.embl
module load rmblastn/2.10.0 \
            phrap/1.090518 \
            hmmer/3.2.1 \
            trf/4.09
perl configure -rmblast_dir $(dirname $(command -v rmblastn)) \
               -crossmatch_dir $(dirname $(command -v  cross_match)) \
               -hmmer_dir $(dirname $(command -v hmmconvert)) \
               -trf_prgm $(command -v trf) \
               -default_search_engine rmblast

Log output

 -- Setting perl interpreter...
RepeatMasker Configuration Program

Checking for libraries...

Rebuilding RepeatMaskerLib.h5 master library
  - Read in 49011 sequences from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBSeqs.embl
  - Read in 49011 annotations from /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RMRBMeta.embl
  Merging Dfam + RepBase into RepeatMaskerLib.h5 library..........................................

File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

Total consensus sequences: 51780
Total HMMs: 6915

.
Building FASTA version of RepeatMasker.lib .......................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:
File: /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/Libraries/RepeatMaskerLib.h5
Database: Dfam withRBRM
Version: 3.3
Date: 2020-11-09

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

Total consensus sequences: 51780
Total HMMs: 6915

Further documentation on the program may be found here:
  /opt/gensoft/src/RepeatMasker/RepeatMasker_full-4.1.2-p1/repeatmasker.help

BUT !

ls Libraries/
Artefacts.embl   RMRBSeqs.embl            RepeatMasker.lib.nsq  RepeatPeps.lib.pin
Dfam.embl        RepeatAnnotationData.pm  RepeatMasker.lib.ntf  RepeatPeps.lib.pot
Dfam.h5          RepeatMasker.lib         RepeatMasker.lib.nto  RepeatPeps.lib.psq
README.RMRBSeqs  RepeatMasker.lib.ndb     RepeatMaskerLib.h5    RepeatPeps.lib.ptf
README.meta      RepeatMasker.lib.nhr     RepeatPeps.lib        RepeatPeps.lib.pto
RMRB.embl        RepeatMasker.lib.nin     RepeatPeps.lib.pdb    RepeatPeps.readme
RMRBMeta.embl    RepeatMasker.lib.not     RepeatPeps.lib.phr    taxonomy.dat

and

./DateRepeats
Indicate directory with the RepeatMasker repeat libraries near line 135 of ./DateRepeats

There is no RepeatMaskerLib.embl in Libraries/, although DateRepeats requires it.
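
The check above can also be scripted so a packaging job fails early when an expected library file is absent. This is only a minimal sketch: `check_libs` is a hypothetical helper, not part of RepeatMasker, and the file names passed to it are the ones mentioned in this report.

```shell
# Hypothetical helper (not part of RepeatMasker): report which of the
# named files are missing from a library directory.
check_libs() {
    dir=$1; shift
    missing=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    # Non-zero exit status if at least one file was not found.
    return "$missing"
}

# Assumed call for the install tree described in this report:
# check_libs Libraries RepeatMaskerLib.h5 RepeatMaskerLib.embl
```

The function prints one `missing:` line per absent file and returns a non-zero status if any file is absent, so it can gate a later build step.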

Environment (please include as much of the following information as you can find out):

perl: 5.30.1
Python: version 3.8.1 (h5py 3.6.0)
rmblastn: version 2.10.0
phrap: version 1.090518
hmmer: version 3.2.1
trf: version 4.09
 uname -a
Linux 1b305326d2fe 4.18.0-240.22.1.el8_3.x86_64 #1 SMP Thu Apr 8 19:01:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Additional context

Version 4.1.0, previously installed, works as expected.

rmhubley commented 2 years ago

This is indeed a problem. DateRepeats is quite an old tool and may need some modifications in order to make it work with the new *.h5 database format. I will let you know if I can find a quick workaround.

galt commented 2 years ago

DateRepeats 4.1.2 is also failing at the UCSC Genome Browser while building our hg38 patch 14. We use it to strip out the human-specific repeats.

I added the famdbfile setting to DateRepeats so it does not complain about the famdbfile path not being found:

my $tax = Taxonomy->new( taxonomyDataFile => $taxFile, famdbfile => "$dir/RepeatMaskerLib.h5");

However, it then ran for more than 27 hours, using CPU the whole time, until I killed it.

With RM version 4.1.0, all the small patch chromosomes finished in just about one minute each.

Please let me know if it would be handy to supply the commandline and input file for testing.

galt commented 2 years ago

Hanging command is: DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'

chr5_MU273352v1_fix.txt
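
For anyone reproducing the hang, it may help to put a time limit on the run rather than leave it for hours. The sketch below assumes GNU coreutils `timeout` is available; `run_with_limit` is a hypothetical wrapper name, and the 600-second limit in the example call is an arbitrary choice.

```shell
# Hypothetical wrapper: run a command under a time limit using
# GNU coreutils `timeout`, and say so on stderr if the limit is hit.
run_with_limit() {
    limit=$1; shift
    status=0
    timeout "$limit" "$@" || status=$?
    # `timeout` exits with status 124 when the command timed out.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Assumed call for the command reported above:
# run_with_limit 600 DateRepeats chr5_MU273352v1_fix.txt -query human -comp 'mus musculus'
```

A status of 124 separates a run that hit the limit from one that failed for another reason.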

rmhubley commented 1 year ago

Thanks Galt. I removed DateRepeats in the latest version (4.1.4) as it needs refactoring. I will make sure this is a high priority for the next release.