Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Unusually high number of "unknown" elements from `RepeatClassifier` when using RepeatMasker 4.1.1 #128

Closed by jebrosen 1 year ago

jebrosen commented 3 years ago

Cause of the issue

We identified a bug in RepeatMasker 4.1.1 which affects classifications from RepeatClassifier that are based on similarity to other known elements. RepeatMasker's configure program generates a file named RepeatMasker.lib in its Libraries directory, which is used by RepeatClassifier; however, in RepeatMasker 4.1.1 this file is missing the required classification data. This bug is fixed in RepeatMasker 4.1.2.

What programs are impacted by this bug?

This issue affects classifications from RepeatClassifier that are based on similarity to other known elements, causing them to be classified as "Unknown" instead. Classifications that are based on similarity to known protein sequences (RepeatPeps.lib) are unaffected by this bug. This bug only affects the RepeatClassifier program, which is part of the RepeatModeler package. It does not affect other programs, including RepeatMasker itself or the family discovery steps of RepeatModeler; only the classification step that RepeatModeler runs last is affected.

Am I affected by this bug?

If you are using RepeatClassifier (or RepeatModeler, which runs this program as its last step) and have configured RepeatModeler to use RepeatMasker 4.1.1, you are probably affected. You can confirm whether this affects you by inspecting the first line of RepeatMasker.lib by hand:

Example incorrect file (missing classification):

$ head -n1 /path/to/RepeatMasker/Libraries/RepeatMasker.lib
>ACRO1_ @Primates [S:50]

Example correct file (showing a classification #Satellite/acromeric):

$ head -n1 /path/to/RepeatMasker/Libraries/RepeatMasker.lib
>ACRO1_#Satellite/acromeric @Primates [S:50]
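
To look beyond the first record, a small shell sketch along these lines could count how many headers lack a classification. The function name `count_unclassified` is hypothetical, and the check assumes that a classification always appears as a `#` inside the first whitespace-delimited token of the header, as in the example above:

```shell
# Hypothetical helper: count FASTA headers in a library whose name
# (the first whitespace-delimited token) contains no "#class" part.
count_unclassified() {
    lib="$1"
    # For each header line, test the first token for '#'; print the tally
    # (n + 0 forces a numeric 0 when no header matched).
    awk '/^>/ { if (index($1, "#") == 0) n++ } END { print n + 0 }' "$lib"
}

# Usage (the path is a placeholder, as elsewhere in this issue):
# count_unclassified /path/to/RepeatMasker/Libraries/RepeatMasker.lib
```

On an affected installation the count should be close to the total number of records; on a fixed file it should be at or near zero.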

Suggested solutions

1) You can install RepeatMasker 4.1.2 or later, in which this bug has been fixed.

2) You can install a copy of an older version of RepeatMasker (such as 4.1.0) and configure RepeatModeler to use that installation of RepeatMasker instead of RepeatMasker 4.1.1.

3) You can manually regenerate the file RepeatMasker.lib with the necessary classification data:

$ cd /path/to/RepeatMasker/
$ ./famdb.py -i ./Libraries/RepeatMaskerLib.h5 families --descendants 1 --curated --format fasta_name --include-class-in-name > ./Libraries/RepeatMasker.lib
$ rm ./Libraries/RepeatMasker.lib.n*
$ /path/to/rmblast/bin/makeblastdb -dbtype nucl -in ./Libraries/RepeatMasker.lib

After applying any of these solutions and confirming that it has fixed the file (see above), you can re-run only RepeatClassifier, without re-running all of RepeatModeler, to reclassify your results:

$ RepeatClassifier -consensi yourgenome-families.fa -stockholm yourgenome-families.stk

This will reclassify sequences according to the new (fixed) RepeatMasker.lib file and overwrite the files yourgenome-families.fa.classified and yourgenome-families-classified.stk.
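
To see whether the re-run helped, one option is to tally the "Unknown" families in the classified FASTA file. This is only a sketch: `count_unknown` is a hypothetical helper, and it assumes each header in the classified file has the form `>name#Class`, with unclassified families tagged `#Unknown`. Because the re-run overwrites the output, copy the old `.classified` file aside first if you want a before/after comparison.

```shell
# Hypothetical helper: report how many families in a classified FASTA
# file carry the "#Unknown" tag in their header name.
count_unknown() {
    classified="$1"
    # Count all header lines, and those whose first token ends in "#Unknown".
    awk '/^>/ { total++; if ($1 ~ /#Unknown$/) unknown++ }
         END { printf "%d of %d families are Unknown\n", unknown, total }' "$classified"
}

# Usage:
# count_unknown yourgenome-families.fa.classified
```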