Species "all" is not known to RepeatMasker when running -species all

Ruiqi-CUB commented 11 months ago

Thanks for developing the awesome software. I am running the following command with -species all option but encountered an error message. Could you please have a look?

nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species all reference-genome.fna

but I got the following messages:

Species "all" is not known to RepeatMasker.  There may
not be any TE families defined in the libraries for this
species/clade or there may be an error in the spelling.
Please check your entry against the NCBI Taxonomy database
and/or try using a broader clade or related species instead.
The full list of species/clades defined in the library may be
obtained using the famdb.py script.

Species/Taxa Search:
   [NCBI Taxonomy ID: ]

Here is the software version information:

RepeatMasker version 4.1.3-p1
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Master RepeatMasker Database: RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam withRBRM
  Version  : 3.6
  Date     : 2022-04-12
  Families : 63,852

rmhubley commented 11 months ago

I am not sure which versions of RepeatMasker supported the "all" synonym (which maps to NCBI taxid 1, the "root" node) as a way to search the entire database against your sequence, but newer versions (4.1.3 - 4.1.6) don't accept it, as you reported. I am conflicted about this: I am not sure it is practical, without some care, given the current size of the Dfam database and the current architecture of RepeatMasker. If you really want to try it, you can get around the error message you are seeing by using a taxon just below the root. For example:

nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species 'cellular organisms' reference-genome.fna

There are two other things to consider. The first is that this will produce a tremendous number of false positives (a multiple-testing problem from using many unrelated query sequences). The second is that you are using '-nolow', which doesn't simply omit simple repeats from the report; it also skips identifying them before the search against TE families. Many TE families contain stretches of tandem or low-complexity sequence, so they will falsely label tandem repeat sequences if this option is used.
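
As a rough sketch of the workaround above, the command can be assembled and printed for review before launching a long run. The helper name `build_rm_cmd` is hypothetical and not part of RepeatMasker; only flags already shown in this thread are used, and '-nolow' is left out on the assumption that you want simple repeats identified before the TE search.

```shell
# Hypothetical helper (not part of RepeatMasker): prints the command line
# for a clade just below the root instead of the rejected "all" synonym.
build_rm_cmd() {
  # $1 = clade name, $2 = genome FASTA, $3 = output directory
  # -nolow is deliberately omitted (assumption: no separate pre-masking step).
  printf "RepeatMasker -pa 48 -a -e ncbi -dir %s -species '%s' %s\n" "$3" "$1" "$2"
}

# Dry run: print the command rather than executing it.
build_rm_cmd 'cellular organisms' reference-genome.fna all_mask_result
```

Printing the command first makes it easy to confirm that the quoted clade name survives as a single argument before committing 48 threads to the search.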

Ruiqi-CUB commented 11 months ago

Thanks a lot, Robert, for the quick reply! The reason I would like to use -species all is that there are reports of an increasing number of TEs having been horizontally transferred into my study system from bacteria and viruses. Do you think I should run it with them along with cellular organisms and then concatenate the results?

Also, regarding the -nolow option: I have a separate step that identifies simple repeats before this one. Do you think that is good practice?

rmhubley commented 11 months ago

RepeatMasker removes only low-divergence simple repeats before searching against the TE libraries, and then searches for the remaining higher-divergence simple repeats at the end. In this fashion we avoid false matches against TE families that contain simple repeats in their models (consensus/pHMM), while still obtaining better alignments to the TEs when the simple repeat contributes a larger alignment to the family. So, if you pre-mask the genome before running RepeatMasker, you should take that into account.

Ruiqi-CUB commented 11 months ago

Appreciate the insight Robert!

Dfam-consortium / RepeatMasker

Species "all" is not known to RepeatMasker when running -species all #241