Closed Ruiqi-CUB closed 11 months ago
I am not sure which versions of RepeatMasker supported the "all" synonym (which maps to the NCBI taxid 1 "root" node) as a way to search the entire database against your sequence, but newer versions (4.1.3 - 4.1.6) don't accept it, as you reported. I am conflicted about this, as I am not sure it is practical with the current size of the Dfam database and the current architecture of RepeatMasker without some care. If you really want to try this, you could get around the error message you are seeing by using any taxon below the root, e.g.:
nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species 'cellular organisms' reference-genome.fna
There are two other things to consider. The first is that this will produce a tremendous number of false positives (a multiple testing problem caused by using many unrelated query sequences). The second is that you are using '-nolow', which doesn't simply omit simple repeats from reporting; it also skips identifying them prior to searching against TE families. Many TE families contain stretches of tandem or low-complexity sequence and will falsely label tandem repeat sequences if this option is used.
Thanks a lot Robert for the quick reply!
The reason I would like to use -species all is that there have been reports that an increasing number of TEs were horizontally transferred to my study system from bacteria and viruses. Do you think I should run it with them along with 'cellular organisms' and then concatenate the results?
Also, for the -nolow option, I have a separate step that identifies simple repeats before this one. Do you think that is good practice?
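For readers who want to try the run-per-taxon-and-concatenate idea raised above, here is a minimal sketch rather than a recommended workflow. It assumes each RepeatMasker .out table starts with the usual three header lines (two column-title lines and a blank line). The function name merge_rm_out, the mask_* directory names, and the 'viruses' species value are hypothetical; whether the library holds families for a given taxon needs checking separately. Note also that in the NCBI taxonomy Bacteria already sits under 'cellular organisms'.

```shell
#!/bin/sh
# Sketch: merge several RepeatMasker .out tables into one stream.
# Assumption: every input begins with three header lines
# (two column-title lines followed by a blank line).
merge_rm_out() {
    first=1
    for f in "$@"; do
        if [ "$first" -eq 1 ]; then
            cat "$f"            # keep the header from the first file only
            first=0
        else
            tail -n +4 "$f"     # drop the three header lines
        fi
    done
}

# Hypothetical usage, one run per taxon in its own output directory:
#   RepeatMasker -pa 48 -a -e ncbi -dir mask_cellular \
#       -species 'cellular organisms' reference-genome.fna
#   RepeatMasker -pa 48 -a -e ncbi -dir mask_viruses \
#       -species 'viruses' reference-genome.fna
#   merge_rm_out mask_*/reference-genome.fna.out > combined.out
```

Plain concatenation does not reconcile overlapping annotations from separate runs, and the ID column restarts in each file, so the merged table would still need post-processing before use.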
RepeatMasker only removes low-divergence simple repeats prior to searching against the TE libraries, and then at the end searches for the remaining higher-divergence simple repeats. In this fashion we avoid false matches against TE families that contain simple repeats in their models (consensus/pHMM) while still obtaining better alignments to the TEs when the simple repeat contributes a larger alignment to the family. So, if you pre-mask the genome before running RepeatMasker, you should take that into account.
Appreciate the insight Robert!
Thanks for developing the awesome software. I am running the following command with the -species all option but encountered an error message. Could you please have a look?
nohup RepeatMasker -pa 48 -a -e ncbi -dir all_mask_result -nolow -species all reference-genome.fna
But I got the following messages:
Here is the software version information:
RepeatMasker version 4.1.3-p1
Search Engine: NCBI/RMBLAST [ 2.14.1+ ]
Using Master RepeatMasker Database: RepeatMasker/Libraries/RepeatMaskerLib.h5
Title : Dfam withRBRM
Version : 3.6
Date : 2022-04-12
Families : 63,852