Closed xixifa closed 1 year ago
When you run RepeatMasker, add the -gff parameter.
Dear @xixifa, I'm glad that you were able to do the conversion.
In Mode 2, you have to provide to --ref_TEs a fasta file with TE consensuses from your specific species, or at least from the closest species with a TE library already described, because these TEs will be used to mask your assembled transcripts. In other words, you have to provide the file that you'd use with the -lib parameter if you ran RepeatMasker, or just use the species name that you'd use with -species. Check the RepeatMasker documentation here: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html
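To make the correspondence between the two RepeatMasker options concrete, here is a minimal sketch that assembles a RepeatMasker command line with either -lib or -species (the two are alternatives), plus -gff. The function name and the file names are hypothetical, and this only illustrates which value would go to --ref_TEs; it is not how ChimeraTE calls RepeatMasker internally.

```python
from typing import List, Optional


def build_repeatmasker_cmd(
    fasta: str,
    lib: Optional[str] = None,
    species: Optional[str] = None,
    gff: bool = True,
) -> List[str]:
    """Build a RepeatMasker argument list (hypothetical helper).

    Exactly one of `lib` (custom consensus fasta) or `species`
    (a species name known to RepeatMasker) must be given.
    """
    if (lib is None) == (species is None):
        raise ValueError("provide exactly one of lib or species")
    cmd = ["RepeatMasker"]
    if lib is not None:
        cmd += ["-lib", lib]          # custom TE library, e.g. from Dfam
    else:
        cmd += ["-species", species]  # species name instead of a file
    if gff:
        cmd.append("-gff")            # also write a GFF annotation file
    cmd.append(fasta)
    return cmd


if __name__ == "__main__":
    # File names below are placeholders for illustration only.
    print(" ".join(build_repeatmasker_cmd("transcripts.fa", lib="my_TE_library.fa")))
    print(" ".join(build_repeatmasker_cmd("transcripts.fa", species="human")))
```

Whatever you would pass as `lib` or `species` here is what --ref_TEs expects in Mode 2.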
If I'm not wrong, RMRBSeqs.embl refers to the whole RepBase, which cannot be used in Mode 2 as --ref_TEs input. You can easily generate a fasta file with consensuses from your species (or a sister species) by using Dfam data.
You can download the whole database here: https://www.dfam.org/releases/Dfam_3.6/families/ Then, choose whether you want the "curatedonly" file or the complete data, which also includes putative TE consensuses.
Finally, create a fasta file for your species from this downloaded file with famdb.py, which you can download here: https://github.com/Dfam-consortium/FamDB
Example to generate a fasta file with consensus for human:
./famdb.py -i Dfam_curatedonly.h5 families --include-class-in-name -f fasta_name -ad 'human' > human-consensuses-Dfam3.6.fa
Check the generated fasta file, because it might contain satellites and other repeated sequences that are not TEs. They must be removed before running ChimeraTE Mode 2.
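One possible way to do this cleanup is sketched below. It assumes the headers look like `>name#class ...`, which is what --include-class-in-name is expected to produce, and it drops records whose class starts with one of a few non-TE labels. The list of labels and the function names are assumptions for illustration; check the classes actually present in your file and adjust the list. Records without a class are kept, so inspect those by hand.

```python
import sys
from typing import Iterable, Iterator, Tuple

# Assumed non-TE class labels; adjust after inspecting your own headers.
NON_TE_CLASSES = (
    "Satellite", "Simple_repeat", "Low_complexity",
    "rRNA", "tRNA", "snRNA", "scRNA",
)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def repeat_class(header: str) -> str:
    """Return the text after '#' in the first header token, or ''."""
    parts = header.split()
    name = parts[0] if parts else ""
    return name.split("#", 1)[1] if "#" in name else ""


def filter_tes(lines: Iterable[str]) -> Iterator[str]:
    """Yield fasta records whose class is not in NON_TE_CLASSES."""
    for header, seq in read_fasta(lines):
        if not repeat_class(header).startswith(NON_TE_CLASSES):
            yield f">{header}\n{seq}\n"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python filter_tes.py input.fa output.fa
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        fout.writelines(filter_tes(fin))
```

The filtered output file is then the one to pass to --ref_TEs.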
Reminder: If you are working with a non-model species, you can try to annotate TE consensuses with RepeatModeler2, EDTA or REPET, and then use their output fasta file with TE consensuses as input to --ref_TEs.
Dear Daniel, Thank you very much for the quick response. I have obtained a fasta file by using Dfam data, as per your suggestion. Is there any difference between the fasta data obtained by Dfam and Repbase (the "species" database used by RepeatMasker) ? Would there be a relatively large difference in the number of transposons?
Thanks May
Dear May,
That's a good question. I would say that both databases have overlapping libraries for model species, and therefore I wouldn't expect large differences. Just keep in mind that in Dfam you can select "curated only" TEs, whereas in RepBase you can't. That makes a big difference in Drosophila, because if you are working with melanogaster it is pointless to use "non curated" consensuses from other Drosophila species to mask its genome.
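If you want to check the difference for your own species, a rough comparison is to count consensuses per top-level class in each library. The sketch below assumes both libraries are fasta files with `>name#class` headers (a RepBase EMBL file would first need converting to that format); the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_by_class(lines: Iterable[str]) -> Counter:
    """Count fasta records per top-level class (text between '#' and '/')."""
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        name = parts[0] if parts else ""
        cls = name.split("#", 1)[1] if "#" in name else "unclassified"
        counts[cls.split("/")[0]] += 1
    return counts


def compare(a: Counter, b: Counter) -> Dict[str, Tuple[int, int]]:
    """Return {class: (count_in_a, count_in_b)} over all classes seen."""
    return {cls: (a.get(cls, 0), b.get(cls, 0)) for cls in sorted(set(a) | set(b))}


if __name__ == "__main__":
    # Tiny in-memory stand-ins for two libraries (illustrative only).
    lib_a = [">A#LTR/Gypsy\n", "ACGT\n", ">B#DNA/hAT\n", "AC\n"]
    lib_b = [">A#LTR/Copia\n", "AC\n", ">C#LTR/Gypsy\n", "AC\n"]
    for cls, (n_a, n_b) in compare(count_by_class(lib_a), count_by_class(lib_b)).items():
        print(f"{cls}\t{n_a}\t{n_b}")
```

Counts alone do not tell you whether the same families are present in both libraries, so treat this only as a first look.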
Dear Daniel, Thanks for your help, I have successfully converted the NCBI format. I already have the RMRBSeqs.embl file that RepeatMasker needs, but how do I get the file corresponding to --ref_TEs in Mode 2? Does this option refer to the database of a specific species?
Thanks May