OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0

how to get this file --ref_TEs species database used by RepeatMasker #4

Closed xixifa closed 1 year ago

xixifa commented 1 year ago

Dear Daniel, thanks for your help. I have successfully converted the NCBI format. I already have the RMRBSeqs.embl file that RepeatMasker needs, but how do I get the file for --ref_TEs in Mode 2? Does this option refer to a database for a specific species?

Thanks May

zuodabin commented 1 year ago

When you run RepeatMasker, add the -gff parameter.

OliveiraDS-hub commented 1 year ago

Dear @xixifa, I'm glad that you were able to do the conversion.

In Mode 2, --ref_TEs takes a fasta file with TE consensuses from your species, or at least from the closest species with a TE library already described, because these TEs will be used to mask your assembled transcripts. In other words, provide the file that you would pass to the -lib parameter if you ran RepeatMasker, or just use the species name that you would pass to -species. Check the RepeatMasker documentation here: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html

If I'm not mistaken, RMRBSeqs.embl contains the whole RepBase library, which cannot be used as --ref_TEs input in Mode 2. You can easily generate a fasta file with consensuses for your species (or a sister species) from Dfam data.

You can download the whole database here: https://www.dfam.org/releases/Dfam_3.6/families/ Then choose between the "curatedonly" file and the complete data, which also includes putative TE consensuses.

Finally, use famdb.py to create a fasta for your species from the downloaded file. You can get famdb.py here: https://github.com/Dfam-consortium/FamDB

Example to generate a fasta file with consensuses for human:

    ./famdb.py -i Dfam_curatedonly.h5 families --include-class-in-name -f fasta_name -ad 'human' > human-consensuses-Dfam3.6.fa

Check the generated fasta file, because it might contain satellites and other repeated sequences that are not TEs. They must be removed before running ChimeraTE Mode 2.
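
For the clean-up step above, here is a minimal Python sketch, not part of ChimeraTE or FamDB. It assumes the fasta headers follow the RepeatMasker-style ">name#Class/Family" layout, which is what --include-class-in-name is expected to produce. The list of classes to drop and the function names are my own assumptions, so adjust them after inspecting your headers.

```python
# Assumption: top-level class labels treated as non-TE repeats. This list is
# illustrative; check the classes present in your own fasta before relying on it.
NON_TE_CLASSES = {
    "satellite", "simple_repeat", "low_complexity",
    "rrna", "trna", "snrna", "scrna",
}


def repeat_class(header):
    """Return the lowercase top-level class of a '>name#Class/Family' header.

    Returns '' when the header carries no '#Class' annotation.
    """
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if "#" not in name:
        return ""
    return name.split("#", 1)[1].split("/", 1)[0].lower()


def filter_te_fasta(lines, excluded=NON_TE_CLASSES):
    """Yield fasta lines, skipping whole records whose class is in `excluded`."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            keep = repeat_class(line) not in excluded
        if keep:
            yield line


def write_filtered(path_in, path_out):
    """Read the fasta at `path_in` and write the filtered records to `path_out`."""
    with open(path_in) as src, open(path_out, "w") as dst:
        dst.writelines(filter_te_fasta(src))
```

For example, write_filtered("human-consensuses-Dfam3.6.fa", "human-TEs-only.fa") would write the filtered library; the output file name is just a placeholder. Records without a class annotation are kept, so it is still worth looking through the result by eye.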

Reminder: If you are working with a non-model species, you can try to annotate TE consensuses with RepeatModeler2, EDTA, or REPET, and then use their output fasta file with TE consensuses as input to --ref_TEs.

xixifa commented 1 year ago

Dear Daniel, thank you very much for the quick response. I have obtained a fasta file from Dfam data, as you suggested. Is there any difference between the fasta data obtained from Dfam and from Repbase (the "species" database used by RepeatMasker)? Would there be a relatively large difference in the number of transposons?

Thanks May

OliveiraDS-hub commented 1 year ago

Dear May,

That's a good question. I would say that the two databases have overlapping libraries for model species, so I wouldn't expect large differences. Just keep in mind that in Dfam you can select "curated only" TEs, whereas in RepBase you can't. That makes a big difference in Drosophila: if you are working with melanogaster, it is pointless to use "non curated" consensuses from other Drosophila species to mask its genome.
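
If you want to check the difference for your own species, a rough way is to count the consensuses per TE class in each library. The sketch below is only an illustration: it assumes both fasta files use RepeatMasker-style ">name#Class/Family" headers (a raw RepBase export may use a different header layout and would need converting first), and the function names are hypothetical.

```python
from collections import Counter


def class_counts(lines):
    """Count fasta records per top-level class, read from '>name#Class/Family'."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        name = fields[0] if fields else ""
        # Headers without a '#Class' part are grouped as 'unclassified'.
        cls = name.split("#", 1)[1] if "#" in name else "unclassified"
        counts[cls.split("/", 1)[0]] += 1
    return counts


def compare_libraries(lines_a, lines_b):
    """Return {class: (count in library A, count in library B)}."""
    a, b = class_counts(lines_a), class_counts(lines_b)
    return {cls: (a.get(cls, 0), b.get(cls, 0)) for cls in sorted(set(a) | set(b))}
```

Both functions accept any iterable of lines, so you can pass two open file handles, one per library, and print the resulting dictionary. This compares counts only; it does not tell you whether the same families are present in both libraries.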