bergmanlab / mcclintock

Meta-pipeline to identify transposable element insertions using next generation sequencing data

About the consensus fasta file #117

Closed Song-10-YF closed 1 year ago

Song-10-YF commented 1 year ago

Hi @cbergman, thank you very much for your pipeline for finding TE insertions from resequencing data. I have a question about TE consensus sequences. I built the consensus library de novo from the reference genome with RepeatModeler, but several superfamilies appear more than once, for example:

rnd-1_family-22#LTR/Gypsy
rnd-1_family-54#LTR/Gypsy
rnd-1_family-443#LTR/Copia
rnd-1_family-446#LTR/Copia

and so on, but their sequences are not identical. What should I do with these sequences (keep only one per repeated superfamily?) to get TE annotation (.gff) and TE taxonomy (.tsv) files that conform to the software's specification? I look forward to your reply.

cbergman commented 1 year ago

Hi @Song-10-YF

Thanks for your query. Based on what you say and my understanding of RepeatModeler output, the sequences you list are not duplicate families, but rather distinct repeat consensus sequences with similar naming conventions. If you have further questions about the content of the RepeatModeler file, please direct them to the developers of RepeatModeler.

To use your RepeatModeler library as input for McClintock, you could supply the library as the argument to the -c option and let McClintock automatically create the necessary annotation (.gff) and taxonomy (.tsv) files. Alternatively, you could run RepeatMasker independently to create your annotation (.gff) file, then parse that .gff file to create a two-column .tsv file with the IDs of the annotated elements in the first column and the TE family each belongs to in the second column, and supply these files to McClintock with the -g and -t options, respectively.
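As a rough illustration of the second route, the two-column .tsv could be derived from the .gff along these lines. This is only a sketch under stated assumptions: it assumes column 9 of each .gff record carries `ID=<element id>;Name=<family>` attributes, which may not match what your RepeatMasker run actually emits, and the function names and sample records are hypothetical.

```python
import csv
import io


def gff_to_taxonomy_rows(gff_lines):
    """Return (element_id, family) pairs from GFF records.

    ASSUMPTION: column 9 looks like "ID=<element id>;Name=<family>".
    Adjust the attribute keys to whatever your .gff really contains.
    """
    rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comments/directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        attrs = {}
        for item in fields[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
        if "ID" in attrs and "Name" in attrs:
            rows.append((attrs["ID"], attrs["Name"]))
    return rows


def write_taxonomy_tsv(rows, handle):
    """Write the pairs as a two-column, tab-separated file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        "##gff-version 3",
        "chr1\tRepeatMasker\tdispersed_repeat\t100\t900\t.\t+\t.\tID=te_1;Name=family_22",
        "chr2\tRepeatMasker\tdispersed_repeat\t50\t700\t.\t-\t.\tID=te_2;Name=family_443",
    ]
    out = io.StringIO()
    write_taxonomy_tsv(gff_to_taxonomy_rows(sample), out)
    print(out.getvalue(), end="")
```

The element IDs in the first column need to match the IDs used in the .gff file exactly, since that is how the two files are linked.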

I hope this helps, Casey

Song-10-YF commented 1 year ago

Hi @cbergman, yes, I would also like to use McClintock's built-in RepeatMasker step by passing the RepeatModeler consensus sequences directly with "-c". However, RepeatMasker relies on the "#" in each consensus header line as the symbol that separates the sequence name from the TE superfamily, e.g.:

rnd-5_family-7973#DNA/TcMar-nMITE
rnd-5_family-3543#LTR/Gypsy
rnd-5_family-8667#DNA/hAT-MITE

This conflicts with the problematic symbols listed by your software (Problematic symbols: ; & ( ) | * ? [ ] ~ { } < ! ^ " ' \ $ / + - #), since these headers contain "#", "/" and "-". How can I solve this problem?

Looking forward to your reply, thanks!
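One possible workaround for the header conflict described above, not confirmed in this thread, is to rename the consensus sequences before running the pipeline. The sketch below replaces every symbol from the quoted problematic list with an underscore and keeps a mapping back to the original RepeatModeler names. It assumes that renamed headers are acceptable downstream and that the classification after "#" can be recovered from the mapping; the function names are hypothetical.

```python
# Symbols copied from the "Problematic symbols" list quoted above.
PROBLEMATIC = set(";&()|*?[]~{}<!^\"'\\$/+-#")


def sanitize_name(name):
    """Replace each problematic symbol in a sequence name with '_'."""
    return "".join("_" if ch in PROBLEMATIC else ch for ch in name)


def sanitize_fasta(lines):
    """Return (cleaned FASTA lines, {original name: cleaned name}).

    Only the first whitespace-delimited token of each header is kept;
    any trailing description is dropped. Sequence lines pass through.
    """
    cleaned = []
    mapping = {}
    used = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            original = parts[0] if parts else ""
            clean = sanitize_name(original)
            # Two different names must not collapse to the same cleaned name.
            if used.get(clean, original) != original:
                raise ValueError(f"name collision after cleaning: {clean}")
            used[clean] = original
            mapping[original] = clean
            cleaned.append(">" + clean)
        else:
            cleaned.append(line)
    return cleaned, mapping


if __name__ == "__main__":
    # Header taken from the thread; the sequence itself is made up.
    fasta = [">rnd-5_family-3543#LTR/Gypsy", "ACGTACGT"]
    new_lines, name_map = sanitize_fasta(fasta)
    print("\n".join(new_lines))
    print(name_map)
```

Keeping the mapping makes it possible to translate family names in the results back to the original RepeatModeler identifiers and their superfamily labels.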