Hi @Song-10-YF
Thanks for your query. Based on what you say and my understanding of RepeatModeler output, the sequences you list are not duplicate families, but rather distinct repeat consensus sequences with similar naming conventions. If you have further questions about the content of the RepeatModeler file, please direct them to the developers of RepeatModeler.
To use your RepeatModeler library as input for McClintock, you have two choices. You could supply the library as the argument to the -c option and let McClintock automatically create the necessary annotation (.gff) and taxonomy (.tsv) files. Or you could run RepeatMasker independently to create your annotation (.gff) file, parse that .gff file to create a two-column .tsv file containing the IDs of annotated elements in the first column and the TE family each element belongs to in the second column, and then supply these files to McClintock with the -g and -t options, respectively.
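As a rough illustration of the second route, here is a minimal Python sketch that turns a .gff file into the two-column taxonomy .tsv. It assumes GFF3-style `key=value` attributes in column 9, and the attribute keys `ID` and `Name` are hypothetical placeholders, as are the example rows. Check how your own .gff stores element IDs and family names and adjust accordingly.

```python
import csv
import io


def gff_to_taxonomy(gff_lines, id_key="ID", family_key="Name"):
    """Return (element_id, family) pairs from GFF-style lines.

    ASSUMPTION: column 9 holds key=value pairs separated by ';', and the
    keys 'ID' and 'Name' carry the element ID and the TE family. Both key
    names are hypothetical; adjust them to match your own .gff file.
    """
    rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete 9-column GFF record
        attrs = {}
        for part in fields[8].split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                attrs[key.strip()] = value.strip()
        if id_key in attrs and family_key in attrs:
            rows.append((attrs[id_key], attrs[family_key]))
    return rows


def write_taxonomy(rows, handle):
    """Write the pairs as a tab-separated, two-column file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Made-up example records, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tRepeatMasker\tdispersed_repeat\t100\t500\t.\t+\t.\tID=te_1;Name=family_A",
    "chr1\tRepeatMasker\tdispersed_repeat\t900\t1200\t.\t-\t.\tID=te_2;Name=family_B",
]
rows = gff_to_taxonomy(example)
buf = io.StringIO()
write_taxonomy(rows, buf)
print(buf.getvalue(), end="")
```

In practice you would pass an open file handle for your .gff instead of the `example` list, and write to a .tsv file instead of the in-memory buffer.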
I hope this helps, Casey
Hi, @cbergman Yes, I would also like to use McClintock's built-in RepeatMasker step by passing the RepeatModeler consensus sequences directly with "-c", but RepeatMasker relies on the "#" in the header line of each consensus sequence as the symbol that identifies the TE superfamily, e.g.:
rnd-5_family-7973#DNA/TcMar-nMITE
rnd-5_family-3543#LTR/Gypsy
rnd-5_family-8667#DNA/hAT-MITE
This naming convention conflicts with the problematic symbols listed by your software (Problematic symbols: ; & ( ) | * ? [ ] ~ { } < ! ^ " ' \ $ / + - #). How can I solve this problem?
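To make the conflict concrete, here is a small Python sketch that checks the example headers against the symbol list quoted above. The function name `flagged_symbols` is just an illustrative helper, not part of McClintock or RepeatMasker, and the symbol set is copied from the message rather than from the software's source.

```python
# Symbols copied from the "Problematic symbols" message quoted above.
PROBLEMATIC = set(";&()|*?[]~{}<!^\"'\\$/+-#")


def flagged_symbols(header):
    """Return the sorted list of problematic symbols found in a header."""
    return sorted(set(header) & PROBLEMATIC)


headers = [
    "rnd-5_family-7973#DNA/TcMar-nMITE",
    "rnd-5_family-3543#LTR/Gypsy",
    "rnd-5_family-8667#DNA/hAT-MITE",
]
for header in headers:
    print(header, flagged_symbols(header))
```

Each of the three headers contains "#", "-" and "/", so all of them would be flagged.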
Looking forward to your reply, thanks!
Hi, @cbergman Thank you very much for your pipeline for finding TE insertions from resequencing data. I have a question about TE consensus sequences. I built the consensus sequences de novo from the reference genome with RepeatModeler, but there are quite a few duplicate superfamilies, for example