How to get the name of consensus sequence?

zhangxinyu328 commented 2 months ago

Hello, I am using HiTE to annotate transposons, but I did not provide a database such as Repbase. The values in the "matching repeat" column of the results, such as "TIR_75", seem to be generated arbitrarily from the consensus sequence library that HiTE builds. Is there any way I can get the names of these sequences now, or do I need to run the program again with a database? I would be very grateful if you could help me. Below are the command I used and some of the results in HiTE.out.

python /share/home/zhanglab/user/zhangxinyu/software/HiTE/HiTE/main.py --genome ./APP-001.genomic.fa --outdir ./test/ --thread 40 --plant 0 --domain 1 --recover 1 --annotate 1 --intact_anno 1 --search_struct 1

   SW  perc  perc  perc  query      position in query      matching             repeat          position in repeat
score  div.  del.  ins.  sequence   begin  end  (left)     repeat               class/family     begin    end  (left)  ID

   40  11.4  10.3   0.0  C39287448      2   98     (2)  +  (CTCTTCC)n           Simple_repeat        1    107     (0)   1
  423  22.4  10.2   0.0  C39287496      1   98     (2)  +  Helitron_0           LTR/ERV            291    398   (239)   2
  337  15.2   3.4   0.0  C39287502     41   99     (1)  C  Helitron_6           Unknown          (650)     68       8   3
  304  16.1   7.9   1.5  C39287506     38  100     (0)  +  TIR_75               DNA/TcMar           52    118   (167)   4
  464  21.0   5.0   0.0  C39287538      1  100     (0)  +  Homology_Non_LTR_18  LTR/ERV            418    522   (719)   5
   17   9.3  12.8   0.0  C39287554      2   48    (52)  +  (TCCCCA)n            Simple_repeat        1     53     (0)   6
  563  16.2   0.0   0.0  C39287574      1   99     (1)  C  TIR_86               DNA/TcMar        (602)   1193    1095   7
   17   8.5   4.9   4.9  C39287662      1   41    (59)  +  A-rich               Low_complexity       1     41     (0)   8
  562   7.8   1.0   9.8  C39287682      1  100     (0)  C  LTR_7_INT            LTR/ERV          (306)   1082     991   9
  506  14.8   1.0   5.2  C39287684      1  100     (0)  C  TIR_178              LTR/ERV          (427)    336     241  10
  240  29.6   0.0   0.0  C39287688     20  100     (0)  C  Homology_Non_LTR_79  LTR/ERV          (280)    108      28  11
  231  20.4   0.0   0.0  C39287710      2   45    (55)  C  TIR_166              DNA/hAT          (486)     62      19  12
   17  17.4   5.6   1.8  C39287710     47  100     (0)  +  (GATATA)n            Simple_repeat        1     56     (0)  13
   17   4.3   0.0   0.0  C39287714      1   24    (76)  +  (CCT)n               Simple_repeat        1     24     (0)  14
  457  24.0   0.0   0.0  C39287728      1  100     (0)  +  LTR_11_LTR           LTR/Gypsy           49    148   (229)  15
  376  16.1   0.0   0.0  C39287736      2   57    (43)  C  Homology_Non_LTR_58  LTR/ERV          (449)     56       1  16
  400  21.2   1.0   1.0  C39287776      1  100     (0)  C  Helitron_4           Unknown           (46)    159      60  17
  435  24.0   3.0   0.0  C39287804      1  100     (0)  +  Helitron_2           LTR/ERV            387    489   (515)  18
  626  16.2   1.0   0.0  C39287806      2  100     (0)  C  TIR_23               Unknown            (3)    474     375  19
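
For readers who want to work with these rows programmatically, the following is a minimal Python sketch of how the "matching repeat" and "repeat class/family" columns could be pulled out of RepeatMasker-style output such as the rows quoted above. The function names (`parse_out_line`, `count_families`) are illustrative and not part of HiTE or RepeatMasker, and the sketch assumes whitespace-separated rows with at least 15 fields, as in the excerpt.

```python
from collections import Counter


def parse_out_line(line):
    """Parse one RepeatMasker-style .out row; return None for header or blank lines."""
    fields = line.split()
    # Assumption: data rows have at least 15 fields and begin with an integer score.
    if len(fields) < 15 or not fields[0].isdigit():
        return None
    return {
        "score": int(fields[0]),
        "query": fields[4],
        "q_begin": int(fields[5]),
        "q_end": int(fields[6]),
        "strand": fields[8],          # "+" or "C" (complement)
        "repeat": fields[9],          # e.g. "TIR_75", a label HiTE assigned automatically
        "class_family": fields[10],   # e.g. "DNA/TcMar"
    }


def count_families(lines):
    """Count how many annotated rows fall into each class/family."""
    counts = Counter()
    for line in lines:
        row = parse_out_line(line)
        if row is not None:
            counts[row["class_family"]] += 1
    return counts


# Three rows copied from the excerpt above, plus a header line that is skipped.
sample = [
    "score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID",
    "304 16.1 7.9 1.5 C39287506 38 100 (0) + TIR_75 DNA/TcMar 52 118 (167) 4",
    "563 16.2 0.0 0.0 C39287574 1 99 (1) C TIR_86 DNA/TcMar (602) 1193 1095 7",
    "231 20.4 0.0 0.0 C39287710 2 45 (55) C TIR_166 DNA/hAT (486) 62 19 12",
]
counts = count_families(sample)
for family, n in sorted(counts.items()):
    print(family, n)
```

The repeat label (for example "TIR_75") is simply the identifier of a sequence in the library HiTE generated, so it can be used to look that consensus sequence up in the library FASTA file.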

CSU-KangHu commented 2 months ago

Hi @zhangxinyu328,

I'm curious about why you need to know the names of the TEs. Is it essential for your work? The TE names generated by HiTE are assigned automatically. If you're looking for specific TE names, you might find databases like Repbase or Dfam more suitable for your needs.

Best regards, Kang

zhangxinyu328 commented 2 months ago

Thanks for your reply. Please forgive me for being a novice in transposon research. I want to study the evolution of transposons in different species by comparing the same specific TE across species. I have previously annotated transposons using RepeatModeler, RepeatMasker, and similar tools, and I want to try HiTE to see whether it can find more types of TEs. I will try running HiTE again with a database. Would you expect HiTE to find more TE sequences with a database than without one?

CSU-KangHu commented 2 months ago

If I understand correctly, you’re looking to perform a panTE analysis using HiTE. I suggest running HiTE on the genomes of different species, which will generate a TE library ('confident_TE.cons.fa') for each genome. Afterward, merge all these TE libraries, remove redundancies to create a panTE library, and use RepeatMasker to annotate the genomes again with this panTE library. This will allow you to determine the distribution and proportion of the same TE across different species.

Best regards, Kang
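
The merge step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the species tags (`spA`, `spB`), the toy sequences, and the `name#class` header layout are hypothetical, and the function names are not part of HiTE. It also removes only exact duplicates (including reverse complements), which is a simple stand-in; the thread does not say which redundancy-removal tool to use, and a real panTE library would normally rely on a similarity-based clustering tool instead. Prefixing each header with a species tag keeps automatically assigned names such as "TIR_75" from colliding when libraries from different genomes are combined.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_libraries(libraries):
    """Merge per-species TE libraries into one list of (header, sequence).

    libraries maps a species tag to the FASTA text of that species' library.
    Only exact duplicates (on either strand) are dropped; this is a stand-in
    for proper similarity-based redundancy removal.
    """
    seen = set()
    merged = []
    for tag, text in libraries.items():
        for header, seq in read_fasta(text):
            upper = seq.upper()
            # Use a strand-independent key so a sequence and its reverse
            # complement count as the same entry.
            key = min(upper, reverse_complement(upper))
            if key in seen:
                continue
            seen.add(key)
            merged.append((f"{tag}_{header}", seq))
    return merged


# Hypothetical toy libraries; real input would be each genome's confident_TE.cons.fa.
libraries = {
    "spA": ">TIR_75#DNA/TcMar\nACGTTGCA\n>Helitron_0#Unknown\nGGGGAAAACCCC\n",
    "spB": ">TIR_75#DNA/hAT\nTTTTCCGG\n>Helitron_4#Unknown\nGGGGTTTTCCCC\n",
}
merged = merge_libraries(libraries)
for header, seq in merged:
    print(f">{header}\n{seq}")
```

The merged output could then be written to a single FASTA file and supplied to RepeatMasker as a custom library for each genome.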

CSU-KangHu / HiTE

How to get the name of consensus sequence？ #20