How to deal with huge differences in the running results of different software?

abcyulongwang commented 1 month ago

Dear developer,

Thank you for designing such great software, which excels in several aspects of performance and running time!

I ran TE detection on an animal genome of about 2.5 Gb with both HiTE and EDTA 2.2. The results are shown in the figure:

The total number of TEs seems similar, but the ratio of SINEs to LINEs is quite different. Because these two TE classes account for a large proportion of the genome, this almost twofold difference confuses me. To be honest, the SINE ratio from HiTE is similar to that of previous studies, but the LINE result is much lower. Do I need to manually curate the TEs predicted by EDTA and HiTE? Or is there a strategy for integrating the results of the two tools that would help me recover as many real TE sequences as possible? In addition, I used the RepeatModeler+RepeatMasker strategy, and the SINE and LINE ratios obtained were closer to those of EDTA 2.2.

I sincerely hope to get your professional advice, which would be very helpful to me.

Best wishes, yulong

CSU-KangHu commented 1 month ago

Hi @abcyulongwang,

Thank you very much for using HiTE. Based on my understanding, I hope my responses to your questions are helpful.

In fact, I believe that evaluating the performance of a tool based on the proportion of the genome it annotates can be highly misleading. For example, consider a genome with 10 intact TE sequences. Tool A identifies these 10 intact TE sequences, while Tool B identifies 100 fragments of these intact TEs. In terms of the proportion of the genome annotated, Tool A and Tool B might achieve similar ratios. However, in terms of accurate genome annotation, Tool A is clearly superior. Therefore, HiTE aims to identify more full-length TEs. I will address your question in the following points:

1. It is quite common to observe different results for non-LTR (LINE and SINE) detection across tools. Detecting non-LTRs is particularly challenging because the structural features used to identify them, such as polyA tails or target site duplications (TSDs), are weak. This often leads to many false positives in non-LTR detection tools.
2. As far as I know, EDTA directly uses the non-LTR results identified by RepeatModeler2, so it is not surprising that EDTA produces non-LTR results similar to those of RepeatModeler2. RepeatModeler2 identifies transposons based on their repetitive characteristics, so the sequences obtained may not have complete TE structures (i.e., they may be fragmented). HiTE still treats such sequences as false positives.
3. To obtain reliable full-length non-LTR transposons, HiTE imposes strict criteria. First, it identifies candidate sequences based on their repetitive nature. It then locates their full-length copies in the genome, builds a multiple sequence alignment, and searches for accurate TE boundaries. We require that most copies exhibit polyA+TSD structural features before confirming them as true non-LTR transposons.
4. However, some non-LTRs do not possess TSDs, so HiTE may inevitably miss some genuine non-LTR elements. To address this issue, HiTE also employs a homology-based identification strategy, searching the genome for full-length elements that are highly homologous to entries in an existing non-LTR library. Compared to existing methods, HiTE has therefore significantly improved non-LTR identification. More details can be found in the peer review comments for the HiTE article, specifically in the second-round responses addressing Reviewer 1's comment 3.
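
As a rough illustration of the "most copies exhibit polyA+TSD" criterion described above, here is a minimal Python sketch. It is not HiTE's actual implementation: the `Copy` class, the function names, and all thresholds (tail window, minimum A fraction, TSD length range, majority cutoff) are hypothetical choices made for this example, and the TSD check only looks for an exact match between the end of the upstream flank and the start of the downstream flank.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One genomic copy of a candidate element (hypothetical structure)."""
    left_flank: str   # sequence immediately upstream of the copy
    element: str      # the copy itself, 5' to 3'
    right_flank: str  # sequence immediately downstream of the copy


def has_polya_tail(element, window=10, min_fraction=0.8):
    """Return True if the last `window` bases are mostly A (assumed thresholds)."""
    tail = element[-window:].upper()
    if len(tail) < window:
        return False
    return tail.count("A") / window >= min_fraction


def find_tsd(left_flank, right_flank, min_len=5, max_len=20):
    """Return the longest exact TSD flanking the copy, or None.

    A TSD is modelled here as the end of the upstream flank being
    identical to the start of the downstream flank.
    """
    left, right = left_flank.upper(), right_flank.upper()
    longest = min(max_len, len(left), len(right))
    for k in range(longest, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return None


def looks_like_non_ltr(copies, min_support=0.5):
    """Accept the candidate only if more than `min_support` of copies have polyA+TSD."""
    if not copies:
        return False
    supported = sum(
        1
        for c in copies
        if has_polya_tail(c.element)
        and find_tsd(c.left_flank, c.right_flank) is not None
    )
    return supported / len(copies) > min_support


# Toy data: two copies with a polyA tail and a 7 bp TSD, one copy with neither.
good = Copy("GGCCTTAGCAT", "ACGTACGTGG" + "A" * 12, "TTAGCATCCGG")
bad = Copy("GGCCTTAGCAT", "ACGTACGTGGCCGGTTCCGG", "CCGGATATATA")
print(looks_like_non_ltr([good, good, bad]))
print(looks_like_non_ltr([good, bad, bad]))
```

A real pipeline would have to tolerate mismatches in TSDs and degraded polyA tails, which is part of why strict structural filtering can miss genuine elements.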

Although we believe that HiTE identifies non-LTRs with high confidence, the best method currently remains manual curation of non-LTR elements. Of course, we plan to update HiTE in the future to address this issue.
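
To make the earlier Tool A versus Tool B comparison concrete, the following Python sketch computes two numbers for each tool: the fraction of the genome annotated, and the number of true elements recovered at (near) full length. Everything here is made up for illustration: the genome size, the element coordinates, the function names, and the 95% coverage cutoff are assumptions, not values taken from HiTE, EDTA, or RepeatMasker.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by half-open (start, end) intervals,
    counting overlapping bases only once."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip bases already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def full_length_hits(annotations, true_elements, min_cover=0.95):
    """Count true elements for which a single annotation covers at least
    `min_cover` of their length (assumed cutoff)."""
    hits = 0
    for t_start, t_end in true_elements:
        for a_start, a_end in annotations:
            overlap = min(t_end, a_end) - max(t_start, a_start)
            if overlap >= min_cover * (t_end - t_start):
                hits += 1
                break
    return hits


# Toy genome: 100 kb containing 10 intact TEs of 1 kb each.
genome_length = 100_000
true_elements = [(i * 10_000, i * 10_000 + 1_000) for i in range(10)]

# Tool A reports the 10 intact elements.
tool_a = list(true_elements)

# Tool B reports 100 fragments of 100 bp that tile the same elements.
tool_b = [
    (start + j * 100, start + (j + 1) * 100)
    for start, _ in true_elements
    for j in range(10)
]

for name, calls in (("Tool A", tool_a), ("Tool B", tool_b)):
    print(
        name,
        "annotated fraction:", covered_fraction(calls, genome_length),
        "full-length elements:", full_length_hits(calls, true_elements),
    )
```

In this toy case both tools annotate the same fraction of the genome, yet only Tool A recovers any element at full length, so comparing annotated proportions alone cannot separate them.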

Best regards, Kang

abcyulongwang commented 1 month ago

Thank you for your reply. It clears up my confusion to some extent, and I will keep working on this following your suggestions.

Best wishes! yulong

CSU-KangHu / HiTE

How to deal with huge differences in the running results of different software? #5