adriludwig closed this issue 1 month ago
Thanks for sharing these concerns, adriludwig. I believe these issues are valid, but unfortunately we do not have the capacity to upgrade the software at this time. Hopefully the next generation of TE methods will be able to address them.
As suggested by Dr. Martin Hemberg following our recent email correspondence, I’d like to summarize some issues I encountered with TransposonUltimate (reasonaTE and RFSB). I fully recognize and appreciate the significant effort that has gone into developing a tool for the transposon research community. However, I felt compelled to share these concerns, as the issues could lead to incorrect conclusions.
1 – Misclassification of known TEs and classification of non-TE sequences
Drosophila melanogaster has a well-curated TE library and is the subject of several works involving TEs. One of the most represented TEs identified with reasonaTE is Zator, a group of DNA transposons never described for this species. Upon inspection, the sequences identified as Zator are predominantly LTR/Bel-Pao elements (a group that is missing in TransposonUltimate), along with P, CR1, and Tc1 elements, among others.
I also checked the reasonaTE GitHub test files, and the TEs reported in the results for the test data contain misclassifications as well (checked using CENSOR and CD-search). Examples include: Transposon1 is not a TE but a repetitive region containing a Serpentine Receptor domain; Transposon12, classified as Gypsy, is a LINE with flanking regions containing a Serpentine Receptor domain; Transposon21 comprises copies of a glucuronosyltransferase gene; Transposon227 is also a gene family (Glycosyltransferase family); Transposon11, Transposon253, and Transposon254 are satellites (MSAT1_CE) classified as Zator.
Even for plants, several inconsistencies can be found in the classification, as observed when checking the first sequences from Oryza sativa identified in the TransposonUltimate paper. Examples include: Transposon2 is a genomic region with a De-etiolated protein domain; Transposon4 is correctly classified as Gypsy, but the sequence contains a large region of non-TE sequence; Transposon5 and Transposon6, classified as Helitron and Copia, respectively, are probably kinase gene regions; Transposon8, classified as hAT, is not a TE and has aminoacyl-tRNA ligase domains.
TransposonUltimate (RFSB tool) appears to classify virtually any sequence as a TE, including non-TE elements like satellites, simple repeats, and even random fake sequences. This is particularly concerning because reasonaTE combines the outputs of multiple tools that identify TEs based on repetitiveness, seemingly without a proper filtering step to remove non-TE sequences prior to classification. It is well known that programs like RepeatModeler identify all kinds of repeats, which, depending on the species, can include a large fraction of gene families and satellites. Thus, a filtering step is crucial to eliminate these sequences before the classification step.
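To make the idea of a pre-classification filter more concrete, here is a minimal sketch in Python. It is only an illustration and not how reasonaTE, RFSB, or RepeatModeler work: the function names (`tandem_period`, `drop_tandem_repeats`) and the thresholds (`max_period`, `min_identity`) are hypothetical choices of mine. It flags candidates that look like satellites or simple repeats by checking whether the sequence matches itself when shifted by a short period.

```python
from typing import Dict, Optional


def tandem_period(seq: str, max_period: int = 200,
                  min_identity: float = 0.9) -> Optional[int]:
    """Return the smallest period p at which seq matches itself shifted by p
    in at least min_identity of the compared positions, or None.

    Hypothetical heuristic: periods are only tested up to half the sequence
    length, so a hit means at least two tandem copies of the unit are present.
    """
    seq = seq.upper()
    n = len(seq)
    for p in range(1, min(max_period, n // 2) + 1):
        # Count positions where the base equals the base p positions later.
        matches = sum(1 for i in range(n - p) if seq[i] == seq[i + p])
        if matches / (n - p) >= min_identity:
            return p
    return None


def drop_tandem_repeats(candidates: Dict[str, str], **kwargs) -> Dict[str, str]:
    """Keep only candidates for which no short tandem period is detected."""
    return {name: seq for name, seq in candidates.items()
            if tandem_period(seq, **kwargs) is None}


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    candidates = {
        "simple_repeat": "AT" * 50,
        "satellite_like": "ACGTTGCA" * 20,
    }
    for name, seq in candidates.items():
        print(name, tandem_period(seq))
```

A check like this would only catch tandem and simple repeats. Gene families such as the glycosyltransferase and kinase examples above would need a different screen, for instance a protein-domain search along the lines of the CD-search checks I ran manually, and a real pipeline would more likely rely on an established tandem repeat finder than on this toy heuristic.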
2 – Potential problem with training the TE dataset