DerKevinRiehl / transposon_annotation_reasonaTE

Transposon annotation tool "reasonaTE" (part of TransposonUltimate)
GNU General Public License v3.0

mutator_like elements and repeatmask #2

Closed iscambes closed 2 years ago

iscambes commented 3 years ago

Dear Kevin, thank you so much for this fantastic software. I was looking for exactly this kind of tool: a wrapper around many different TE annotation programs.

I have three questions regarding the annotation of TEs.

############ (QUESTION1) ################

I am highly interested in annotating all Mutator-like elements (MULEs) in one nematode species. However, this family of DNA transposons seems to be missing from your code, right? (Maybe I am missing something.) The only categories your program is able to detect are:

1:Class I, Retrotransposon
1/1:LTR, Retrotransposon
1/1/1:Copia, LTR, Retrotransposon
1/1/2:Gypsy, LTR, Retrotransposon
1/1/3:ERV, LTR, Retrotransposon
1/2:Non-LTR, Retrotransposon
1/2/1:LINE, Non-LTR, Retrotransposon
1/2/2:SINE, Non-LTR, Retrotransposon
2:Class II, DNA Transposon
2/1:TIR, DNA Transposon
2/1/1:Tc1-Mariner, TIR, DNA Transposon
2/1/2:hAT, TIR, DNA Transposon
2/1/3:CMC, TIR, DNA Transposon
2/1/4:Sola, TIR, DNA Transposon
2/1/5:Zator, TIR, DNA Transposon
2/1/6:Novosib, TIR, DNA Transposon
2/2:Helitron, DNA Transposon
2/3:MITE, DNA Transposon

Am I right? What about MULE elements?

############ (QUESTION2) ################

Also, I do not understand how I can get the following two results:

1.- With TIRvish I obtain a clear DNA transposon, with its TSD and TIR sequences:

seq1 TIRvish tsd 13153 13155 . + . transposon=6 ;description=Left TSD of transposon 6
seq1 TIRvish tir 13156 13995 . + . transposon=6 ;description=Left TIR of transposon 6
seq1 TIRvish tir 19926 20769 . + . transposon=6 ;description=Right TIR of transposon 6
seq1 TIRvish tsd 20770 20772 . + . transposon=6 ;description=Right TSD of transposon 6

2.- However, when it comes to annotating the transposon, I get the following:

seq1 reasonaTE transposon 13153 20772 . + . transposon=6;class=1/1/2(Gypsy,LTR,Retrotransposon)

I do not understand how, after inferring the TIR and TSD sequences (meaning a "clear" DNA transposon), the software can determine that the transposon is an LTR retrotransposon.

Why is that?

After playing around a bit with BLAST, I suspect that this element is a MULE.

############ (QUESTION3) ################

When I run the whole transposon_annotation_reasonaTE pipeline, the results in the RepeatMasker folder show that only 1.4% of the C. elegans genome is masked. However, when I run RepeatMasker independently with the default settings, around 18.93% of the genome is masked. Why this difference? I used the default RepeatMasker commands you suggest:

reasonaTE -mode annotate -projectFolder workspace${genome} -projectName testProject${genome} -tool repeatmodel
reasonaTE -mode annotate -projectFolder workspace${genome} -projectName testProject${genome} -tool repMasker

Thank you in advance, Isra

DerKevinRiehl commented 3 years ago

Dear Isra, first of all, thank you for your interest in my software.

############ (QUESTION1) ################

When developing this software we tried to achieve the best possible performance. The database we used for training simply did not provide sufficient examples for MULE and Mutator, so we did not provide a specific subcategory for them.

I recommend you check out the preprint manuscript on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.04.30.442214v1

In footnote 2 of the taxonomy you can see that we classify MULE and Mutator only as "TIR" (class 2/1), not more specifically.

I totally agree that it would be very nice to have further subcategories for all of the transposon families listed in the footnote. However, this would decrease classification performance, as there are simply not enough examples of them in the database used for training the model.

Does this answer your question? Do you have suggestions or wishes on that point? :-)

############ (QUESTION2) ################

That is a very good question. The first GFF excerpt you show is the output of TIRvish, which you can find in projectFolder > tirvish (resp. tirvish_rc). Besides transposon annotations it also covers structural features, right?

The second excerpt comes from projectFolder > finalResults > FinalAnnotations_Transposons.gff3. The reasonaTE annotation pipeline uses a transposon sequence classifier, RFSB, which is the one you asked about in question 1 (check the page for the RFSB classifier). reasonaTE therefore does not necessarily use the structural features to assign annotations to specific transposon classes; it relies on other signals such as k-mer frequencies (see the manuscript for details).

The structural features are not lost, however. They are just in another file that you can find here: projectFolder > finalResults > FinalAnnotations_StructuralFeatures.gff3.
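To make the k-mer idea concrete, here is a minimal Python sketch of how a sequence can be turned into a k-mer frequency vector. It only illustrates the kind of feature mentioned above; it is not the RFSB implementation, and the function name `kmer_frequencies` and the choice of k=3 are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Relative frequency of every DNA k-mer (A/C/G/T only) in seq.

    Illustrative helper, not part of reasonaTE or RFSB.
    """
    seq = seq.upper()
    # Count every window of length k along the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed, lexicographic k-mer order so vectors are comparable across sequences.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}
```

A classifier trained on such vectors never sees TIR or TSD coordinates, which is why its verdict can differ from that of a structure-based tool.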

So now the answer to your question: some tools, like TIRvish, annotate sequences that they consider to be TIR transposons (DNA transposons). When we then run our RFSB classifier, which classifies with very high performance, it finds that the annotation is most probably an LTR transposon. When investigating the different tools, we found that many tool annotations do not necessarily correspond to the transposon class the tool is dedicated to (see the manuscript, the heatmaps at the end).

Does this make sense to you? Please let me know :-) By the way, could you please be a bit more specific about how you determined that this is a MULE transposon? That would be very helpful.
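To find such disagreements systematically, the two GFF files can be cross-checked. The following Python sketch flags final annotations classified as Class I (codes starting with 1) that fully contain at least one `tir` structural feature. It is an illustration under assumptions: the helper names are hypothetical, features are matched by coordinates rather than by the `transposon=` ID (the thread does not say whether IDs are shared between files), and the example lines are the ones quoted in this thread.

```python
def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict (illustrative helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Fall back to whitespace splitting for lines whose tabs were lost.
        fields = line.split(None, 8)
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = fields
    attributes = {}
    for part in attrs.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {"seqid": seqid, "source": source, "type": ftype,
            "start": int(start), "end": int(end), "strand": strand,
            "attributes": attributes}


def read_features(lines):
    """Yield parsed features, skipping comments and blank lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_gff_line(line)


def class_conflicts(annotation_lines, feature_lines):
    """Return (annotation, n_tirs) for Class I annotations containing TIRs."""
    tirs = [f for f in read_features(feature_lines) if f["type"] == "tir"]
    conflicts = []
    for ann in read_features(annotation_lines):
        cls = ann["attributes"].get("class", "")
        top_level = cls.split("(")[0].split("/")[0]  # "1/1/2(Gypsy,...)" -> "1"
        if top_level != "1":
            continue
        inside = [t for t in tirs
                  if t["seqid"] == ann["seqid"]
                  and ann["start"] <= t["start"] and t["end"] <= ann["end"]]
        if inside:
            conflicts.append((ann, len(inside)))
    return conflicts


# Lines quoted in this thread (whitespace-separated as they appear here).
TIRVISH_LINES = [
    "seq1 TIRvish tsd 13153 13155 . + . transposon=6 ;description=Left TSD of transposon 6",
    "seq1 TIRvish tir 13156 13995 . + . transposon=6 ;description=Left TIR of transposon 6",
    "seq1 TIRvish tir 19926 20769 . + . transposon=6 ;description=Right TIR of transposon 6",
    "seq1 TIRvish tsd 20770 20772 . + . transposon=6 ;description=Right TSD of transposon 6",
]
ANNOTATION_LINES = [
    "seq1 reasonaTE transposon 13153 20772 . + . transposon=6;class=1/1/2(Gypsy,LTR,Retrotransposon)",
]

for ann, n_tirs in class_conflicts(ANNOTATION_LINES, TIRVISH_LINES):
    print(ann["seqid"], ann["start"], ann["end"],
          ann["attributes"]["class"], "contains", n_tirs, "TIR features")
```

On real data, the same functions could be fed the lines of FinalAnnotations_Transposons.gff3 and FinalAnnotations_StructuralFeatures.gff3. The flagged elements are candidates for manual inspection, for example by BLAST.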

############ (QUESTION3) ################

I cannot answer this question exactly, as I cannot see your output, but my suspicion is that you are counting all of your RepeatMasker output to arrive at the 18%. Many RepeatMasker and RepeatModeler outputs contain unclassified repeat annotations (for these repeats, RepeatMasker and RepeatModeler were not sure whether they really are transposons). reasonaTE only considers the annotations that RepeatMasker and RepeatModeler declare as transposons. This is why you only find about 1%.
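One way to test this suspicion is to recompute the masked fraction from a RepeatMasker `.out` file twice: once over all hits, and once after dropping the hits whose class is not a transposon. The Python sketch below does this under assumptions: the standard whitespace-separated `.out` layout (query sequence, begin and end in columns 5 to 7, class/family in column 11), and a hypothetical `NON_TE_CLASSES` set. Which classes reasonaTE actually discards is not stated in this thread, so that set is a guess to adapt.

```python
# Assumed set of non-transposon class labels; adjust to your own output.
NON_TE_CLASSES = {"Unknown", "Simple_repeat", "Low_complexity", "Satellite"}


def parse_out(lines):
    """Yield (sequence, begin, end, class_family) from RepeatMasker .out lines."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        yield fields[4], int(fields[5]), int(fields[6]), fields[10]


def covered_bp(hits):
    """Number of bases covered by the hits, counting overlaps only once."""
    by_seq = {}
    for seq, start, end, _cls in hits:
        by_seq.setdefault(seq, []).append((start, end))
    total = 0
    for intervals in by_seq.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


def masked_fractions(lines, genome_size, non_te_classes=NON_TE_CLASSES):
    """Return (fraction covered by all hits, fraction covered by TE hits)."""
    hits = list(parse_out(lines))
    te_hits = [h for h in hits if h[3].split("/")[0] not in non_te_classes]
    return covered_bp(hits) / genome_size, covered_bp(te_hits) / genome_size
```

If the first fraction is close to the standalone RepeatMasker figure and the second is close to the pipeline figure, the difference is explained by unclassified and non-transposon repeats.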

What do you think about that?

I hope I could answer your questions. Looking forward to your answers and to staying in touch. Best regards, Kevin