TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Calculation of LTR insertion time #122

Closed JMUwenjian closed 4 months ago

JMUwenjian commented 4 months ago

Dear author, I am immensely grateful for providing such a convenient and useful tool. I have successfully completed the data testing and achieved satisfactory annotation results. Now, I have a question I would like to consult with you regarding. Specifically, if I intend to conduct an analysis on the insertion timing of LTRs based on all the output generated by EarlGrey, which output file should I utilize, and what software would be most suitable for this purpose?

TobyBaril commented 4 months ago

Hi! Specifically for LTR insertions, the general way we calculate age is to extract the LTR sequences from each end of full-length LTR elements and compare the divergence between them. To do this, you will need to use specific programs to identify intact LTR elements. You can extract these from the Earl Grey outputs if you go to: /path/to/[species]EarlGrey/[species]_mergedRepeats/looseMerge/[species]LtrFinder and then look at the GFF3 file. You will see entries like this:

ctg_1   LTR_FINDER_parallel     repeat_region   327141  333079  .       +       .       ID=repeat_region1
ctg_1   LTR_FINDER_parallel     LTR_retrotransposon     327141  333079  .       +       .       ID=LTR_retrotransposon1;Parent=repeat_region1;tsd=CTAGC;ltr_identity=0.852;seq_number=0
ctg_1   LTR_FINDER_parallel     long_terminal_repeat    327141  327343  .       +       .       Parent=LTR_retrotransposon1
ctg_1   LTR_FINDER_parallel     long_terminal_repeat    332857  333079  .       +       .       Parent=LTR_retrotransposon1

In this case, the two bottom rows with long_terminal_repeat labels are the LTRs at either end of the full-length element, so you can extract the sequences at these coordinates. You will also see that on the LTR_retrotransposon line, column 9 has a lot of information, including ltr_identity=0.852. This is the sequence similarity between the 5' and 3' LTR sequences. If this is 1, then the LTRs are identical and the insertion is extremely recent. If you have a neutral mutation rate for your species, you can then apply this to the divergence (1 - ltr_identity) to estimate time of insertion (with error, as this assumes TEs are neutral, when they are likely at least under weak purifying selection). Even with just the divergence numbers, you will be able to see which LTRs are more recent and which are more ancient.
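
The steps described above can be sketched in Python. This is only a minimal illustration, not part of Earl Grey: the function names `parse_attributes` and `read_ltr_elements` are hypothetical, the input is assumed to be tab-delimited GFF3, and each LTR_retrotransposon line is assumed to appear before its long_terminal_repeat children, as in the entries shown above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute string such as 'ID=x;Parent=y' into a dict."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def read_ltr_elements(lines):
    """Collect divergence and LTR coordinates for each intact LTR element.

    Assumes tab-delimited GFF3 in which the LTR_retrotransposon line
    precedes its long_terminal_repeat children.
    """
    elements = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip GFF3 headers and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        seqid, feature, start, end, strand = (
            fields[0], fields[2], fields[3], fields[4], fields[6])
        attrs = parse_attributes(fields[8])
        if feature == "LTR_retrotransposon" and "ID" in attrs and "ltr_identity" in attrs:
            identity = float(attrs["ltr_identity"])
            elements[attrs["ID"]] = {
                "seqid": seqid,
                "strand": strand,
                "identity": identity,
                "divergence": 1.0 - identity,  # uncorrected 5'/3' LTR divergence
                "ltrs": [],
            }
        elif feature == "long_terminal_repeat" and attrs.get("Parent") in elements:
            # 1-based inclusive coordinates of one terminal repeat
            elements[attrs["Parent"]]["ltrs"].append((int(start), int(end)))
    return elements


# The element from the example above, rebuilt as tab-delimited lines.
example_rows = [
    ["ctg_1", "LTR_FINDER_parallel", "LTR_retrotransposon", "327141", "333079",
     ".", "+", ".",
     "ID=LTR_retrotransposon1;Parent=repeat_region1;tsd=CTAGC;ltr_identity=0.852;seq_number=0"],
    ["ctg_1", "LTR_FINDER_parallel", "long_terminal_repeat", "327141", "327343",
     ".", "+", ".", "Parent=LTR_retrotransposon1"],
    ["ctg_1", "LTR_FINDER_parallel", "long_terminal_repeat", "332857", "333079",
     ".", "+", ".", "Parent=LTR_retrotransposon1"],
]
example_lines = ["\t".join(row) for row in example_rows]
elements = read_ltr_elements(example_lines)

# List elements from most recent (lowest divergence) to most ancient.
for element_id, info in sorted(elements.items(), key=lambda kv: kv[1]["divergence"]):
    print(element_id, info["seqid"], info["ltrs"], round(info["divergence"], 3))
```

To run this on a real file, the list of lines could be replaced with an open file handle on the GFF3. The collected coordinates could then be passed to whichever sequence extraction tool you prefer.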

JMUwenjian commented 3 months ago

Thank you very much for your reply, which resolved my questions. Can I understand the calculation of LTR insertion time as follows: T = (1 - ltr_identity) / (2r), where ltr_identity is the value reported on the LTR_retrotransposon line and r is the number of substitutions per synonymous site per year? Looking forward to your reply. Wish you all the best.
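
For illustration, the formula in the question has the commonly used form T = K / (2r), with K as the divergence between the two LTRs (here 1 - ltr_identity) and r as the substitution rate per site per year. The sketch below is a hedged example only: the function name `insertion_time` is hypothetical, the rate of 1e-8 is a placeholder and not a value for any particular species, and the optional Jukes-Cantor correction is an extra step that was not mentioned in the thread.

```python
import math


def insertion_time(ltr_identity, rate, jukes_cantor=False):
    """Estimate insertion time in years as T = K / (2 * rate).

    ltr_identity: similarity between the 5' and 3' LTRs (0 to 1).
    rate: assumed neutral substitution rate per site per year.
    jukes_cantor: if True, correct the raw divergence for multiple hits.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    p = 1.0 - ltr_identity  # raw proportion of differing sites
    if jukes_cantor:
        if p >= 0.75:
            raise ValueError("divergence too high for Jukes-Cantor correction")
        k = -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
    else:
        k = p
    # Both LTRs accumulate substitutions independently, hence the factor of 2.
    return k / (2.0 * rate)


# Placeholder rate for illustration only; substitute a rate for your species.
ASSUMED_RATE = 1e-8

raw_estimate = insertion_time(0.852, ASSUMED_RATE)
corrected_estimate = insertion_time(0.852, ASSUMED_RATE, jukes_cantor=True)
print(f"uncorrected: {raw_estimate:.3g} years")
print(f"Jukes-Cantor corrected: {corrected_estimate:.3g} years")
```

With the placeholder rate, the element from the example above (ltr_identity=0.852) comes out at roughly 7.4 million years uncorrected. The corrected estimate is somewhat older, because the correction accounts for sites that have changed more than once. Both numbers carry the caveat already noted in the thread: they assume the LTRs evolve neutrally.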