Incorporating known TEs

CSU-KangHu / HiTE

High-precision TE Annotator

GNU General Public License v3.0

Incorporating known TEs #6

Open davidaray opened 1 month ago

davidaray commented 1 month ago

I'm curious as to whether one can use a library of already curated TEs to enhance the analysis and eliminate duplication with previous library work.

I have several species that we have manually curated and I'm hoping to use HiTE. I plan to compare the HiTE libraries to our curated TEs using any of several tools but was wondering if there is a mechanism built in that would allow me to do this automatically.

CSU-KangHu commented 1 month ago

Hello @davidaray,

Thank you for your interest in HiTE. If I understand correctly, you want to identify which TEs are shared between the HiTE library and your curated library, and which differ. You might consider following the benchmarking method of RepeatModeler2 for this analysis. In that benchmark, the "Perfect" category covers test TEs that match a curated TE with only minor differences, defined as > 95% coverage and < 5% divergence.

You can run the following commands:

Run RepeatMasker to compare the curated library with the HiTE library:
```
RepeatMasker -lib ${curated_lib} -nolow -pa ${threads} ${HiTE_lib}
```
Execute the get_family_summary_paper.sh script:
```
sh get_family_summary_paper.sh ${HiTE_lib}.out
```
You can download the get_family_summary_paper.sh script from here.

This will generate several useful files in the current directory, such as file_final.0.1.txt, which shows the lengths and coverage ratios between the curated TEs and HiTE TEs. You can set a coverage threshold based on your needs; TEs exceeding this threshold can be considered shared, while those below can be regarded as novel TEs. Note that file_final.0.1.txt only lists TEs that can be aligned, so any TEs not appearing in this file should also be considered novel TEs.
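As a rough illustration of the shared/novel split described above, here is a minimal Python sketch that works directly on the RepeatMasker `.out` table rather than on the script's output files. It is not part of HiTE or of `get_family_summary_paper.sh`. The function names `parse_out_line` and `split_shared_novel` are hypothetical, the defaults simply mirror the "Perfect" thresholds mentioned earlier, and each alignment row is judged on its own (fragmented hits are not merged), so treat it as a starting point only.

```python
def parse_out_line(line):
    """Parse one RepeatMasker .out row; return None for header or blank lines."""
    f = line.split()
    if len(f) < 15 or not f[0].isdigit():
        return None
    # Query columns: begin, end, (left). Here the query is a HiTE sequence.
    q_begin, q_end, q_left = int(f[5]), int(f[6]), int(f[7].strip("()"))
    if f[8] == "C":
        # Complement hits list the repeat columns as (left), end, begin.
        r_left, r_end, r_begin = int(f[11].strip("()")), int(f[12]), int(f[13])
    else:
        r_begin, r_end, r_left = int(f[11]), int(f[12]), int(f[13].strip("()"))
    return {
        "hite_id": f[4],
        "curated_id": f[9],
        "divergence": float(f[1]),
        # Sequence length is reconstructed as end + left.
        "hite_cov": (q_end - q_begin + 1) / (q_end + q_left),
        "curated_cov": (r_end - r_begin + 1) / (r_end + r_left),
    }


def split_shared_novel(out_lines, hite_ids, min_cov=0.95, max_div=5.0):
    """Return (shared, novel) sets of HiTE sequence IDs.

    A HiTE TE counts as shared if a single alignment covers more than
    min_cov of both sequences with less than max_div percent divergence.
    Everything else, including TEs with no alignment at all, is novel.
    """
    shared = set()
    for line in out_lines:
        hit = parse_out_line(line)
        if hit is None:
            continue
        if (hit["hite_cov"] > min_cov
                and hit["curated_cov"] > min_cov
                and hit["divergence"] < max_div):
            shared.add(hit["hite_id"])
    novel = set(hite_ids) - shared
    return shared, novel
```

To use it, pass the lines of `${HiTE_lib}.out` together with the sequence IDs taken from the HiTE library FASTA headers. Loosening `min_cov` corresponds to choosing a lower coverage threshold, as suggested above.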

davidaray commented 1 month ago

Sorry for not replying sooner. I was comparing methods you described to my own established pipelines. I have some rather concerning results. While many of the classifications proposed by HiTE are excellent and I find some good correlations between what HiTE proposes and previous manual curations, I'm finding many misclassifications. I include one example here.

In the image, I analyzed what HiTE labeled as a TIR DNA transposon. However, when I examined it using the methods available through TE-Aid (https://github.com/clemgoub/TE-Aid), I am getting a quite different result. As you can see from the image, this is quite obviously a fragment of a LINE element. It shows the characteristic reduction in copies as you move from 3'-5', it has a very nice match to a known L1 polymerase, and it has a repetitive tail typical of LINEs. Furthermore, it does not harbor any Terminal Inverted Repeats that I can find. As you can see from the upper left box, there are well over 34,000 of these in this genome assembly. Were I to continue calling this a TIR DNA transposon, I would be mislabeling over 34,000,000 bp of the assembly.
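The 3'-biased copy distribution mentioned above can be checked numerically. The sketch below is only an illustration and is not how TE-Aid is implemented. It assumes you already have the consensus coordinates of each genomic hit (for example from a BLAST search) and that the consensus is in sense orientation. The function names `coverage_profile` and `three_prime_bias` are hypothetical.

```python
def coverage_profile(consensus_len, hits):
    """Per-position hit depth along a consensus.

    hits: iterable of (start, end) pairs, 1-based inclusive, in consensus
    coordinates. Returns a list of length consensus_len.
    """
    diff = [0] * (consensus_len + 2)
    for start, end in hits:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, running = [], 0
    for pos in range(1, consensus_len + 1):
        running += diff[pos]
        depth.append(running)
    return depth


def three_prime_bias(depth):
    """Ratio of mean depth in the 3' half to mean depth in the 5' half."""
    half = len(depth) // 2
    five_prime = sum(depth[:half]) / half
    three_prime = sum(depth[half:]) / (len(depth) - half)
    return three_prime / five_prime if five_prime else float("inf")
```

A ratio well above 1 is consistent with the 5' truncation typical of LINE copies, whereas full-length TIR elements would be expected to give a flatter profile. The ratio on its own is not a classifier and should be read alongside the other evidence.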

This is a little disturbing because I'm finding this to be the case for many of the elements discovered in this species. I've been characterizing TEs for over 20 years and know the importance of getting a good characterization of the TEs in a genome assembly. http://gbe.oxfordjournals.org/content/8/2/403 https://academic.oup.com/gbe/article/11/8/2162/5520444 https://www.science.org/doi/10.1126/science.abn1430

Another potential issue is that the software appears to generate quite a few potential false positives. From this image, you can see that this sequence, labeled as an LTR INT by HiTE, probably does contain the coding sequences of an LTR retrotransposon, but it is represented in the genome by just a single instance. The other issue is that it appears to be cobbled together from a bunch of different fragments scattered throughout the genome rather than derived from a single element.

Over the past few days, I've written some pipelines to correct these problems for my own analyses but this is problematic for others who may not be as experienced as I am.

I strongly recommend that you include some sort of warning about misclassification in your GitHub repository.

CSU-KangHu commented 1 month ago

Dear Professor David A. Ray,

I had the pleasure of reading your outstanding paper, "Insights into mammalian TE diversity through the curation of 248 genome assemblies" (https://www.science.org/doi/10.1126/science.abn1430), and found it incredibly insightful. Thank you for your remarkable contributions to the TE field. I look forward to learning more from your work.

While HiTE has shown promise, there is still room for improvement. I am currently working on enhancing the detection modules for all types of TEs, including LTR, TIR, Helitron, and non-LTR, to achieve higher performance.

Could you kindly provide me with the DNA sequences and genomes of the two examples mentioned in your response above? This would greatly assist me in further improving HiTE.

Best regards,

Kang Hu

davidaray commented 1 month ago

Thank you for the kind response. I appreciate the tool and do indeed find it useful.

The examples that I shared are not available publicly and I'm under an agreement not to share the assemblies that they came from. However, I've found some examples from an assembly that is available from NCBI that I've also run through the pipeline.

The assembly is here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_027563665.1/

I'm also linking four relevant files that are similar to the cases I described in my earlier message.

https://www.dropbox.com/scl/fi/zjph7ekxmzouho6qmxbhr/mAntPal2.1.pri_LTR_38_INT-.c2g.pdf?rlkey=dui3kfq88k6ns5tf2clogun0k&dl=0
https://www.dropbox.com/scl/fi/0aj0x6ioytb8ncilzjxi4/mAntPal2.1.pri_LTR_38_INT-_rep.fa?rlkey=cw4yltgva4rqmdi8xtj01wl6z&dl=0
https://www.dropbox.com/scl/fi/3atq1du9duez00rj2weur/mAntPal2.1.pri_TIR_373-.c2g.pdf?rlkey=c2mo2l2z1aaw1601nm7sv4ef0&dl=0
https://www.dropbox.com/scl/fi/peaogym6r213rasxrv2qq/mAntPal2.1.pri_TIR_373-_rep.fa?rlkey=0zus2oy09srkn6veisz0nlekn&dl=0

Please let me know if you need additional information.

David

CSU-KangHu commented 1 month ago

Dear Professor David A. Ray,

I am very pleased to hear that you found HiTE helpful, and I am grateful for the data you provided. I will carefully study your suggestions to improve the tool further. Additionally, I am eagerly looking forward to your next publication. Wishing you continued success in your research.

Best regards,

Kang Hu