Which results would be more accurate, denovo pipeline or DeepTE?

tinyfallen commented 2 years ago

Hi dear developers, thanks for your great scripts! I made a FASTA file containing 12 sequences from my repeat.lib, which I generated by combining ltr.lib, modeler.lib, and mite.lib, and ran DeepTE on it. However, some of the DeepTE results differ from those of the LTR or RepeatModeler pipelines. Which classification should I use? Looking forward to your reply!

songliVT commented 2 years ago

In our training and testing, DeepTE performed very well on MITEs. Can you try another program, such as MITEFinder II, and see what it predicts?

Song

On Mon, Dec 20, 2021 at 3:41 AM tinyhys @.***> wrote:

Hi dear developers, thanks for your great scripts! I made a FASTA file containing 12 sequences from my repeat.lib, which I generated by combining ltr.lib, modeler.lib, and mite.lib, and ran DeepTE on it. However, some of the DeepTE results differ from those of the LTR or RepeatModeler pipelines. Which classification should I use? Looking forward to your reply! [image: image] https://user-images.githubusercontent.com/37066354/146737763-2198af9e-5cd1-46e0-a5ea-00b1c4a29688.png


-- Associate Professor in Plant Genomics and Bioinformatics School of Plant and Environmental Sciences Virginia Polytechnic Institute and State University


yanhaidong1 commented 2 years ago

Hey tinyhys, for the LTR/unknown sequences, I suggest using the following commands from the manual:

Classify LTR TEs:

DeepTE.py -d working_dir -o output_dir -i input_seq.fasta -sp P -m P -fam LTR

Or:

DeepTE.py -d working_dir -o output_dir -i input_seq.fasta -sp P -m_dir Plants_model/ -fam LTR

This will help classify the LTR/unknown sequences into subfamilies.
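To run the LTR-specific commands above on only the unresolved entries, the LTR/unknown records first need to be pulled out of the combined library. Here is a minimal sketch of that step. It assumes RepeatModeler-style headers of the form `name#Class/Family`, and the label `LTR/Unknown` and the toy records are illustrative, so check them against your own repeat.lib.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_unknown_ltr(records):
    """Keep records whose label after '#' is LTR/unknown (case-insensitive).

    Assumption: headers look like 'name#Class/Family', as in a
    RepeatModeler-style library; other header layouts are not handled.
    """
    kept = []
    for header, seq in records:
        name = header.split()[0]          # drop any free-text description
        _, _, label = name.partition("#")
        if label.lower() == "ltr/unknown":
            kept.append((header, seq))
    return kept


# Toy records for illustration only, not real library content.
example = """>fam1#LTR/Unknown
ACGTACGT
>fam2#LTR/Gypsy
TTTTAAAA
>fam3#DNA/MITE
GGGGCCCC
""".splitlines()

unknown_ltrs = select_unknown_ltr(read_fasta(example))
for header, seq in unknown_ltrs:
    print(f">{header}\n{seq}")
```

The printed records can be saved to a file and passed to DeepTE with `-i`, together with `-fam LTR`.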

tinyfallen commented 2 years ago

Thanks for your reply and advice. I tried MITEFinder II https://github.com/jhu99/miteFinder . 93% of the MITEs are consistent with DeepTE's classification, but some were identified as other TE types.
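A consistency figure like the one above can be computed by comparing the two tools' calls per sequence ID. The sketch below assumes each tool's output has already been parsed into a dictionary mapping sequence ID to a TE label. The IDs and labels are made-up toy data, not results from this thread.

```python
from collections import Counter


def compare_calls(calls_a, calls_b):
    """Return (agreement rate, Counter of disagreeing label pairs).

    Only sequence IDs present in both dictionaries are compared.
    """
    shared = sorted(set(calls_a) & set(calls_b))
    disagreements = Counter()
    agree = 0
    for seq_id in shared:
        if calls_a[seq_id] == calls_b[seq_id]:
            agree += 1
        else:
            disagreements[(calls_a[seq_id], calls_b[seq_id])] += 1
    rate = agree / len(shared) if shared else 0.0
    return rate, disagreements


# Hypothetical toy calls; real labels must be normalized to a shared
# vocabulary before comparing, since each tool names types differently.
deepte_calls = {"mite1": "hAT", "mite2": "Mutator", "mite3": "hAT", "mite4": "Harbinger"}
other_calls = {"mite1": "hAT", "mite2": "Mutator", "mite3": "Mutator",
               "mite4": "Harbinger", "mite5": "hAT"}

rate, disagreements = compare_calls(deepte_calls, other_calls)
print(f"agreement: {rate:.0%}")
for (label_a, label_b), count in disagreements.items():
    print(f"{label_a} vs {label_b}: {count}")
```

Listing the disagreeing label pairs, rather than only the overall rate, shows which TE types the two tools confuse most often.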

I want to reduce the high proportion of unclassified repeats in RepeatMasker's *.tbl file. Does DeepTE use the same classification system as RepeatMasker? And how should I modify the IDs in opt_DeepTE.fasta so that RepeatMasker can recognize the types? Maybe the "#" in RepeatModeler's library is the delimiter, but I still don't know which repeat types after the "#" would be recognized.

Thanks again for your excellent scripts !

yanhaidong1 commented 2 years ago

Hey tinyhys, the naming system may not be the same as RepeatModeler's. I suggest first trying opt_DeepTE.fasta directly as the library in RepeatMasker, without modifying the names. If that does not work, you can adjust the names to match what you found in RepeatModeler's library. In my experience, you can use opt_DeepTE.fasta directly as input for RepeatMasker.
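If the names do need adjusting, one option is to rewrite each header into the `name#Class/Family` form that RepeatMasker reads from custom libraries. This is only a rough sketch under stated assumptions: it assumes opt_DeepTE.fasta headers end in `__<DeepTE label>`, and the label-to-class table is a small illustrative guess rather than an official mapping. Verify both against your own files before use.

```python
# Assumed, partial mapping from DeepTE-style labels to RepeatMasker-style
# class strings. Extend or correct it after inspecting your own output.
LABEL_TO_RM = {
    "ClassI_LTR_Copia": "LTR/Copia",
    "ClassI_LTR_Gypsy": "LTR/Gypsy",
    "ClassI_nLTR_LINE": "LINE",
    "ClassI_nLTR_SINE": "SINE",
    "ClassII_DNA_hAT": "DNA/hAT",
    "ClassIII_Helitron": "RC/Helitron",
}


def to_repeatmasker_header(header):
    """Rewrite 'name__Label' as 'name#Class/Family'.

    Assumption: the DeepTE label follows the last '__' in the header.
    Labels missing from LABEL_TO_RM fall back to 'Unknown'.
    """
    name, sep, label = header.rpartition("__")
    if not sep:                      # no DeepTE label found in the header
        name, label = header, ""
    name = name.split("#")[0]        # drop any earlier '#Class/Family' tag
    for suffix in ("_nMITE", "_MITE"):
        if label.endswith(suffix):   # ignore the MITE / non-MITE flag here
            label = label[: -len(suffix)]
            break
    return f"{name}#{LABEL_TO_RM.get(label, 'Unknown')}"


for h in ("fam1#LTR/Unknown__ClassI_LTR_Gypsy", "fam2__ClassII_DNA_hAT_MITE", "fam3"):
    print(h, "->", to_repeatmasker_header(h))
```

Dropping the MITE flag is a simplification made here so that the result fits a plain class/family string. Keep the flag in the sequence name instead if you need it downstream.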

tinyfallen commented 2 years ago

Thanks very much for your reply! Maybe I should ask on the RepeatMasker forum for relevant information.

tinyfallen commented 2 years ago

Hi yanhaidong1 and songliVT, following repeatmasker.help, I checked RMRB.embl and RMRBSeqs.embl for repeat types because I could not find any RepeatMasker.embl. This is what I found; there seems to be no MITE subclass, or does it have another name? repeatmasker.classification.txt

tinyfallen commented 2 years ago

Sorry for my ignorance. I just found a description of MITE subclasses in a paper, which says:

"All MITE-related sequences were classified into Mutator superfamily, PIF-Harbinger superfamily, and hAT superfamily, and further divided into 110 families. Each family was named as DT(A/M/H)1-n (hAT corresponds to DTA, Mutator corresponds to DTM, PIF-Harbinger corresponds to DTH, and number 1-n corresponds to specific family number). Mutator superfamily contained 82 families, hAT superfamily contained 20 families, and PIF-Harbinger superfamily contained 8 families. In total, 61,980 full-length MITEs were annotated in Citrus species, and the average length of MITE-related sequences covered ~3% of the total genome sequences"

TEs are so much more complicated than I ever thought! Thanks again for your great scripts!
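The naming scheme in the quoted passage can be decoded mechanically. Below is a small sketch that maps a family name such as `DTM12` back to its superfamily and checks that the per-superfamily counts add up to the 110 families mentioned. It assumes the code and the number are written together with no separator, which the quote does not spell out.

```python
import re

# Three-letter codes and family counts as given in the quoted passage.
SUPERFAMILY_BY_CODE = {"DTA": "hAT", "DTM": "Mutator", "DTH": "PIF-Harbinger"}
FAMILY_COUNTS = {"Mutator": 82, "hAT": 20, "PIF-Harbinger": 8}


def parse_family(name):
    """Split a name like 'DTM12' into (superfamily, family number).

    Assumption: the format is the code immediately followed by digits.
    """
    match = re.fullmatch(r"(DT[AMH])(\d+)", name)
    if match is None:
        raise ValueError(f"not a DT(A/M/H)n family name: {name!r}")
    return SUPERFAMILY_BY_CODE[match.group(1)], int(match.group(2))


print(parse_family("DTM12"))        # ('Mutator', 12)
print(sum(FAMILY_COUNTS.values()))  # 110
```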

Maybe you could put a hierarchical tree on the website showing all the TE classes and subclasses, labelled with the names used in DeepTE's results, so that it is easier to understand for someone new to this field.

tinyfallen commented 2 years ago

Recently I found a TE sequence ontology in EDTA's utils; following its terms might be a good choice.

yanhaidong1 commented 2 years ago

Yeah, that may be a good one to follow. DeepTE mainly follows 'Wicker T. et al. (2007) A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet.' I think the basic ideas are similar.
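For readers new to the field, part of the Wicker et al. (2007) hierarchy can be written as a nested dictionary and printed as an indented tree. This is a partial, illustrative sketch: it lists only some orders and superfamilies with their Wicker three-letter codes, and it does not reproduce the exact label strings that DeepTE writes to its output.

```python
# Partial view of the Wicker et al. (2007) scheme: class -> order ->
# {three-letter code: superfamily}. Many entries are deliberately omitted.
WICKER_TREE = {
    "Class I (retrotransposons)": {
        "LTR": {"RLC": "Copia", "RLG": "Gypsy"},
        "LINE": {"RIL": "L1", "RIT": "RTE"},
        "SINE": {"RST": "tRNA"},
    },
    "Class II (DNA transposons)": {
        "TIR": {
            "DTA": "hAT",
            "DTC": "CACTA",
            "DTH": "PIF-Harbinger",
            "DTM": "Mutator",
            "DTT": "Tc1-Mariner",
        },
        "Helitron": {"DHH": "Helitron"},
    },
}


def render(tree, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    lines = []
    for key, value in tree.items():
        if isinstance(value, dict):
            lines.append("  " * depth + key)
            lines.extend(render(value, depth + 1))
        else:
            lines.append("  " * depth + f"{key}  {value}")
    return lines


tree_lines = render(WICKER_TREE)
print("\n".join(tree_lines))
```

MITEs do not get their own branch in this scheme. They are non-autonomous elements usually placed under the TIR superfamily they derive from, which matches the DTA/DTM/DTH naming quoted earlier in the thread.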

LiLabAtVT / DeepTE

Which results would be more accurate, denovo pipeline or DeepTE? #13