TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Annotation accuracy #89

Closed: Song-10-YF closed this issue 6 months ago

Song-10-YF commented 6 months ago

Hello, thank you for developing the TE annotation software "EarlGrey." It makes annotating transposable elements very convenient. However, I have some questions. My species is a non-model plant, and compared to traditional annotations with RepeatMasker and EDTA, your software annotates fewer categories of transposable elements. For example, it lacks Tc1-mariner and MuDR but adds new transposable element families. I would like to know the reason and which one is more accurate.

The command I used is: earlGrey -g /home/syf/Jug/Cpa.fasta -s Cpa -o ./Cpa -t 10

I look forward to your reply. Yanfeng

TobyBaril commented 6 months ago

Hi,

Thanks for checking out Earl Grey! When you refer to "categories", do you mean TE families or consensus sequences? If so, finding fewer of them is expected with Earl Grey, as we implement steps to reduce TE library redundancy, and that redundancy is especially apparent in EDTA libraries. I would highly recommend looking at the Earl Grey manuscript (https://www.biorxiv.org/content/10.1101/2022.06.30.498289v3), where we provide extensive benchmarking and explain the differences between Earl Grey, RepeatModeler2, and EDTA.

Regarding TE classification, there will be inherent differences among software depending on which databases they have been configured with, and on how distant the species being annotated is from those contained in the databases. Whilst automated methods are good for broad-scale, repeatable annotation, some form of manual curation will be required if you need full confidence in particular TE families. For example, it is likely that some unclassified families in RepeatModeler2 and Earl Grey are non-autonomous TEs such as MITEs or SINEs. It is worth noting that an "Unclassified" family can still very much be a real TE family, just one for which our current sampling of eukaryotes has not been extensive enough for a sufficiently similar element to have been described. This will of course improve as sampling increases across eukaryotes. Also, be aware that EDTA does not annotate LINEs and SINEs, so the outputs can be quite different if you expect to find a lot of these in your genome of interest.
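
If it helps to see where two libraries differ, a rough way is to tally the classification labels in each library's FASTA headers and put the counts side by side. The sketch below assumes RepeatMasker-style headers of the form `>name#Class/Superfamily`. The function names and the toy headers are made up for illustration and are not taken from any real output. Different tools may also use different naming schemes for the same superfamily, so labels may need harmonising before the counts are comparable.

```python
from collections import Counter


def classification_counts(fasta_lines):
    """Count consensus sequences per classification label.

    Assumes headers look like '>name#Class/Superfamily [description]'.
    Headers without a '#label' part are counted as 'Unknown'.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].split()
        header = fields[0] if fields else ""
        _, sep, label = header.partition("#")
        counts[label if sep and label else "Unknown"] += 1
    return counts


def compare_libraries(counts_a, counts_b):
    """Map every label seen in either library to (count_a, count_b)."""
    labels = sorted(set(counts_a) | set(counts_b))
    return {label: (counts_a.get(label, 0), counts_b.get(label, 0)) for label in labels}


# Toy data only: invented headers standing in for two TE libraries.
library_a = [">fam1#DNA/hAT", "ACGT", ">fam2#LINE/L1", "ACGT", ">fam3#Unknown", "ACGT"]
library_b = [">TE_1#DNA/hAT", "ACGT", ">TE_2#DNA/TcMar", "ACGT", ">TE_3#DNA/hAT", "ACGT"]

comparison = compare_libraries(
    classification_counts(library_a),
    classification_counts(library_b),
)
for label, (count_a, count_b) in comparison.items():
    print(f"{label}\t{count_a}\t{count_b}")
```

On real data, an open file handle for each library FASTA can be passed in place of the toy lists. A label that appears in only one library is a candidate for a closer look, but it does not say which annotation is correct. That still needs manual curation.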