Annotation accuracy - Githubissues

Hi,

Thanks for checking out Earl Grey! When you refer to "categories", do you mean TE families or consensus sequences? If so, the difference is expected with Earl Grey, because we implement steps to reduce TE library redundancy, and the effect is especially apparent in comparison with EDTA. I would highly recommend looking at the Earl Grey manuscript (https://www.biorxiv.org/content/10.1101/2022.06.30.498289v3), where we provide extensive benchmarking and an explanation of the differences between Earl Grey, RepeatModeler2, and EDTA.

Regarding TE classification, there will be inherent differences among software tools, depending on which databases they have been configured with and how distant the species being annotated is from those represented in the databases. Whilst automated methods are good for broad-scale, repeatable annotation, if you need full confidence in certain TE families, some form of manual curation will be required. For example, it is likely that some unclassified families in RepeatModeler2 and Earl Grey are non-autonomous TEs such as MITEs or SINEs. It is worth noting that an "Unclassified" family can still very much be a real TE family; it is simply one for which no sufficiently similar element has been described yet, because our current sampling of eukaryotes has not been extensive enough. This will of course improve as sampling increases across eukaryotes. Also, be aware that EDTA does not annotate LINEs and SINEs, so the outputs can be quite different if you expect to find a lot of these in your genome of interest.
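
If it helps to see where the libraries differ, one option is to tally the classification labels in each tool's library FASTA and compare the counts side by side. The snippet below is only a minimal sketch, not part of Earl Grey: it assumes RepeatMasker-style headers of the form `>name#Class/Superfamily`, and the function name `count_classifications` and the example headers are made up for illustration. EDTA headers may need adapting if they follow a different convention.

```python
from collections import Counter


def count_classifications(fasta_lines):
    """Count consensus sequences per classification label.

    Assumption: headers look like '>name#Class/Superfamily', with any
    free-text description separated from the name by whitespace.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        name = fields[0] if fields else ""
        # Everything after the first '#' is treated as the classification.
        _, sep, label = name.partition("#")
        counts[label if sep and label else "no_label"] += 1
    return counts


# Made-up headers, purely to show the expected input shape.
example = [
    ">fam1#LTR/Gypsy", "ACGT",
    ">fam2#Unknown", "ACGT",
    ">fam3#LTR/Gypsy some description", "AC",
    ">fam4#DNA/hAT", "GG",
]

for label, n in count_classifications(example).most_common():
    print(label, n)
```

For a real library, pass an open file handle instead of the list, for example `with open(path) as fh: count_classifications(fh)`, where `path` points to whichever library FASTA you want to inspect. Running this on each tool's library should show quickly whether the gap sits mostly in the unclassified families or in the LINE/SINE classes.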

TobyBaril / EarlGrey

Annotation accuracy #89