TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

overextension of TEs #115

Closed ferrojm closed 4 months ago

ferrojm commented 4 months ago

Hi,

I'd like to report an issue I encountered while running Earl Grey on my data. My genome is quite small (~100 Mb), and I ran Earl Grey with a custom, non-redundant satellite DNA library generated with RepeatExplorer, without cd-hit clustering and with removal of putative spurious TE annotations shorter than 100 bp. Afterwards, I followed Goubert et al. (2022) to validate the final consensus sequences using cd-hit, TE_ManAnnot, TE-Aid, etc.

The problem I observed is that many TEs were significantly overextended during the consensus building process. This resulted in large consensus sequences (10-12 kb) with very few complete BLAST hits (only 1-4). The original RepeatModeler consensus sequences they were built from were much smaller (e.g., 80-200 bp consensus sequences gave rise to TEs of 5-6 kb).

In an extreme case (data attached), a 20kb TE was built from a mere 300bp consensus sequence! This could potentially be due to segmental duplications specific to this genome?
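One rough way to screen a library for this symptom is to compare each extended consensus with its starting consensus and with the number of full-length genome hits. The sketch below is only an illustration: the `Consensus` record, the `flag_overextended` helper, the thresholds, and the example values are all hypothetical and are not part of Earl Grey.

```python
from dataclasses import dataclass


@dataclass
class Consensus:
    name: str
    original_len: int      # length of the starting consensus, in bp
    extended_len: int      # length after extension, in bp
    full_length_hits: int  # genome hits covering most of the extended consensus


def flag_overextended(consensi, max_growth=10.0, min_full_hits=5):
    """Return names of consensi that grew a lot but have few full-length copies.

    Both thresholds are arbitrary assumptions and should be tuned per genome.
    """
    flagged = []
    for c in consensi:
        growth = c.extended_len / c.original_len
        if growth > max_growth and c.full_length_hits < min_full_hits:
            flagged.append(c.name)
    return flagged


# Made-up values, loosely shaped like the cases described in this report.
examples = [
    Consensus("fam1", 300, 20000, 2),
    Consensus("fam2", 150, 5500, 3),
    Consensus("fam3", 4800, 5200, 40),
]
print(flag_overextended(examples))
```

Families flagged this way are candidates for manual inspection (for example with TE-Aid), not automatic removal.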

Do you have suggestions for how to address this overextension issue? Perhaps reducing the number of extension rounds in Earl Grey could help?

Cheers!

the original consensus image

earlgrey consensus image

TobyBaril commented 4 months ago

Hi,

We sometimes see this kind of overextension in genomes with segmental duplications, and @jamesdgalbraith has also come across this when updating and working on the BEAT process. Currently, the minimum number of sequences required in an alignment is 3, so if we are "lucky" enough to have three full-length duplications the consensus will continue to be extended. As with all TE curation, some level of manual curation will always be required for high-quality libraries - as you have done in this case using TE-Aid etc.
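To make the effect of the minimum-sequence threshold concrete, here is a toy model of iterative extension. It is not Earl Grey's actual implementation: `simulate_extension`, the step size, the round limit, and the example copy numbers are assumptions chosen for illustration. Only the threshold of 3 sequences comes from the explanation above.

```python
def simulate_extension(copy_extents, step=1000, max_rounds=10, min_seqs=3):
    """Toy model: return how many bp get added to a consensus.

    copy_extents holds, for each genomic copy, how many bp of flanking
    sequence are still alignable with the other copies. Each round adds
    `step` bp, but only if at least `min_seqs` copies align across the
    whole newly added stretch.
    """
    added = 0
    for _ in range(max_rounds):
        target = added + step
        supporting = sum(1 for extent in copy_extents if extent >= target)
        if supporting < min_seqs:
            break  # too few sequences left in the alignment, stop extending
        added = target
    return added


# 50 ordinary copies with no shared flank, plus 3 copies sitting inside a
# segmental duplication that shares 20 kb of flank (hypothetical numbers).
copies = [0] * 50 + [20000] * 3

print(simulate_extension(copies))                # 3 duplicates are enough to keep going
print(simulate_extension(copies, min_seqs=4))    # a stricter threshold stops immediately
print(simulate_extension(copies, max_rounds=3))  # fewer rounds caps the damage
```

In this model, three duplicated copies are sufficient to drive extension for every round, while either a higher minimum or fewer rounds limits it.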

Some other solutions that can help:

I hope these recommendations help!

ferrojm commented 4 months ago

Thanks for your clear explanation!

I will follow your recommendations.

Cheers