Closed: ferrojm closed this issue 4 months ago
Hi,
We sometimes see this kind of overextension in genomes with segmental duplications, and @jamesdgalbraith has also come across this when updating and working on the BEAT process. Currently, the minimum number of sequences required in an alignment is 3, so if we are "lucky" enough to have three full-length duplications the consensus will continue to be extended. As with all TE curation, some level of manual curation will always be required for high-quality libraries - as you have done in this case using TE-Aid etc.
Some other solutions that can help would be:
- The `-i` flag, to reduce the number of BEAT rounds (the default is 10, but you could safely drop this a fair amount in this case).
- The `-f` flag, to reduce the number of flanking bases added to existing consensi in each BEAT round (the default is 1,000 bp, but again you could drop this to reduce extension in each round).
- Editing the minimum sequence requirement in `TEtrim.py` (`/path/to/envs/earlgrey/share/earlgrey-[version]/scripts/TEstrainer/scripts/TEtrim.py`). Specifically, line 81, which is currently `if(len(align)<3):`, could be changed to a more suitable integer, preferably larger than the number of segmental duplications you find in the example above. This may have a knock-on effect on low-copy-number families, however, so I would recommend keeping the current results and checking for any missing families that you can supplement the final library with.

I hope these recommendations help!
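For illustration, the effect of raising that threshold can be sketched as follows (a minimal sketch: `align` stands in for the alignment object in `TEtrim.py` and is modelled here as a plain list of sequences; `MIN_SEQS` and `too_few_sequences` are hypothetical names, not part of the script):

```python
# Hypothetical sketch of the minimum-sequence check around line 81 of
# TEtrim.py. In the real script `align` is an alignment object; a plain
# list of sequences reproduces the len() logic for illustration.
MIN_SEQS = 10  # raise this above the number of segmental duplications you see

def too_few_sequences(align, min_seqs=MIN_SEQS):
    """Mirror of `if(len(align) < 3):` with a configurable threshold."""
    return len(align) < min_seqs

# With the stock threshold of 3, three full-length duplicated copies pass
# and the consensus keeps being extended; a higher threshold catches them.
print(too_few_sequences(["copy"] * 3, min_seqs=3))   # False: extension continues
print(too_few_sequences(["copy"] * 3, min_seqs=10))  # True: extension stops
```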
Thanks for your clear explanation!
I will follow your recommendations.
Cheers
Hi,
I'd like to report an issue I encountered while running Earl Grey on my data. My genome is quite small (~100 Mb), and I ran Earl Grey with a custom, non-redundant satellite DNA library generated with RepeatExplorer, skipping cd-hit clustering and removing putative spurious TE annotations shorter than 100 bp. Afterwards, I followed Goubert et al. (2022) to validate the final consensus sequences using cd-hit, TE_ManAnnot, TE-Aid, etc.
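For reference, the short-sequence filtering step I applied can be sketched like this (a minimal sketch in plain Python; the pure-Python FASTA parser and function names are illustrative only — in practice a tool like seqkit or Biopython does the same job):

```python
# Minimal sketch: drop putative spurious consensi shorter than 100 bp from a
# FASTA-formatted library before passing it to the pipeline.
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def drop_short(records, min_len=100):
    """Keep only consensi of at least `min_len` bp."""
    return {n: s for n, s in records.items() if len(s) >= min_len}

library = read_fasta(">sat1\n" + "A" * 120 + "\n>spurious\n" + "A" * 60 + "\n")
print(sorted(drop_short(library)))  # ['sat1']
```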
The problem I observed is that many TEs were significantly overextended during consensus building. This resulted in large consensus sequences (10-12 kb) with very few complete BLAST hits (only 1-4). When I compared them to the original RepeatModeler consensus sequences, those were much smaller (e.g., 80-200 bp consensus sequences yielded 5-6 kb TEs).
In an extreme case (data attached), a 20 kb TE was built from a mere 300 bp consensus sequence! Could this be due to segmental duplications specific to this genome?
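A quick way to screen for such cases is to compare each extended consensus length against its starting consensus length and flag large ratios (a hedged sketch; the 10x cutoff and the function name are arbitrary illustrations, not recommended values):

```python
# Sketch: flag consensi whose length ballooned relative to the starting
# consensus. Lengths are in bp; the 10x cutoff is illustrative only.
def flag_overextended(original_len, extended_len, max_ratio=10.0):
    """Return True if the extended consensus exceeds max_ratio x the original."""
    return extended_len > max_ratio * original_len

# Numbers from the examples in this report:
print(flag_overextended(300, 20_000))  # True: the 300 bp -> 20 kb extreme case
print(flag_overextended(200, 1_500))   # False: a more modest extension
```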
Do you have suggestions for how to address this overextension issue? Perhaps reducing the number of extension rounds in Earl Grey could help?
Cheers!
Attachments:
- the original consensus
- the Earl Grey consensus