Open conniecl opened 3 years ago
Is there any way to improve the Unclassified to the known transposons?
RepeatModeler's classifications are in part based on sequence similarity to already-known TE families in the RepeatMasker database (including Dfam and optionally RepBase RepeatMasker Edition); this means that the quality of classification should be better the more closely related the genome is to an already-represented species. Each release of Dfam includes more species, further improving classifications, but there are still many un- or under-represented groups. If your species is only very distantly related to species already in the database, it is more likely that you will discover novel TE families or groups that cannot be reliably classified by automated methods.
You could try running other classification pipelines or methods on your TE library. Different tools may use structural patterns, machine learning, or other TE or protein databases, which could perform better, especially if they have been trained on organisms closely related to yours.
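Before handing the library to another classifier, it can help to see how many families each label actually covers. The following is a minimal sketch, assuming the library is a FASTA file whose headers look like `>name#Class/Family` with unclassified families labelled `#Unknown`; that header layout is an assumption to check against your own file. The function name `tally_classes` and the example records are made up for illustration.

```python
from collections import Counter


def tally_classes(fasta_lines):
    """Count consensus sequences per classification label.

    Assumes headers of the form '>name#Class/Family optional text'.
    Headers without a '#' are counted under 'no_label'.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        identifier = fields[0] if fields else ""
        if "#" in identifier:
            label = identifier.split("#", 1)[1]
        else:
            label = "no_label"
        counts[label] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    ">rnd-1_family-0#LTR/Gypsy ( RepeatScout Family Size = 120 )",
    "ACGT",
    ">rnd-1_family-1#Unknown",
    "TTGA",
    ">rnd-1_family-2#Unknown",
    "GGCC",
]

counts = tally_classes(example)
for label, n in counts.most_common():
    print(label, n)
```

For a real library, pass an open file handle instead of the list, e.g. `tally_classes(open(path))`, where `path` points to your own classified consensus file.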
Can you tell us what species you are studying? This would be helpful to us, both in determining if RepeatClassifier performs worse than expected in that species and in assessing what species/clades researchers are interested in that we should prioritize for future work.
Hi Jeb,
Thanks for your quick reply and suggestions. I have run EDTA, and it seems that EDTA classifies most of the sequences that RepeatModeler left as Unclassified as Helitron. However, the author of EDTA has warned that Helitron annotations should be treated carefully because of their high false-positive rate. I'm not sure the classification is right, since 16% seems too high.
Also, the species I study belongs to the genus Eleocharis in the sedge family (Cyperaceae). It propagates mostly clonally but can also reproduce sexually.
I hope to get your further suggestions or comments on these results. Thanks again.
Here is the detailed classification from EDTA:
Repeat Classes
==============
Total Sequences: 1085
Total Length: 1062048304 bp

Class                  Count        bpMasked     %masked
=====                  =====        ========     =======
LTR                    --           --           --
    Copia              86281        84835551     7.99%
    Gypsy              88257        132523944    12.48%
    unknown            80776        61564638     5.80%
TIR                    --           --           --
    CACTA              43886        17173407     1.62%
    Mutator            121434       37898223     3.57%
    PIF_Harbinger      11484        2922615      0.28%
    Tc1_Mariner        2444         1218807      0.11%
    hAT                76355        27729547     2.61%
    polinton           2            222          0.00%
nonLTR                 --           --           --
    LINE_element       1899         1550627      0.15%
    unknown            263          336429       0.03%
nonTIR                 --           --           --
    helitron           481926       171172833    16.12%
repeat_region          77371        103154605    9.71%
                       ---------------------------------
    total interspersed 1072378      642081448    60.46%
---------------------------------------------------------
Total                  1072378      642081448    60.46%
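As a sanity check on how to read the table, the `%masked` column appears to be `bpMasked` divided by `Total Length`. The sketch below recomputes a few rows from the numbers above and also shows each class as a share of all masked sequence, which can put the Helitron figure in context. The variable and function names are illustrative only.

```python
# Values copied from the EDTA summary table above.
TOTAL_LENGTH_BP = 1_062_048_304
TOTAL_INTERSPERSED_BP = 642_081_448

bp_masked = {
    "Copia": 84_835_551,
    "Gypsy": 132_523_944,
    "helitron": 171_172_833,
    "repeat_region": 103_154_605,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for name, bp in bp_masked.items():
    of_genome = percent(bp, TOTAL_LENGTH_BP)
    of_masked = percent(bp, TOTAL_INTERSPERSED_BP)
    print(f"{name:15s} {of_genome:6.2f}% of genome, "
          f"{of_masked:6.2f}% of masked sequence")
```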
Yes, unfortunately it looks like grasses in general, and sedges in particular, are not well represented in either Dfam or RepBase RepeatMasker Edition. You might have better luck comparing your TE consensi against a database that is more plant-focused, perhaps using this list compiled by Tyler Elliott as a starting point: https://tehub.org/en/resources/repeat_databases.
Hi, I was running RepeatModeler 2.0.1 for TE annotation of my genome, and it still returns a high proportion of Unclassified elements (33.39%, 53% in total). I have checked RepBase as described in #128, but nothing changed. Is there any way to improve the Unclassified to the known transposons? Here are my commands and results: