high proportion Unclassified

conniecl commented 3 years ago

Hi, I ran RepeatModeler 2.0.1 for TE annotation of my genome, but it still returns a high proportion of Unclassified repeats (33.39% of the genome, out of 53% interspersed repeats in total). I checked RepBase as described in #128, but nothing changed. Is there any way to reclassify the Unclassified families as known transposons? Here are my commands and results:

BuildDatabase -engine ncbi -name genome mygenome_adj.fa
RepeatModeler -database genome -engine ncbi -pa 40 -LTRStruct -ninja_dir ~/software/anaconda3/envs/repeatmodel/share/NINJA-0.95-cluster_only/NINJA
RepeatMasker -pa 40 -lib genome-families.fa mygenome_adj.fa

==================================================
file name: mygenome_adj.fa         
sequences:           996
total length: 1062127318 bp  (1062083318 bp excl N/X-runs)
GC level:         35.93 %
bases masked:  587570492 bp ( 55.32 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       143402    186842054 bp   17.59 %
   SINEs:                0            0 bp    0.00 %
   Penelope              0            0 bp    0.00 %
   LINEs:            36498     26795956 bp    2.52 %
    CRE/SLACS            0            0 bp    0.00 %
     L2/CR1/Rex          0            0 bp    0.00 %
     R1/LOA/Jockey    1733      5014319 bp    0.47 %
     R2/R4/NeSL          0            0 bp    0.00 %
     RTE/Bov-B        6824      4066661 bp    0.38 %
     L1/CIN4         27941     17714976 bp    1.67 %
   LTR elements:    106904    160046098 bp   15.07 %
     BEL/Pao           243        60644 bp    0.01 %
     Ty1/Copia       55632     70061559 bp    6.60 %
     Gypsy/DIRS1     48596     83424152 bp    7.85 %
       Retroviral        0            0 bp    0.00 %

DNA transposons      37300     21516184 bp    2.03 %
   hobo-Activator    12721      4848869 bp    0.46 %
   Tc1-IS630-Pogo     4090      1685068 bp    0.16 %
   En-Spm                0            0 bp    0.00 %
   MuDR-IS905            0            0 bp    0.00 %
   PiggyBac              0            0 bp    0.00 %
   Tourist/Harbinger   687       304517 bp    0.03 %
   Other (Mirage,        0            0 bp    0.00 %
    P-element, Transib)

Rolling-circles       2451      1433016 bp    0.13 %

Unclassified:       919657    354592831 bp   33.39 %

Total interspersed repeats:   562951069 bp   53.00 %

Small RNA:            7517     12556501 bp    1.18 %

Satellites:              0            0 bp    0.00 %
Simple repeats:     194962      8536855 bp    0.80 %
Low complexity:      42172      2093051 bp    0.20 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element

RepeatMasker version 4.1.2-p1 , default mode

run with rmblastn version 2.10.0+
The query was compared to classified sequences in "genome-families.fa"
FamDB:
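
For context, the two percentages quoted above can be reproduced from the base-pair counts in the table. The short sketch below only redoes that arithmetic with the figures copied from the RepeatMasker summary; the variable and function names are illustrative and not part of any tool.

```python
# Figures copied from the RepeatMasker summary table above.
total_length_bp = 1_062_127_318   # "total length"
interspersed_bp = 562_951_069     # "Total interspersed repeats"
unclassified_bp = 354_592_831     # "Unclassified"


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of the whole assembly that is masked by Unclassified families.
unclassified_of_genome = percent(unclassified_bp, total_length_bp)
# Share of the interspersed-repeat fraction that is Unclassified.
unclassified_of_repeats = percent(unclassified_bp, interspersed_bp)

print(f"Unclassified / genome:               {unclassified_of_genome:.2f} %")
print(f"Unclassified / interspersed repeats: {unclassified_of_repeats:.2f} %")
```

In other words, roughly 63% of everything annotated as an interspersed repeat has no class assigned, which is why the question matters for this genome.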

jebrosen commented 3 years ago

Is there any way to reclassify the Unclassified families as known transposons?

RepeatModeler's classifications are in part based on sequence similarity to already-known TE families in the RepeatMasker database (including Dfam and optionally RepBase RepeatMasker Edition); this means that the quality of classification should be better the more closely related the genome is to an already-represented species. Each release of Dfam includes more species, further improving classifications, but there are still many un- or under-represented groups. If your species is only very distantly related to species already in the database, it is more likely that you will discover novel TE families or groups that cannot be reliably classified by automated methods.

You could try running other classification pipelines or methods on your TE library. Different tools may use structural patterns, machine learning, or other TE or protein databases, which could perform better, especially if they have been trained on organisms closely related to yours.
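
As a possible first step toward trying another classifier, the unclassified families can be pulled out of the RepeatModeler library into their own FASTA file. This is a minimal sketch, not part of RepeatModeler: it assumes the library headers look like `>rnd-1_family-12#Unknown ( ... )`, with the class after the `#` and `Unknown` marking unclassified families, so check that against your own `genome-families.fa`. The function names and output file names are hypothetical.

```python
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header_without_'>', sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def family_class(header: str) -> str:
    """Return the class after '#' in the first header token.

    Assumption: headers without a '#' are treated as 'Unknown'.
    """
    tokens = header.split()
    name = tokens[0] if tokens else ""
    return name.split("#", 1)[1] if "#" in name else "Unknown"


def split_library(library: str, known_out: str, unknown_out: str) -> Tuple[int, int]:
    """Write classified and Unknown families to separate FASTA files.

    Returns (number_of_classified, number_of_unknown).
    """
    n_known = n_unknown = 0
    with open(known_out, "w") as known, open(unknown_out, "w") as unknown:
        for header, seq in read_fasta(library):
            if family_class(header) == "Unknown":
                unknown.write(f">{header}\n{seq}\n")
                n_unknown += 1
            else:
                known.write(f">{header}\n{seq}\n")
                n_known += 1
    return n_known, n_unknown
```

Calling `split_library("genome-families.fa", "known.fa", "unknown.fa")` would then leave the unclassified consensi in `unknown.fa`, ready to be passed to whichever other classification tool you choose.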

Can you tell us what species you are studying? This would be helpful to us, both in determining if RepeatClassifier performs worse than expected in that species and in assessing what species/clades researchers are interested in that we should prioritize for future work.

conniecl commented 3 years ago

Hi Jeb, thanks for your quick reply and suggestions. I have run EDTA, and it seems to classify most of what RepeatModeler left Unclassified as Helitron. But the author of EDTA has warned that Helitron calls should be treated carefully because of their high false-positive rate. I'm not sure the classification is right, since 16% seems too high.

Also, the species I study is from the sedge family (Cyperaceae), genus Eleocharis. It propagates mostly clonally but can also reproduce sexually.

I hope to get your further suggestions or comments on these results. Thanks again.

Here is the detailed classification from EDTA:

Repeat Classes
==============
Total Sequences: 1085
Total Length: 1062048304 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
LTR                    --           --           --   
    Copia              86281        84835551     7.99% 
    Gypsy              88257        132523944    12.48% 
    unknown            80776        61564638     5.80% 
TIR                    --           --           --   
    CACTA              43886        17173407     1.62% 
    Mutator            121434       37898223     3.57% 
    PIF_Harbinger      11484        2922615      0.28% 
    Tc1_Mariner        2444         1218807      0.11% 
    hAT                76355        27729547     2.61% 
    polinton           2            222          0.00% 
nonLTR                 --           --           --   
    LINE_element       1899         1550627      0.15% 
    unknown            263          336429       0.03% 
nonTIR                 --           --           --   
    helitron           481926       171172833    16.12% 
repeat_region          77371        103154605    9.71% 
                      ---------------------------------
    total interspersed 1072378      642081448    60.46%

---------------------------------------------------------
Total                  1072378      642081448    60.46%

jebrosen commented 3 years ago

Yes, unfortunately it looks like grasses in general, and sedges in particular, are not well represented in either Dfam or RepBase RepeatMasker Edition. You might have better luck comparing your TE consensi against a database that is more plant-focused, perhaps using this list compiled by Tyler Elliott as a starting point: https://tehub.org/en/resources/repeat_databases.
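
One way such a comparison might be summarized: if the consensi are searched against a plant TE library with BLAST and the results are saved in tabular format (`-outfmt 6`), the best-scoring hit per family gives a candidate label to review by hand. The sketch below assumes the standard 12-column tabular layout; the length and e-value cutoffs are arbitrary placeholders, not recommended values, and the function names are hypothetical. A best hit is only a hint, not a classification.

```python
import csv
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    query: str       # column 1: qseqid (your TE family)
    subject: str     # column 2: sseqid (database entry)
    pident: float    # column 3: percent identity
    length: int      # column 4: alignment length
    evalue: float    # column 11: e-value
    bitscore: float  # column 12: bit score


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse BLAST tabular (outfmt 6) lines, skipping comments and short rows."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue
        yield Hit(row[0], row[1], float(row[2]), int(row[3]),
                  float(row[10]), float(row[11]))


def best_hits(hits: Iterable[Hit],
              min_length: int = 80,        # placeholder cutoff (assumption)
              max_evalue: float = 1e-10    # placeholder cutoff (assumption)
              ) -> Dict[str, Hit]:
    """Keep the highest-bitscore hit per query among hits passing the cutoffs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.length < min_length or hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

With a results file on disk, usage would look like `best_hits(parse_outfmt6(open("hits.tsv")))`, where `hits.tsv` is a hypothetical file name. Families with no surviving hit remain unclassified.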

Dfam-consortium / RepeatModeler

high proportion Unclassified #146