Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Repeat Annotations #245

Open anna-elisabet opened 9 months ago

anna-elisabet commented 9 months ago

Hi, I would like to know whether the "Total interspersed repeats" percentage in the tbl output file is the sum of the "Retroelements", "DNA transposons", "Rolling-circles", and "Unclassified". When I add these categories together, I do not get the total as given. If it is not the exact total, how is the total calculated? I have run RepeatMasker on different genomes with different libraries, and in some cases I found that excluding the "Rolling-circles" makes the rest add up to the total given interspersed repeats, but that was not always the case.

I used a custom repeat library generated by RepeatModeler, subsequently classified by DeepTE.

Thanks!

genmor commented 9 months ago

Hello, I'm noticing a similar discrepancy between summed category totals and the sum of each element calculated manually. For example:

                      number of      length   percentage
                      elements*    occupied  of sequence

Retroelements          1574041   507197627 bp  38.66 %
   SINEs:                50409    11008570 bp   0.84 %
   Penelope:                 0           0 bp   0.00 %
   LINEs:               313199   170398254 bp  12.99 %
      CRE/SLACS              0           0 bp   0.00 %
      L2/CR1/Rex         37151    16145459 bp   1.23 %
      R1/LOA/Jockey     142017   107402110 bp   8.19 %
      R2/R4/NeSL           515      586173 bp   0.04 %
      RTE/Bov-B         100365    33741377 bp   2.57 %
      L1/CIN4             3835     1666968 bp   0.13 %
   LTR elements:       1210433   325790803 bp  24.83 %
      BEL/Pao           243427    70780631 bp   5.39 %
      Ty1/Copia         219090    66969649 bp   5.10 %
      Gypsy/DIRS1       336842    96270425 bp   7.34 %
      Retroviral          1349      479736 bp   0.04 %

DNA transposons         142553    36526713 bp   2.78 %
   hobo-Activator        40811     6606508 bp   0.50 %
   Tc1-IS630-Pogo        29699    10990528 bp   0.84 %
   En-Spm                    0           0 bp   0.00 %
   MULE-MuDR             18695     2456832 bp   0.19 %
   PiggyBac              24159     4274124 bp   0.33 %
   Tourist/Harbinger      5938      615632 bp   0.05 %
   Other (Mirage,         1228      319519 bp   0.02 %
    P-element, Transib)

Rolling-circles          63104    18793057 bp   1.43 %

Unclassified:          1798068   422243089 bp  32.18 %

Total interspersed repeats:      965967429 bp  73.62 %

Small RNA:                8616     1717953 bp   0.13 %

Satellites:               7189     1514437 bp   0.12 %
Simple repeats:          84476     9717015 bp   0.74 %
Low complexity:           7278      586888 bp   0.04 %

Here, if I'm understanding this table correctly, Retroelements = SINEs + Penelope + LINEs + LTR elements. However, the Retroelements value as presented in the table appears to be the sum of SINEs, LINEs, and LTR elements only. Is this intentional? Could this be related to #200?

In a similar vein to the above and to the question posed by @anna-elisabet: in the example above, no matter how I calculate total interspersed repeats manually, the reported value is less than the sum of the individual subcategories. I am aware of #228 and that DNA transposons isn't a category total but a subcategory itself, but that only exacerbates the difference.

It's very likely I'm not making a crucial connection and misunderstanding something. Any guidance is appreciated.

I'm using RepeatMasker ver. 4.1.5, if that's helpful.

anna-elisabet commented 9 months ago

Hi @genmor, I was looking at your data and would like to comment. Regarding your first question, I don't see the issue, since your results report no Penelope elements, thus your Retroelements total is still equal to SINEs+Penelope+LINEs+LTR elements.

Second comment, interestingly, I also see that if you sum Retroelements, DNA transposons and Unclassified (excluding rolling-circles) it adds up to the "Total interspersed repeats" given. My guess is still that RepeatMasker does not count "Rolling-circles" as an interspersed repeat, which, according to my understanding of the biology, is wrong. Could @rmhubley maybe confirm this?
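The arithmetic described in this comment can be checked with a short Python sketch. The base-pair values are copied from the first table in this thread; the variable names are illustrative only and are not part of RepeatMasker.

```python
# Base-pair totals copied from the first .tbl excerpt in this thread.
retroelements = 507_197_627
dna_transposons = 36_526_713
rolling_circles = 18_793_057
unclassified = 422_243_089
reported_total = 965_967_429  # "Total interspersed repeats"

# Sum without and with the separately listed "Rolling-circles" row.
without_rc = retroelements + dna_transposons + unclassified
with_rc = without_rc + rolling_circles

print(without_rc == reported_total)  # True
print(with_rc - reported_total)      # 18793057, i.e. exactly the Rolling-circles row
```

This only shows which rows reproduce the reported figure for this one table; it does not by itself say how RepeatMasker classifies Rolling-circles internally.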

genmor commented 9 months ago

Hi @anna-elisabet, thanks for noticing this! It looks like I copy-pasted from the wrong set of results! Below are results where things don't add up properly. In the table below, as I originally described, the sum of SINEs, LINEs, Penelope, and LTR elements isn't equal to the value presented for Retroelements. I was able to trace the difference for total interspersed repeats to the fact that Penelope elements aren't included in the sum for Retroelements. Shouldn't Penelope be included in Retroelements?

I'll note that in #228, the answer given was that DNA transposons isn't a sum of the categories below it, but that they are unspecified DNA transposons. As you note though, the arithmetic here shows that it (i.e., DNA transposons) actually is being treated as a category sum to calculate the total interspersed repeats.

I think the long and short of this is that we need @rmhubley to confirm, lol.

================================================
                      number of      length   percentage
                      elements*    occupied  of sequence

Retroelements          1653929   544427428 bp  42.28 %
   SINEs:                12767     2122105 bp   0.16 %
   Penelope:              1308      196610 bp   0.02 %
   LINEs:               268324   170898139 bp  13.27 %
      CRE/SLACS              0           0 bp   0.00 %
      L2/CR1/Rex         54678    24441646 bp   1.90 %
      R1/LOA/Jockey     106163    80475419 bp   6.25 %
      R2/R4/NeSL           701      623608 bp   0.05 %
      RTE/Bov-B          75639    46991836 bp   3.65 %
      L1/CIN4             8187     6657994 bp   0.52 %
   LTR elements:       1372838   371407184 bp  28.84 %
      BEL/Pao           347140    97426611 bp   7.57 %
      Ty1/Copia         210820    59050107 bp   4.59 %
      Gypsy/DIRS1       360400   110369588 bp   8.57 %
      Retroviral           651      143842 bp   0.01 %

DNA transposons         212496    59280790 bp   4.60 %
   hobo-Activator        41733     6560333 bp   0.51 %
   Tc1-IS630-Pogo        32033    10508812 bp   0.82 %
   En-Spm                    0           0 bp   0.00 %
   MULE-MuDR             12510     1197304 bp   0.09 %
   PiggyBac              30601     7282290 bp   0.57 %
   Tourist/Harbinger      2730      259076 bp   0.02 %
   Other (Mirage,          447      132122 bp   0.01 %
    P-element, Transib)

Rolling-circles          21355     4327283 bp   0.34 %

Unclassified:          1526022   354459363 bp  27.53 %

Total interspersed repeats:      958364191 bp  74.43 %

Small RNA:               11703     3389242 bp   0.26 %

Satellites:                529      134516 bp   0.01 %
Simple repeats:          84712     7378502 bp   0.57 %
Low complexity:           7928      576171 bp   0.04 %

anna-elisabet commented 9 months ago

Hi @genmor! Yes, I see now: the Penelope counts are not in the total number of Retroelements. I actually had a closer look at my results and I found the same thing! As for the DNA transposons, I interpreted the Rolling-circles as being counted separately because of the layout of the file. My bad!

I think I finally cracked the code: To get the total interspersed repeats as reported in the .tbl file, sum the following: Retroelements, Penelope, DNA transposons, Unclassified. So the Penelopes are indeed being factored in for the total interspersed repeat count, but not for the Retroelements count.
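As a quick check of this rule, here is a small Python sketch using the base-pair values from the second table in this thread (the one with non-zero Penelope counts). The variable names are illustrative and not part of RepeatMasker.

```python
# Base-pair totals copied from the second .tbl excerpt in this thread.
retroelements = 544_427_428
sines = 2_122_105
penelope = 196_610
lines = 170_898_139
ltr_elements = 371_407_184
dna_transposons = 59_280_790
unclassified = 354_459_363
reported_total = 958_364_191  # "Total interspersed repeats"

# The reported Retroelements value matches SINEs + LINEs + LTR, without Penelope.
retro_without_penelope = sines + lines + ltr_elements
print(retro_without_penelope == retroelements)  # True

# The reported total matches Retroelements + Penelope + DNA transposons + Unclassified.
rebuilt_total = retroelements + penelope + dna_transposons + unclassified
print(rebuilt_total == reported_total)  # True
```

Both comparisons hold for this table, which is consistent with Penelope being counted in the overall total but left out of the Retroelements row.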

genmor commented 9 months ago

Hi @anna-elisabet, thanks for confirming. I'm not sure yet whether the way things are being counted makes sense, but I feel a little better knowing which values fit where. I think I need to do a little reading to understand the logic of how these different items are being categorized.

rmhubley commented 2 months ago

Oh my -- sorry for the confusion, and thanks for supporting each other through this. The .tbl format is a static format that hasn't changed much since Arian first released RepeatMasker. There are different formats for primates, mice, mammals, and one for everything else. Unfortunately, these formats have sometimes lagged behind changes to the classification, and I believe you have identified a bug with Penelope here. Penelope used to be classified as "LINE/Penelope" and now is "PLE/" (with quite a few subtypes: "Athena", "Chlamys", etc.). The table section "Retroelements" is missing the new type "PLE" in its tabulation. I will correct this in the next release. Since there isn't a one-size-fits-all approach to summarizing results, we also provide the util/buildSummary.pl script, which performs a per-class and per-family tabulation of the .out file and would be more useful to you in this circumstance.
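To illustrate what a per-class tabulation of the .out file looks like, here is a minimal Python sketch. It is not buildSummary.pl and makes no claim to match its output. It assumes the usual whitespace-separated .out layout (score, divergence, deletions, insertions, query name, query begin, query end, bases left, strand, repeat name, class/family, ...), and it naively adds up query spans without resolving overlapping annotations, so its totals can exceed those in the .tbl file. The function name `tabulate_out` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def tabulate_out(lines):
    """Naively sum element counts and covered bp per class/family from .out lines."""
    totals = defaultdict(lambda: {"count": 0, "bp": 0})
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])  # position in query sequence
        class_family = fields[10]                    # e.g. "LINE/L2", "PLE/Chlamys"
        totals[class_family]["count"] += 1
        totals[class_family]["bp"] += end - begin + 1
    return dict(totals)

# Made-up rows in the assumed .out layout (not real RepeatMasker output).
sample = [
    "   SW   perc perc perc  query  position in query  matching  repeat  position in repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat  class/family  begin end (left) ID",
    "",
    " 1320 15.6  6.2  0.0 chr1   1 100 (900) + ExampleA DNA/hAT-Charlie 84 200 (0) 1",
    "  500 10.0  0.0  0.0 chr1 201 300 (700) C ExampleB PLE/Chlamys (0) 100 1 2",
    "  300  5.0  0.0  0.0 chr1 401 450 (550) + ExampleC PLE/Chlamys 1 50 (10) 3",
]

summary = tabulate_out(sample)
for class_family, stats in sorted(summary.items()):
    print(class_family, stats["count"], stats["bp"])
```

Grouping on the full class/family string keeps types such as "PLE/..." visible as their own rows rather than depending on which fixed category a summary table assigns them to.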