Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

RepeatMasker Failing to Find Simple Repeats #137

Closed jjrozewicki closed 2 years ago

jjrozewicki commented 2 years ago

When running RepeatMasker on genome sequences assembled from HiFi long reads, significant repetitive portions of scaffolds are left unmasked (sometimes up to 10% of total bases). This has a negative impact on downstream analyses such as gene prediction. To account for this, we have recently been combining the results of RepeatMasker and TRF ourselves as a post-processing step. This addresses the issue somewhat, but it seems like a bug that RepeatMasker is not finding these repeats.

It is true that TRF is run (twice) during the RepeatMasker process, but we have verified that the arguments RepeatMasker passes to TRF, which target specific kinds of simple repeats, cause it to miss some repeats that are found when TRF is run with its recommended default options.

We have verified this pattern on the publicly available sScyCan1.1 genome:

Top 30 scaffolds sorted by missed repeats between RM and TRF+RM: [image]

This trend may have been negligible when genome assemblies were obtained from short reads, but it could become more and more evident with the rise of high-fidelity long-read assembly, which integrates large amounts of simple repeats into the resulting sequences.
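The post-processing step described above (taking the union of RepeatMasker and TRF hits, then counting what RepeatMasker alone missed) could be sketched roughly as follows. This is only an illustration, not the poster's actual pipeline: it assumes the hits from each tool have already been parsed into `(scaffold, start, end)` tuples with half-open coordinates, and the function names `merge_intervals`, `combine_masks`, `masked_bases`, and `missed_by_rm` are hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def combine_masks(rm_hits, trf_hits):
    """Union the hits of both tools, per scaffold.

    Both inputs are iterables of (scaffold, start, end) tuples; parsing the
    tools' real output files is assumed to happen elsewhere.
    """
    by_scaffold = defaultdict(list)
    for scaffold, start, end in list(rm_hits) + list(trf_hits):
        by_scaffold[scaffold].append((start, end))
    return {s: merge_intervals(iv) for s, iv in by_scaffold.items()}


def masked_bases(mask):
    """Total number of bases covered by a {scaffold: [(start, end), ...]} mask."""
    return sum(end - start for ivs in mask.values() for start, end in ivs)


def missed_by_rm(rm_hits, trf_hits):
    """Bases covered by the combined mask but not by RepeatMasker alone."""
    combined = combine_masks(rm_hits, trf_hits)
    rm_only = combine_masks(rm_hits, [])
    return masked_bases(combined) - masked_bases(rm_only)
```

Summing `missed_by_rm` per scaffold and sorting would give a ranking like the one in the figure, under the same assumptions about the input tuples.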

gaminyeh commented 2 years ago

I have the same issue. RepeatMasker can't find the simple repeats in the centromere of my genome.

This is the distribution of repeats (yellow) and genes (green) on chromosome 2. No repeats are detected in the central region, but when I submitted the sequence of that region to NCBI, it showed that it consists of microsatellites. [image]

rmhubley commented 2 years ago

Historically, RepeatMasker used fixed sets of tandem repeat patterns to identify short-period (<10 bp) repeats. These low-complexity sequences account for a large portion of the false positive high-scoring matches between TE families and the genome, and by identifying these stretches early in the annotation process we were able to decrease the false positive rate significantly. For higher-order tandem sequences (satellites, micro/minisatellites, etc.), RepeatMasker relies on libraries of families to identify these patterns. A highly curated library will contain these sequences, and they will be matched along with the TE families known to be present in the species being searched.

At one point we switched to using TRF to perform the short-period tandem repeat search and found that it produced a fair number of false positive matches in our tests. We therefore developed a rescoring mechanism and a set of filters on TRF results, and limited the period of its search to avoid matching typical satellite sequences. If a species library does not contain satellite families, this will unfortunately leave these sequences unmasked.

We do plan to revamp this portion of the codebase in the future to allow for seamless adjudication of simple/tandem/satellite sequences, but it's still in the works. Meanwhile, satellite sequences should be included as part of the TE library (with the "#satellite" classification) in order to be detected, and if you want the lower-scoring TRF hits, a post-RepeatMasker run of TRF, as you have done, is a good way to go. In the short term, we will also consider simply adding a parameter to RepeatMasker to provide unfiltered TRF hits in the masking process.
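As an illustration of the suggestion to include satellite sequences in the library with a satellite classification, the sketch below tags FASTA headers that have no classification yet. It assumes the `>name#class` header convention for custom RepeatMasker libraries; the function name `tag_satellite_headers` is hypothetical, and the exact class string (`Satellite` here) is an assumption that should be checked against the library in use.

```python
def tag_satellite_headers(fasta_lines, classification="Satellite"):
    """Yield FASTA lines, appending '#<classification>' to unclassified headers.

    Headers that already carry a '#class' suffix are left untouched.
    The class string is an assumption; adjust it to match your library.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the sequence name from any free-text description.
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            rest = fields[1] if len(fields) > 1 else ""
            if name and "#" not in name:
                name = f"{name}#{classification}"
            line = ">" + name + (f" {rest}" if rest else "")
        yield line
```

For example, passing the lines of a FASTA file that contains only satellite consensus sequences through this function, then concatenating the result with the existing TE library, would give a single library in which those sequences carry a classification.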