Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Length of identified repeats #102

Open FabianDK opened 3 years ago

FabianDK commented 3 years ago

In your paper you report that RepeatModeler2 produces a low number of false positives. I am wondering, however, whether short repeats are more likely to be false positives than longer ones.

In my analysis, I obtained 464 repeats, of which about 10% are shorter than 100 bp and almost 50% are shorter than 500 bp (min = 56 bp, max = 17331 bp, average = 1131 bp).
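For anyone who wants to produce the same kind of length summary for their own library, here is a minimal sketch in plain Python. It assumes the library is an ordinary FASTA file; the function names `read_fasta_lengths` and `summarize`, and the 100 bp / 500 bp cutoffs, are illustrative choices and not part of RepeatModeler.

```python
from statistics import mean


def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID,
            # falling back to a generated name if the header is empty.
            parts = line[1:].split()
            current = parts[0] if parts else f"seq{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            # Sequence line: accumulate its length under the current ID.
            lengths[current] += len(line)
    return lengths


def summarize(lengths, cutoffs=(100, 500)):
    """Summarize a {id: length} mapping; cutoffs are hypothetical thresholds."""
    values = list(lengths.values())
    summary = {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }
    for cutoff in cutoffs:
        # Fraction of sequences strictly shorter than the cutoff.
        summary[f"frac_below_{cutoff}"] = sum(v < cutoff for v in values) / len(values)
    return summary


# Usage (the path is a placeholder for your own library file):
# with open("families.fa") as handle:
#     print(summarize(read_fasta_lengths(handle)))
```

The summary only describes the length distribution; it says nothing about which of the short sequences are false positives.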

Would you recommend filtering the identified repeat sequences by a minimum length?

rmhubley commented 3 years ago

Sorry for the long delay. It is hard to say from size alone. It really depends on the organism, the classes of TEs, etc. In many cases shorter sequences may simply be fragments of true but much longer families. In curating a de novo generated library we typically take the longer sequences first and then, after curation (i.e., extension), we compare the smaller fragments against the curated library to see if we can discard duplicated results or identify subfamilies. The remaining set is then extended (if possible) and a final library is generated.
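The ordering described above (longer sequences first, then shorter ones checked against them) can be sketched as a toy example. This is not the curation procedure itself: exact substring matching on either strand stands in for the alignment-based comparison a real workflow would use, and the `min_long` threshold and the function names `reverse_complement` and `triage_by_length` are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def triage_by_length(families, min_long=500):
    """Split families into long and short sets and flag contained fragments.

    families: dict mapping family name -> consensus sequence.
    min_long: hypothetical length threshold separating the two sets.
    Returns (long_names, candidate_fragments, unplaced), where
    candidate_fragments maps a short family to the long family containing it.
    """
    # Longest sequences first, mirroring the "take the longer ones first" step.
    ordered = sorted(families.items(), key=lambda kv: len(kv[1]), reverse=True)
    long_set = [(n, s.upper()) for n, s in ordered if len(s) >= min_long]
    short_set = [(n, s.upper()) for n, s in ordered if len(s) < min_long]

    candidate_fragments = {}
    unplaced = []
    for name, seq in short_set:
        rc = reverse_complement(seq)
        # Toy stand-in for alignment: exact containment on either strand.
        hit = next((ln for ln, ls in long_set if seq in ls or rc in ls), None)
        if hit is None:
            unplaced.append(name)  # candidates for extension, not for deletion
        else:
            candidate_fragments[name] = hit
    return [n for n, _ in long_set], candidate_fragments, unplaced
```

Real fragments rarely match a consensus exactly, so in practice the containment test would be replaced by a sequence search with an identity and coverage threshold. The point of the sketch is only that short sequences are triaged against the longer set instead of being dropped by a length cutoff.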