Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Repeat identification difference due to the version change? #99

Open yifeng-evo opened 3 years ago

yifeng-evo commented 3 years ago

I was recently trying to reproduce the results of a 2014 paper in which the authors used RepeatModeler to identify repeats in a non-model organism. They obtained 398 consensus sequences from RepeatModeler. However, when I ran it on the same genome, I got 911 consensus sequences, and RepeatMasker masked more regions. Do you think this dramatic difference is due to changes in repeat identification across software versions over the past 6 years? Thanks a lot!

rmhubley commented 3 years ago

There have been three releases of RepeatModeler since 2014, and in the latest release (2.0) we added an LTR-specific discovery component to the pipeline. Without knowing which version they used, I would guess it's a combination of factors: improvements to the pipeline, additional discovery methods, and random sampling bias. RepeatModeler employs a sampling technique that increases the sample size as repeats are discovered and masked. When used on a typical mammalian genome, the sampling has little effect on the total number of families discovered, but it will have a slight effect on which families are discovered and how many copies are available to build a seed alignment. Because of this sampling bias, I encourage users to report the "random seed" displayed when they run the program, so that results can be reproduced exactly. If you can determine the version of RepeatModeler and the seed they used, I could get you a copy of that version to assist in your analysis.
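To illustrate why the seed matters, here is a minimal Python sketch of seeded sampling with a growing sample size per round. It is not RepeatModeler's actual code: the function name `sample_rounds`, the window size, the number of rounds, and the doubling factor are all hypothetical, chosen only to show that the same seed selects the same genome windows while a different seed generally selects different ones.

```python
import random


def sample_rounds(genome_length, seed, window_size=1_000, first_round=4, rounds=3):
    """Return the window start positions chosen in each sampling round.

    Hypothetical illustration only: a private generator is seeded once,
    and each round draws twice as many windows as the previous one.
    """
    rng = random.Random(seed)  # private generator, so the seed fully determines the draws
    max_start = genome_length - window_size
    plan = []
    n_windows = first_round
    for _ in range(rounds):
        starts = sorted(rng.randrange(max_start + 1) for _ in range(n_windows))
        plan.append(starts)
        n_windows *= 2  # assumed growth factor, for illustration
    return plan


run_a = sample_rounds(1_000_000, seed=12345)
run_b = sample_rounds(1_000_000, seed=12345)
run_c = sample_rounds(1_000_000, seed=67890)

print(run_a == run_b)  # True: the same seed reproduces the same windows
print(run_a == run_c)  # a different seed will almost certainly pick different windows
```

Since which windows are sampled determines which repeat copies are seen, two runs on the same genome with different seeds can plausibly yield somewhat different family sets, which is why recording the seed alongside the software version helps when comparing results.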