Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

About the selection of transposition database #162

Closed YangNan97 closed 2 years ago

YangNan97 commented 2 years ago

Dear developers

Hello, I'm studying the transposons of Orthoptera insects. I used RepeatModeler to build a species-specific transposon library, and then used that self-built library to annotate the transposons with RepeatMasker. Is it necessary to add a public database such as RepBase on top of the self-built library? If so, how should I merge the self-built library with RepBase? Will merging the libraries affect the analysis of transposon divergence? Does it affect the results of calcDivergenceFromAlign.pl and createRepeatLandscape.pl?

I look forward to your reply very much.

rmhubley commented 2 years ago

These are good questions. It really depends on what is available in these other databases. There might not be anything relevant to your organism in them, or they may largely overlap the families you already found using de novo methods. Finding out requires some direct comparisons between the libraries, paying close attention to the taxa labels defined for the families to avoid false matches. If such a combined analysis is preferred, I would recommend merging only the relevant records into a single FASTA file and using the "-lib" option of RepeatMasker to do a single analysis. That way all families compete equally for the annotation. If you prefer one library over the other, you could break this into two runs: the preferred library masks/annotates the genome first, followed by a second run using the masked sequence file and your secondary library. You would then need to combine the results from both runs to assess them.
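A minimal sketch of the merge step and the two run layouts described above, in Python. The helper names (`read_fasta`, `merge_libraries`, `single_pass_command`, `two_pass_commands`) are hypothetical and not part of RepeatMasker. The sketch assumes both libraries are plain FASTA files, that records with an identical ID (the first token of the header) are true duplicates, and that the first run leaves its masked output at `<genome>.masked`. Selecting only the relevant records and spotting near-duplicate families under different names still has to be done by hand. The commands are only built, not executed; `-a` is included so that alignments are written for later divergence analysis.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(lib_paths, out_path, width=60):
    """Concatenate FASTA libraries into out_path and return the record count.

    Assumption: a record whose ID (first header token) was already written
    is an exact duplicate and is skipped. Earlier libraries win.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for lib in lib_paths:
            for header, seq in read_fasta(lib):
                tokens = header.split()
                family_id = tokens[0] if tokens else ""
                if family_id in seen:
                    continue
                seen.add(family_id)
                out.write(">" + header + "\n")
                for start in range(0, len(seq), width):
                    out.write(seq[start:start + width] + "\n")
                written += 1
    return written


def single_pass_command(library, genome):
    """One run in which all families compete for the annotation."""
    return ["RepeatMasker", "-a", "-lib", library, genome]


def two_pass_commands(primary, secondary, genome):
    """Preferred library first, then the secondary library on the masked output.

    Assumption: the first run writes its masked sequence to '<genome>.masked'.
    """
    first = ["RepeatMasker", "-a", "-lib", primary, genome]
    second = ["RepeatMasker", "-a", "-lib", secondary, genome + ".masked"]
    return [first, second]
```

With hypothetical file names, `merge_libraries(["denovo-families.fa", "public-subset.fa"], "combined.fa")` followed by `subprocess.run(single_pass_command("combined.fa", "genome.fa"), check=True)` would correspond to the single-analysis option. Listing the de novo library first means its record is the one kept when IDs collide.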

In terms of divergence calculations, the impact is hard to characterize in advance. By combining libraries you may end up with duplicate families where one outcompetes the other in annotating sequence. This will change the number and character of alignments assigned to the de novo identified family, and thus change its average divergence. Likewise, if duplicate families carry different classifications, the same effect may show up in the landscape graph. In the best case, the changes would be minor.
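To make the reassignment effect concrete, here is a toy sketch in Python. Everything in it is illustrative: the family names and numbers are invented, `mean_divergence` is a hypothetical helper, and a simple length-weighted mean stands in for whatever calcDivergenceFromAlign.pl actually computes.

```python
from collections import defaultdict


def mean_divergence(alignments):
    """Length-weighted mean divergence per family.

    alignments: iterable of (family, percent_divergence, aligned_length).
    Assumption: a plain length-weighted mean, used here only for illustration.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for family, divergence, length in alignments:
        totals[family][0] += divergence * length
        totals[family][1] += length
    return {fam: s / n for fam, (s, n) in totals.items() if n > 0}


# Invented example: with the de novo library alone, one family holds both copies.
denovo_only = [
    ("rnd-1_family-5", 5.0, 1000),
    ("rnd-1_family-5", 20.0, 1000),
]

# After merging, a hypothetical duplicate family claims the low-divergence copy.
combined = [
    ("PublicDuplicate", 5.0, 1000),
    ("rnd-1_family-5", 20.0, 1000),
]

before = mean_divergence(denovo_only)["rnd-1_family-5"]  # 12.5
after = mean_divergence(combined)["rnd-1_family-5"]      # 20.0
```

In this made-up case the de novo family's average rises from 12.5 to 20.0 only because one alignment was reassigned, not because the underlying sequence changed.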