About the selection of transposition database

These are good questions. It really depends on what is available in these other databases. They might not contain anything relevant to your organism, or they may largely overlap the families you already found using de novo methods. Working this out requires some direct comparisons between the libraries, paying close attention to the taxa labels defined for the families to avoid false matches. If such a combined analysis is preferred, I would recommend merging only the relevant records into a single FASTA file and using the "-lib" option of RepeatMasker to do a single analysis. That way all families compete equally for the annotation. If you prefer one library over the other, you could break this into two runs: the preferred library masks/annotates the genome first, followed by a second run using the masked sequence file and your secondary library. You would then need to combine the results from both runs to assess them.
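To make the merge step concrete, here is a minimal Python sketch, not anything RepeatMasker ships. It assumes both libraries are plain FASTA files whose headers start with a unique family name. The function names (`parse_fasta`, `merge_libraries`, `format_fasta`) and the `keep` predicate are hypothetical; what counts as a "relevant" record is something you would define after comparing the libraries.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_libraries(
    denovo: Iterable[Record],
    other: Iterable[Record],
    keep: Callable[[str], bool],
) -> List[Record]:
    """Keep every de novo record and add only the `other` records that pass `keep`.

    Raises ValueError on a family-name collision so that duplicates are
    resolved by hand rather than silently overwritten.
    """
    merged = list(denovo)
    seen = {header.split()[0] for header, _ in merged}
    for header, seq in other:
        if not keep(header):
            continue  # record judged not relevant (assumed, user-defined rule)
        name = header.split()[0]
        if name in seen:
            raise ValueError(f"duplicate family name: {name}")
        seen.add(name)
        merged.append((header, seq))
    return merged


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records back to FASTA text with wrapped sequence lines."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

With the merged text written to a file (say `combined.fa`, a placeholder name), the single analysis would be `RepeatMasker -lib combined.fa genome.fa`. For the two-run variant, the first run would use the preferred library on `genome.fa`, and the second run would use the secondary library on the masked output of the first run (RepeatMasker writes it next to the input with a `.masked` suffix).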

In terms of divergence calculations, the impact is hard to characterize in advance. By combining libraries you may end up with duplicate families where one outcompetes the other in annotating sequence. This will change the number and character of alignments assigned to the de novo identified family and thus change its average divergence. Likewise, if duplicate families have different classifications, the same effect might be seen on the landscape graph. In the best case the changes would be minor.
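A toy Python illustration of that effect follows. The alignment tuples are invented numbers, the family names are hypothetical, and the length-weighted mean is only one reasonable way to summarize divergence; it is not claimed to be the formula RepeatMasker's own scripts use.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (family name, aligned length in bp, percent divergence) -- illustrative only
Alignment = Tuple[str, int, float]


def mean_divergence(alignments: Iterable[Alignment]) -> Dict[str, float]:
    """Length-weighted mean percent divergence per family."""
    total_len: Dict[str, int] = defaultdict(int)
    weighted: Dict[str, float] = defaultdict(float)
    for family, length, divergence in alignments:
        total_len[family] += length
        weighted[family] += length * divergence
    return {family: weighted[family] / total_len[family] for family in total_len}


# De novo library alone: both copies are assigned to the de novo family.
before = [("denovo_fam", 1000, 5.0), ("denovo_fam", 1000, 15.0)]

# Combined library: a duplicate curated family wins the low-divergence copy.
after = [("curated_fam", 1000, 5.0), ("denovo_fam", 1000, 15.0)]

print(mean_divergence(before))  # {'denovo_fam': 10.0}
print(mean_divergence(after))   # {'curated_fam': 5.0, 'denovo_fam': 15.0}
```

The same sequence is annotated in both cases, yet the de novo family's average shifts from 10% to 15% purely because of which family won each alignment.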

Dfam-consortium / RepeatMasker
