Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to deal with asterisks in the .out file? #167

Closed xyz0o closed 2 years ago

xyz0o commented 2 years ago

Should I remove the rows marked with an asterisk from my further analyses?

I masked the contig assembly of a Drosophila species with a custom-made library, and my .out table contains a large number of asterisks. According to the table description: "An asterisk (*) in the final column (no example shown) indicates that there is a higher-scoring match whose domain partly (<80%) includes the domain of this match." Does that mean the asterisks could indicate nested TEs, mostly in heterochromatic regions?

rmhubley commented 2 years ago

This is typically indicative of a problem with the TE library. A TE library should be as succinct as possible and should not contain redundant models. The asterisk indicates that two TE models (typically with conflicting classes) both align to the same region of your sequence. If the classifications do not match, RepeatMasker considers the conflict unresolvable without further curation. Curation typically finds that the two families in question are from the same class (or are a mosaic), and then either the classification is updated in the library or one of the families is deemed redundant and removed. Mosaicism will always cause a small number of conflicts like this, even in the best of libraries. But to your original question: we typically do not remove these rows, because with a curated library they represent a small fraction of the results. If you want to remove them automatically, you would need to define the criteria by which you pick the better annotation. Higher scores are typically used, but length or divergence may also be appropriate, depending on your needs.
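For anyone who wants to try the automatic route, here is a minimal Python sketch of the idea, not something shipped with RepeatMasker. It assumes the usual whitespace-separated .out layout (score, divergence, deletions, insertions, query, begin, end, left, strand, repeat, class/family, three repeat-position columns, ID, and an optional trailing `*`). The names `Hit`, `parse_out`, `better` and the `SAMPLE` rows are hypothetical and only serve as an illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    score: int
    divergence: float
    query: str
    start: int
    end: int
    repeat: str
    repeat_class: str
    flagged: bool  # True when the row ends with "*"


def parse_out(lines):
    """Parse annotation rows of a .out table (assumed column layout)."""
    hits = []
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer score.
        if not fields or not fields[0].isdigit():
            continue
        hits.append(Hit(
            score=int(fields[0]),
            divergence=float(fields[1]),
            query=fields[4],
            start=int(fields[5]),
            end=int(fields[6]),
            repeat=fields[9],
            repeat_class=fields[10],
            flagged=fields[-1] == "*",
        ))
    return hits


def better(a, b, criterion="score"):
    """Pick one of two conflicting hits by a user-chosen criterion."""
    if criterion == "score":
        return a if a.score >= b.score else b
    if criterion == "length":
        return a if (a.end - a.start) >= (b.end - b.start) else b
    if criterion == "divergence":
        return a if a.divergence <= b.divergence else b
    raise ValueError(f"unknown criterion: {criterion}")


# Made-up rows for illustration only; not real RepeatMasker output.
SAMPLE = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left) ID

 1200   10.5  1.0  0.5  contig1     100   900  (5000) + TE_A      LTR/Gypsy       1   800  (200)  1
  800   15.2  2.0  1.0  contig1     500  1100  (4800) C TE_B      DNA/P        (50)   600     1   2 *
"""

hits = parse_out(SAMPLE.splitlines())
unflagged = [h for h in hits if not h.flagged]
```

Dropping every flagged row, as in `unflagged`, amounts to always preferring the higher-scoring match. Use `better` with `"length"` or `"divergence"` if a different criterion suits your analysis.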

rmhubley commented 2 years ago

I am going to close this for now. Please let me know if you have any further questions.