Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to merge different RepeatMasker genome? #5

Closed: tiramisutes closed this issue 6 years ago

tiramisutes commented 6 years ago

Hi, I want to repeat-mask the genome I am studying. First, I generated a de novo repeat library with RepeatModeler (consensi.fa.classified). In the RepeatMasker step, I ran RepeatMasker twice, once with the -species parameter and once with the -lib parameter, which gave me two results. My question is: how do I merge these two masked genomes for subsequent analysis?

Any help is much appreciated. Thanks.

rmhubley commented 6 years ago

There are a couple of issues to consider with this type of analysis. The first is the quality of the custom library you are feeding to RepeatMasker. Typically, a RepeatModeler library needs some curation before it is used with RepeatMasker (e.g. removing redundantly discovered repeat fragments, extending fragments to full length, identifying subfamilies, and removing any ancestral repeats that are already cataloged in Repbase or Dfam-consensus).

Second, you will be better served by using RepeatMasker to mask the sequence serially rather than independently. Depending on which organism you are working with, there may be repeats already defined in Repbase for related species or clades. In that case you would probably run your genome through with the "-species" option first. Then take the *.masked file and run it again with your library ("-lib") to generate additional annotations that can easily be merged with the first. Ideally, your new library would be submitted to Dfam-consensus and incorporated into the RepeatMasker libraries, so that all sequences can be correctly competed against each other during the search phase. Let me know if you have any further questions.
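The serial workflow described above can be sketched as a small Python helper that only builds the two command lines. This is a minimal sketch under stated assumptions, not an official recipe: the function name `serial_masking_commands`, the output directory names, and the example file names are hypothetical. The `-species`, `-lib`, and `-dir` options are RepeatMasker options, and the sketch relies on RepeatMasker naming its masked output `<input file name>.masked`.

```python
from pathlib import Path


def serial_masking_commands(genome, species, custom_lib, out_dir="rm_out"):
    """Build the two RepeatMasker command lines for serial masking.

    Round 1 masks with the curated library (-species); round 2 feeds the
    round-1 *.masked file back in with the custom library (-lib).
    All directory and file names here are illustrative assumptions.
    """
    genome = Path(genome)
    round1_dir = Path(out_dir) / "round1_species"
    round2_dir = Path(out_dir) / "round2_lib"

    # RepeatMasker writes the masked sequence as "<input file name>.masked".
    round1_masked = round1_dir / (genome.name + ".masked")

    round1 = ["RepeatMasker", "-species", species,
              "-dir", str(round1_dir), str(genome)]
    round2 = ["RepeatMasker", "-lib", str(custom_lib),
              "-dir", str(round2_dir), str(round1_masked)]
    return round1, round2


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even on a machine without RepeatMasker installed.
    for cmd in serial_masking_commands("genome.fa", "human",
                                       "consensi.fa.classified"):
        print(" ".join(cmd))
```

To actually run the rounds, each list could be passed to `subprocess.run(cmd, check=True)` in order, after creating the two output directories. Round 2 must wait for round 1 to finish, because its input is the round-1 masked file.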

tiramisutes commented 6 years ago

Thanks. I will redo the repeat masking as you suggested.

minhasbushra commented 2 years ago

Thanks, I followed your suggestion and ran repeat masking in two rounds: round 1 with the species option, then a second round with my custom library, using the masked genome from round 1 as input. How can I combine the .tbl files from the two runs, and how do I get the total amount of the genome that is masked and the percentage of each repeat type? I assume the masked genome from the second round is ready to use for genome annotation. Or am I missing something? When I compared the two masked files, I saw that some sequences that were masked in the first round (with the species option) were unmasked in the second round (with the custom library). I had assumed that sequence masked in the first round would stay masked.
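One way to check the two points raised above (the total masked fraction, and whether round-1 masking survived round 2) is to compare the two masked FASTA files directly. The following is a minimal sketch, not part of RepeatMasker. The helper names are hypothetical, and it assumes hard masking with N or X (or lowercase when `soft=True`) and that both files contain the same sequences at the same lengths. Note that N bases already present in the assembly (gaps) are counted as masked too, so the figure will not necessarily match the .tbl summary.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the ID before any whitespace
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def is_masked(base, soft=False):
    """True for hard-masked bases (N/X); with soft=True also for lowercase."""
    return base in "NnXx" or (soft and base.islower())


def masked_fraction(records, soft=False):
    """Fraction of all bases that are masked (assembly gaps count as masked)."""
    total = sum(len(seq) for seq in records.values())
    masked = sum(is_masked(b, soft) for seq in records.values() for b in seq)
    return masked / total if total else 0.0


def lost_masking(round1, round2, soft=False):
    """Per sequence, count positions masked in round 1 but not in round 2."""
    lost = {}
    for name, seq1 in round1.items():
        seq2 = round2.get(name, "")
        if len(seq1) != len(seq2):
            raise ValueError(f"length mismatch for {name}")
        n = sum(is_masked(a, soft) and not is_masked(b, soft)
                for a, b in zip(seq1, seq2))
        if n:
            lost[name] = n
    return lost


if __name__ == "__main__":
    # Tiny in-memory example; real files would be read with open(path).read().
    r1 = parse_fasta(">chr1\nACGTNNNNACGT\n")
    r2 = parse_fasta(">chr1\nNNGTNNNNACGT\n")
    print(f"round 1 masked: {masked_fraction(r1):.1%}")
    print(f"round 2 masked: {masked_fraction(r2):.1%}")
    print("positions that lost masking:", lost_masking(r1, r2))
```

If `lost_masking` returns an empty dictionary, every position masked in round 1 is still masked in the round-2 output. A non-empty result points to the sequences worth inspecting. For whole genomes, reading the files record by record would use less memory than this in-memory sketch.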