Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

How to filter the .align file #164

Closed elcortegano closed 2 years ago

elcortegano commented 2 years ago

Hi,

I am using the .align file as input to the calcDivergenceFromAlign.pl script, and I am wondering whether there is any way to filter the data in the alignment file (and in the other RepeatMasker output files) by % sequence identity and by length of the matched sequence.

RepeatMasker was run with a custom library for a well-known repeat motif of fixed length (1-2 kb), excluding small repeats (-nolow). However, the motif itself contains short microsatellite sequences that do appear in the output files, so the .align file is not specific to the motif actually being queried.

The RepeatMasker version is 4.1.2-p1, installed from bioconda.

rmhubley commented 2 years ago

In the util/ directory there is a script written by David Ray (RM2BED.py) that may be useful. It can read in a .out or .align file and filter on min_length or min/max divergence. Unfortunately, we do not have a universal set of tools to do this, just a set of ad hoc scripts that we have used internally over the years.
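
For anyone who wants to see what such a filter amounts to, below is a minimal sketch for the .out file only. It is not RM2BED.py and does not reproduce its options. It assumes the standard whitespace-separated .out layout (SW score, % divergence, % deletions, % insertions, query name, query begin, query end, ...), and it uses % divergence as a rough stand-in for 100 minus % identity. The function name `filter_out_lines`, its parameters, and the sample records are hypothetical and made up for illustration.

```python
from typing import Iterable, Iterator

# Made-up sample in the usual .out layout: header lines, a blank line, records.
EXAMPLE_OUT = """\
   SW   perc perc perc  query     position in query    matching  repeat        position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left)  ID

 1320   5.2  0.0  0.0  chr1        1001  2200 (7800) +  MyMotif   Satellite        1 1200    (0)  1
   21   0.0  0.0  0.0  chr1        3000  3024 (6976) +  (CA)n     Simple_repeat    1   25    (0)  2
"""


def filter_out_lines(
    lines: Iterable[str], min_length: int = 0, max_div: float = 100.0
) -> Iterator[str]:
    """Yield .out records whose query span and divergence pass the thresholds."""
    for line in lines:
        fields = line.split()
        # Skip header and blank lines: real records start with an integer score.
        if len(fields) < 14 or not fields[0].isdigit():
            continue
        perc_div = float(fields[1])          # column 2: % divergence
        q_begin = int(fields[5])             # column 6: start in query
        q_end = int(fields[6])               # column 7: end in query
        length = q_end - q_begin + 1         # matched length on the query
        if length >= min_length and perc_div <= max_div:
            yield line.rstrip("\n")


if __name__ == "__main__":
    for record in filter_out_lines(EXAMPLE_OUT.splitlines(), min_length=500, max_div=10.0):
        print(record)
```

With the thresholds above, the short microsatellite hit is dropped because its query span is only 25 bp. Filtering the .align file the same way would additionally require keeping each alignment block together with its header line, which this sketch does not attempt.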