Closed by tbusschau 1 year ago
Oh, don't combine the ".cat" files; that will cause all sorts of problems. Those files are not meant to be altered between runs. Internally they use identifiers for each alignment that are unique within one invocation but are not guaranteed to be unique between invocations, so combined files will contain overlapping identifiers that can cause problems for ProcessRepeats.
Perhaps you are doing this only for the summary statistics you mentioned (the ".tbl" file)? If so, you should consider using the RepeatMasker utility util/buildSummary.pl to obtain the summary statistics instead.
Since you are getting annotation files (".out") from each run, to merge those into a single output you will need to merge the IDs (also not globally unique) to avoid the same issue. I would recommend adding a prefix to the ID column in each ".out" file (e.g., "1" becomes "A_1" in the first file, "1" becomes "B_1" in the next file, etc.). Then combine the files and sort on seq_id, then start_pos. Finally, you can convert the IDs back to numbers by using a hash/dictionary to assign an increasing number to each newly observed ID string. For example, in Perl:
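use strict;
use warnings;
# Renumber the prefixed IDs in the combined, sorted .out file
open( my $in, '<', 'combined_sorted.out' )
    or die "Cannot open combined_sorted.out: $!";
# Maps each prefixed ID string (e.g. "A_1") to its new numeric ID
my %id_hash = ();
my $new_id  = 1;
while (<$in>) {
    # Get rid of spaces in front of scores
    s/^\s+//;
    # Only keep lines starting with a number (skips headers and blank lines)
    next if ( !/^\d+/ );
    # Split into fields
    my @out_fields = split(/\s+/);
    # Grab the ID column
    my $ID = $out_fields[14];
    # Assign the next number the first time an ID string is seen
    if ( !exists $id_hash{$ID} ) {
        $id_hash{$ID} = $new_id;
        $new_id++;
    }
    $out_fields[14] = $id_hash{$ID};
    print join( " ", @out_fields ) . "\n";
}
close($in);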
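The Perl snippet covers only the final renumbering. As a rough illustration of the whole procedure (prefix, combine, sort, renumber), here is a minimal sketch in Python rather than Perl. It is not part of RepeatMasker: the names `annotation_fields` and `merge_out` and the run labels are hypothetical, and it assumes the usual whitespace-separated ".out" layout with the sequence name, start position, and ID in the 5th, 6th, and 15th columns.

```python
from typing import Dict, Iterable, List

SEQ_COL = 4     # 0-based index of the query sequence name (seq_id)
START_COL = 5   # 0-based index of the start position in the query
ID_COL = 14     # 0-based index of the ID column (same column as the Perl snippet)


def annotation_fields(lines: Iterable[str]) -> List[List[str]]:
    """Split annotation lines into fields, skipping header and blank lines."""
    rows = []
    for line in lines:
        fields = line.split()
        # Annotation lines start with a numeric score and reach the ID column.
        if len(fields) > ID_COL and fields[0].isdigit():
            rows.append(fields)
    return rows


def merge_out(runs: Dict[str, Iterable[str]]) -> List[str]:
    """Merge .out lines from several runs, keyed by a per-run prefix."""
    rows = []
    for prefix, lines in runs.items():
        for fields in annotation_fields(lines):
            # Step 1: make the ID unique across runs, e.g. "1" -> "A_1".
            fields[ID_COL] = f"{prefix}_{fields[ID_COL]}"
            rows.append(fields)

    # Step 2: sort the combined rows on sequence name, then start position.
    rows.sort(key=lambda f: (f[SEQ_COL], int(f[START_COL])))

    # Step 3: map each prefixed ID to an increasing number, in first-seen order.
    new_ids: Dict[str, int] = {}
    for fields in rows:
        number = new_ids.setdefault(fields[ID_COL], len(new_ids) + 1)
        fields[ID_COL] = str(number)

    return [" ".join(fields) for fields in rows]
```

A call might look like `merge_out({"A": open("run1.out"), "B": open("run2.out")})`, where the file names are placeholders. Fragments that shared an ID within one run still share an ID after merging, while equal IDs from different runs are kept apart. Header lines are dropped, as in the Perl version.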
In RepeatMasker 4.1.5 (released yesterday) I added a script to do this type of result merging with extensions to also fix *.align files (if generated). The script is in the util directory and is called "combineRMFiles.pl".
@rmhubley Thank you. This is very helpful. I've tested different libraries and it seems my very high percentage in the .tbl summary is specific to one custom library.
I'm using RepeatMasker to serially mask my genome assembly with different libraries. First I soft-masked simple repeats. Then I hard-masked that output with Tetrapod elements from Repbase, and this output was in turn hard-masked using a custom library. The custom library was generated from multiple species using RepeatModeler, combining the families and removing redundant sequences. After the third round of masking, the .tbl summary shows a very high 'percentage of sequence' for one group: 17109.00 %. It does not correspond to the 'length occupied'. See the .tbl below. What could be the reason for this?
As part of the workflow I then combine the .cat files and run ProcessRepeats. This produces a .tbl file with a similar percentage for these elements, so I suspect the issue is somewhere in the .cat file. It still does not make sense, though, since the 'length occupied' is very low.
known_mask.txt
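The mismatch described above can be checked with simple arithmetic. The sketch below is Python, with hypothetical function names. It assumes that 'percentage of sequence' is meant to be 'length occupied' divided by the total sequence length, times 100. The lengths in the usage note are made-up illustration values, not taken from the attached file.

```python
def percent_of_sequence(bp_occupied: int, total_bp: int) -> float:
    """Percentage of the sequence covered by bp_occupied bases."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return 100.0 * bp_occupied / total_bp


def is_consistent(reported_pct: float, bp_occupied: int, total_bp: int,
                  tol: float = 0.01) -> bool:
    """True if a reported percentage matches the value recomputed from lengths."""
    return abs(reported_pct - percent_of_sequence(bp_occupied, total_bp)) <= tol
```

For instance, with an illustrative 5,000 bp occupied in a 1,000,000 bp assembly, `percent_of_sequence(5000, 1_000_000)` gives 0.5, so `is_consistent(17109.00, 5000, 1_000_000)` is false. A class covering part of the sequence cannot exceed 100 %, so a value like 17109.00 % points to a problem in the inputs to the summary rather than to real coverage.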