Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Unusual summary in .tbl #205

Closed tbusschau closed 1 year ago

tbusschau commented 1 year ago

I'm using RepeatMasker to serially mask my genome assembly with different libraries. First, I soft-masked simple repeats. I then hard-masked that output with Tetrapod elements from Repbase, and hard-masked the result again using a custom library. The custom library was generated from multiple species using RepeatModeler, combining the families and removing redundant sequences. After the third round of masking, the .tbl summary shows a very high 'percentage of sequence' for one group: 17109.00 %. It does not correspond to the 'length occupied'. See the .tbl below. What could be the reason for this?

As part of the workflow, I then combine the .cat files and run ProcessRepeats. This produces a .tbl file with a similar percentage for these elements, so I think the issue is somewhere in the .cat file. It still does not make sense, though, since the 'length occupied' is very low.
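
To make the mismatch concrete, here is a minimal sketch in Python. It assumes that 'percentage of sequence' is meant to be 'length occupied' divided by the total sequence length; the function names and the 1 Mbp total are hypothetical and not taken from the attached .tbl.

```python
def percent_of_sequence(length_occupied_bp: int, total_length_bp: int) -> float:
    """Percentage of the sequence covered, assuming percent = occupied / total * 100."""
    if total_length_bp <= 0:
        raise ValueError("total length must be positive")
    return 100.0 * length_occupied_bp / total_length_bp


def implied_length_bp(percent: float, total_length_bp: int) -> float:
    """Length (bp) that a reported percentage would imply for a given total length."""
    return percent / 100.0 * total_length_bp


# Hypothetical total length, for illustration only.
total_bp = 1_000_000

# A small 'length occupied' gives a small percentage under this assumption.
print(percent_of_sequence(5_000, total_bp))

# A reported 17109.00 % would imply far more bases than the sequence contains.
print(implied_length_bp(17109.00, total_bp) > total_bp)
```

Under that assumption, any value above 100 % cannot come from the 'length occupied' column alone, which points at the bookkeeping behind the summary rather than at the masking itself.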

known_mask.txt

rmhubley commented 1 year ago

Oh, don't combine the ".cat" files; that will cause all sorts of problems. Those files are not meant to be altered between runs. They use internal identifiers for each alignment that are unique only within a single run, not between invocations, so a combined file will contain colliding IDs that could cause problems for ProcessRepeats. Perhaps you are doing this only for the summary statistics you mentioned (the ".tbl" file)? If so, you should consider using the RepeatMasker utility util/buildSummary.pl to obtain the summary statistics instead.

Since you are getting annotation files (".out") from each run, to merge those into a single output you will need to remap the IDs (also not globally unique) to avoid the same issue. I would recommend adding a prefix to that column in each ".out" file (e.g. "1" becomes "A_1", and in the next file "1" becomes "B_1", etc.). Then combine the files and sort on seq_id, start_pos. Finally, you can convert the IDs back to numbers using a hash/dictionary to assign an increasing number to each observed ID string. E.g. in Perl:

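use strict;
use warnings;

# Read the combined, sorted ".out" file (IDs already prefixed per run)
open my $in, '<', 'combined_sorted.out'
  or die "Cannot open combined_sorted.out: $!";
my %id_hash;
my $new_id = 1;
while ( my $line = <$in> ) {
  chomp $line;
  # Get rid of spaces in front of scores
  $line =~ s/^\s+//;
  # Only keep lines starting with a number (skips headers and blank lines)
  next unless $line =~ /^\d+/;
  # Split into fields
  my @out_fields = split /\s+/, $line;
  # Grab the ID column (15th field, index 14)
  my $id = $out_fields[14];
  # Assign the next number the first time an ID string is seen
  $id_hash{$id} = $new_id++ unless exists $id_hash{$id};
  $out_fields[14] = $id_hash{$id};
  print join( " ", @out_fields ), "\n";
}
close $in;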
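
The earlier steps (prefixing the ID column per run, combining, and sorting on seq_id, start_pos) can be sketched together with the renumbering. This is a minimal Python sketch, not part of RepeatMasker: the function names and the sample rows are hypothetical, and it assumes whitespace-separated ".out" rows with the sequence name in field 5, the start position in field 6, and the ID in field 15.

```python
from typing import Dict, Iterable, List


def parse_out_lines(lines: Iterable[str], prefix: str) -> List[List[str]]:
    """Return the data rows of one ".out" file with the ID column prefixed."""
    rows = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer score; headers and blank lines do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        fields[14] = f"{prefix}_{fields[14]}"  # e.g. "1" becomes "A_1"
        rows.append(fields)
    return rows


def merge_out_rows(runs: Dict[str, Iterable[str]]) -> List[List[str]]:
    """Combine several runs, sort by (sequence, start), and renumber the IDs."""
    merged: List[List[str]] = []
    for prefix, lines in runs.items():
        merged.extend(parse_out_lines(lines, prefix))
    # Sort on sequence name, then on numeric start position.
    merged.sort(key=lambda f: (f[4], int(f[5])))
    # Map each prefixed ID string to the next unused integer, in order seen.
    id_map: Dict[str, int] = {}
    for fields in merged:
        fields[14] = str(id_map.setdefault(fields[14], len(id_map) + 1))
    return merged


# Hypothetical rows: run A has two fragments sharing ID 1, run B reuses ID 1.
run_a = [
    "   SW   perc perc perc  query  position in query",
    "  463   1.3  0.6  1.7  chr1   100   200 (1000) +  L1  LINE/L1   1  101 (0) 1",
    "  300   2.0  0.0  0.0  chr1   500   600 (600)  +  L1  LINE/L1 150  250 (0) 1",
]
run_b = [
    "  250   5.0  0.0  0.0  chr1   300   400 (800)  C  Alu SINE/Alu (0)  100  1 1",
]

merged = merge_out_rows({"A": run_a, "B": run_b})
for row in merged:
    print(" ".join(row))
```

In this example the two fragments from run A keep a shared ID after renumbering, while the row from run B, which originally also carried ID 1, receives its own number.
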
rmhubley commented 1 year ago

In RepeatMasker 4.1.5 (released yesterday) I added a script to do this type of result merging with extensions to also fix *.align files (if generated). The script is in the util directory and is called "combineRMFiles.pl".

tbusschau commented 1 year ago

@rmhubley Thank you. This is very helpful. I've tested different libraries and it seems my very high percentage in the .tbl summary is specific to one custom library.