JinfengChen / cdhit

Automatically exported from code.google.com/p/cdhit
GNU General Public License v2.0
0 stars 0 forks source link

cd-hit-otu-filter.pl incorrect coverage per position calculation #14

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. run the script

What is the expected output? What do you see instead?

Each position will have accurate average quality calculated based on coverage 
at that position.  Instead, positions that have low coverage will have much 
lower average quality values.  Also, the conversion of quality chars to phred 
scores could be optimized.

What version of the product are you using? On what operating system?

cd-hit-otu-illumina-0.0.1, CentOS 5.6 x86_64

Please provide any additional information below.

I've rewritten and tested these portions of the code.   See below:

Current:
      for ($i=0; $i<$lena; $i++) { my $c1 = ord(substr($quaa,$i,1))-$offset; $score_array[$i]  += $c1; }
      $score_count++;

New:
      @qual = split("", $quaa);
      @phred = map(ord($_)-$offset, @qual);
      $pos = 0;
      foreach (@phred) {
          $score_array[$pos] += $_;
          $score_counts[$pos] += 1;
          $pos += 1;          
      }
  process_score("$output.err", @score_counts, @score_array);

As such, the process_score method also needs a little tweaking to:

sub process_score {
  my ($output, @score_counts, @scores) = @_;
  my $len = $#scores+1;
  my ($i, $j, $k);
  open(SSS, "> $output") || die "can not write $output\n";

  for ($i=0; $i<$len; $i++){
    my $ave = int($scores[$i]/$score_counts[$i]);
    my $e = 1 / (10 ** ($ave/10));
    print SSS "$i\t$ave\t$e\n";
  }
  close(SSS);
}

Original issue reported on code.google.com by ch...@friedline.net on 5 Apr 2013 at 3:06

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry, have to deref those arrays.  I don't write Perl much anymore (for good 
reason)  ;-)

process_score("$output.err", \@score_counts, \@score_array);

sub process_score {
  my ($output, $score_counts_ref, $scores_ref) = @_;
  @scores = @{$scores_ref};
  @score_counts = @{$score_counts_ref};
  my $len = $#scores+1;
  my ($i, $j, $k);
  open(SSS, "> $output") || die "can not write $output\n";

  for ($i=0; $i<$len; $i++){
    my $ave = int($scores[$i]/$score_counts[$i]);
    my $e = 1 / (10 ** ($ave/10));
    print SSS "$i\t$ave\t$e\n";
  }
  close(SSS);
}

Original comment by ch...@friedline.net on 5 Apr 2013 at 5:00