Terim1 / zinba

Automatically exported from code.google.com/p/zinba
0 stars 0 forks source link

issue with mappability scoring algorithm #55

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi,
I noticed what might be an error in your method to calculate mappability, 
unless I misunderstood something...

When you shift the mappability score from the peakseq files, you shift it from 
each position both plus and minus half the average fragment length. I assume 
you do this to account for mappability for reads at both ends of the fragment. 
That means that each mid position of a fragment of length F, lets call it 
p_mid, will have the summed scores from positions p_up = p_mid – 0.5*F+1, and 
p_down = p_mid+0.5*F.

I assume that each position in peakseq mappability map stores the mappability 
score for a k-mer of length K starting in that position on the forward strand 
of the genome, right? 
If this is the case, the p_down score will actually reflect the mappability of 
a k_mere outside of the fragment, since it starts at the end of the fragment 
and ends in position p_down+K-1.
Here is an illustrative example

F = 200
K = 50

p_mid = 100
fragment_start = p_mid-0.5*F+1 = 1
fragment_end = p_mid+0.5*F = 200

p_up = p_mid-0.5*F+1 = 1
seq_up_start = p_up = 1
seq_up_end = seq_up_start+K-1 = 50

p_down = p_mid+0.5*F = 200
seq_down_start = p_down = 200
seq_down_end = seq_down_start+K-1 = 249

(seq_up_start, seq_up_end, seq_down_start and seq_down_end  denotes the start 
and end positions of the k-mer of length K queried for that position)

As you see, seq_down_end is outside of the fragment. To get the correct score 
from the downstream end of the fragment you should instead consider the 
mappability of the k-mer that ends in p_down, and therefore starts in 
p_down-K+1. 

I.e. change the following code in process_bin_scores.cc:
alignVal = 0;
alignVal += align_count[(i-adjustSize+1)];
alignVal += align_count[(i+adjustSize)];
ofile.write((const char*)&alignVal,1);

Into this:
alignVal = 0;
alignVal += align_count[(i-adjustSize+1)];
alignVal += align_count[(i+adjustSize-K+1)];
ofile.write((const char*)&alignVal,1);

Does this make any sense? Or did I misunderstand what is in the peakseq files? 

In any case, thank you for a nice peak calling tool!

Best regards,
Helena

Original issue reported on code.google.com by helena.s...@gmail.com on 4 Feb 2014 at 2:45