Closed ghost closed 8 years ago
It works as expected. DeepBlue considers full overlap, and since the first region ends at position 10150 (the same as start position) , and the last regions starts at 10225 (the ends position of the peak)
It can be changed, if you have an reasonable reason and explanation of what must changes.
This way of counting regions is in obvious contrast to what is common when handling BED regions, which are defined as half-open intervals ( ] Consider the behavior of bedtools, file A.bed
chr1 150 225
and file B.bed
chr1 125 150
chr1 150 175
chr1 175 200
chr1 200 225
chr1 225 250
now the output of
bedtools intersect -wo -a A.bed -b B.bed
is the following
chr1 150 225 chr1 150 175 25
chr1 150 225 chr1 175 200 25
chr1 150 225 chr1 200 225 25
This output is in line with the definition of a (BED) region.
Okay, simply explain how must be made.
Does the start must be EQUALS and BIGGER The end must be only SMALLER ?
As it is a 'bug', I will fix it asap.
@mpi-pebert how about partial overlaps? they must be included in the aggregation? or only fully-overlapped regions?
Partial overlap needs to be taken into account to mimic the behavior of bedtools (which uses two parameters to set how much overlap [in percent] is necessary to be considered overlapping). The base overlap calculations in explicit form look like this:
IF B.end <= A.start OR B.start >= A.end:
return 0
ELSE:
return min(A.end, B.end)) - max(A.start, B.start)
Does that answer your question?
Edit: fixed "start - end" to "end - start"
Your pseudo code does not handle correctly the problem that you pointed you. In fact, the if coondition is exactly how it is made today, and it does not work correctly. The code in question is: https://github.com/MPIIComputationalEpigenetics/DeepBlue/blob/master/server/src/algorithms/aggregate.cpp#L77
Let's check... A and B as above...
IF B.end(150) <= A.start(150) OR ....
return 0 # no overlap
Next region...
IF B.end(175) <= A.start(150) OR B.start(150) >= A.end(225):
# is false
ELSE
return min(A.end(225), B.end(175)) - max(A.start(150), B.start(150))
return 175 - 150 = 25 # correct overlap
...and so on... only thing wrong with the pseudo code is that I made "start - end" instead of "end - start", but this does not affect the if condition that you mentioned.
Checking your code, however, reveals and AND [&&] if I am not mistaken...
if (((*it_ranges)->start() <= (*it_data)->end()) && ((*it_ranges)->end() >= (*it_data)->start()))
Anyway, it is not it the problem thhat I say. And you are couting the length. I am NOT counting the length in the aggregation, but filtering the regions that must be aggregated.
Basicaly you give the solution for another problem, not the aggregation. Anyway, I am already testing it and I will commit today yet.
How would you correctly aggregate with partial overlaps if you are not counting the number of overlapping base pairs (e.g., to calculate the mean)?
I dont calculate the overlapping base pair. It is simply: does it overlap or dont.
if (((_it_data)->start() < (_it_ranges)->end()) && ((_it_data)->end() > (_it_ranges)->start())) { [..] Score score = (*it_data)->value(column->pos()); acc.push(score); }
But I will implement the 'partial' score as well. It is easy, just a cross multiplication.
@mpi-pebert it is in the servers. Please, let me know if you want me to remove any request where the processing was made incorrectly.
We were just discussing the following possible issue; when doing an aggregation between the experiments e47071 - ENCFF000AST and e44810 - ENCFF000ARD, the region count aggregate appears to be wrong. For example, it contains the following:
The region count aggregate is 5, but it should only be 3; or is the interpretation different?