Wrong region overlap count

ghost commented 8 years ago

We were just discussing the following possible issue; when doing an aggregation between the experiments e47071 - ENCFF000AST and e44810 - ENCFF000ARD, the region count aggregate appears to be wrong. For example, it contains the following:

H3K27me3 positions - mean - count
chr1    10150   10225   5.088   5

Control positions - value
chr1    10125   10150   6
chr1    10150   10175   5
chr1    10175   10200   5.76
chr1    10200   10225   4.44
chr1    10225   10250   4.24

The region count aggregate is 5, but it should only be 3; or is the interpretation different?

felipealbrecht commented 8 years ago

It works as expected. DeepBlue considers full overlap, and since the first region ends at position 10150 (the same as start position) , and the last regions starts at 10225 (the ends position of the peak)

felipealbrecht commented 8 years ago

It can be changed, if you have an reasonable reason and explanation of what must changes.

ghost commented 8 years ago

This way of counting regions is in obvious contrast to what is common when handling BED regions, which are defined as half-open intervals ( ] Consider the behavior of bedtools, file A.bed

chr1    150 225

and file B.bed

chr1    125 150
chr1    150 175
chr1    175 200
chr1    200 225
chr1    225 250

now the output of

bedtools intersect -wo -a A.bed -b B.bed

is the following

chr1    150 225 chr1    150 175 25
chr1    150 225 chr1    175 200 25
chr1    150 225 chr1    200 225 25

This output is in line with the definition of a (BED) region.

felipealbrecht commented 8 years ago

Okay, simply explain how must be made.

felipealbrecht commented 8 years ago

Does the start must be EQUALS and BIGGER The end must be only SMALLER ?

felipealbrecht commented 8 years ago

As it is a 'bug', I will fix it asap.

felipealbrecht commented 8 years ago

@mpi-pebert how about partial overlaps? they must be included in the aggregation? or only fully-overlapped regions?

ghost commented 8 years ago

Partial overlap needs to be taken into account to mimic the behavior of bedtools (which uses two parameters to set how much overlap [in percent] is necessary to be considered overlapping). The base overlap calculations in explicit form look like this:

IF B.end <= A.start OR B.start >= A.end:
    return 0
ELSE:
    return min(A.end, B.end)) - max(A.start, B.start)

Does that answer your question?

Edit: fixed "start - end" to "end - start"

felipealbrecht commented 8 years ago

Your pseudo code does not handle correctly the problem that you pointed you. In fact, the if coondition is exactly how it is made today, and it does not work correctly. The code in question is: https://github.com/MPIIComputationalEpigenetics/DeepBlue/blob/master/server/src/algorithms/aggregate.cpp#L77

ghost commented 8 years ago

Let's check... A and B as above...

IF B.end(150) <= A.start(150) OR ....
    return 0  # no overlap

Next region...

IF B.end(175) <= A.start(150) OR B.start(150) >= A.end(225):
    # is false
ELSE
    return min(A.end(225), B.end(175)) - max(A.start(150), B.start(150))
    return 175 - 150 = 25  # correct overlap

...and so on... only thing wrong with the pseudo code is that I made "start - end" instead of "end - start", but this does not affect the if condition that you mentioned.

Checking your code, however, reveals and AND [&&] if I am not mistaken...

 if (((*it_ranges)->start() <= (*it_data)->end()) && ((*it_ranges)->end() >= (*it_data)->start()))

felipealbrecht commented 8 years ago

Anyway, it is not it the problem thhat I say. And you are couting the length. I am NOT counting the length in the aggregation, but filtering the regions that must be aggregated.

felipealbrecht commented 8 years ago

Basicaly you give the solution for another problem, not the aggregation. Anyway, I am already testing it and I will commit today yet.

ghost commented 8 years ago

How would you correctly aggregate with partial overlaps if you are not counting the number of overlapping base pairs (e.g., to calculate the mean)?

felipealbrecht commented 8 years ago

I dont calculate the overlapping base pair. It is simply: does it overlap or dont.

felipealbrecht commented 8 years ago

if (((_it_data)->start() < (_it_ranges)->end()) && ((_it_data)->end() > (_it_ranges)->start())) { [..] Score score = (*it_data)->value(column->pos()); acc.push(score); }

felipealbrecht commented 8 years ago

But I will implement the 'partial' score as well. It is easy, just a cross multiplication.

felipealbrecht commented 8 years ago

@mpi-pebert it is in the servers. Please, let me know if you want me to remove any request where the processing was made incorrectly.

MPIIComputationalEpigenetics / DeepBlue

Wrong region overlap count #141