DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0
18 stars 11 forks source link

Cover messes up with strand #100

Closed pp86 closed 6 years ago

pp86 commented 6 years ago

consider the statement: COVER(2, ANY) When applied on this dataset:

chr15   10  20  *
chr15   10  20  *
chr15   40  50  +

returns: chr15 10 20 + 2 1 1

If instead is applied to the dataset (note the strand of the last region):

chr15   10  20  *
chr15   10  20  *
chr15   40  50  -

returns: chr15 10 20 * 2 1 1

I.e., in the first case the strand is changed to +, but in the second is not changed to -.

According to the documentation I think the first behaviour is the right one (the defined one). Can you @marcomass & @sunbrn confirm?

I will modify the code accordingly

sunbrn commented 6 years ago

The documentation says "When regions are stranded, COVER is separately applied to positive and negative strands. In this case, unstranded regions are accounted both as positive and negative."

So why do you say that the first example output should contain a region with + strand and the second example a region with - strand? From what I understand, the third region in the input samples should never be considered (it does not intersect with others), while the first two regions would intersect (thus count to 2, thus be selected by the COVER(2,ANY)) both if they had - strand and if they had + strand. I would expect the output to contain either chr15 10 20 * 2 1 1 or... chr15 10 20 + 2 1 1 chr15 10 20 - 2 1 1 (in case "applied to positive and negative strands" means that there is an actual duplication of unstranded regions)

Anyways: I did not participate to the COVER definition, so I summon @marcomass for this ;)

marcomass commented 6 years ago

Thanks @pp86 for spotting the issue. @sunbrn is right, with the clarification that the first "expected" output provides similar information as the second "expected" output, but without the artifact of duplicating the region. So, the right output is the first one in both cases, i.e. chr15 10 20 2 1 1 I clarify: since the third region: chr15 40 50 + or chr15 40 50 - is not overlapping with the other two, it should not affect the result of the first two regions; therefore, the regions contributing to the result are homogeneous from the strand point of view, and thus, the output covered region must have the same strand of the two contributing ones, i.e. in this example.

The documentation is actually not very clear about it; I'd change it as follow: "When regions are stranded, COVER is separately applied to positive and negative strands. In this case, when also unstranded regions are present, if they contribute to a resulting region together with stranded regions, unstranded regions are accounted both as positive and negative"; is it clearer in this way?

Please make the implementation consistent with this.

Note that usually in the same sample there is not a mix of stranded and unstranded region; do you have an example of such a case?

pp86 commented 6 years ago

What I see in current implementation follows the logic:

  1. if the samples to COVER contain at least a region with + strand, at least a region with - strand and at least a region with strand: all the regions with strand are duplicated (a copy with + and a copy with - strand)
  2. if samples only contain regions with and + strand, then all the strands are changed to +
  3. if samples only contain regions with and - strand, then all the strands are changed to -
  4. if samples only contain regions with * strand, than the strand is unchanged.

Only after this change the cover is applied. So the fact that the last region in the example do not overlap will not change the result, since the strand is "computed" before the evaluation of the overlap.

I totally agree that your interpretation is much more meaningful, but since this is very different from what it is implemented, I wanted to be sure.

marcomass commented 6 years ago

I remember that the choice you describe was taken with Abdulrahman as a first approximation in the light of limiting as much as possible the computational overhead (thus, reducing as less as possible the performance) for a case (mix of stranded and unstranded regions) which was felt to occur very rarely. Although this is very true for input datasets, with current GMQL power datasets generated by a set of GMQL operations can include a mix of stranded and unstranded regions (although I think this is not frequent).

I would keep similar guideline of reasoning: try to implement the most correct solution that can impact as less as possible on most usual cases. The logic that you describe has the great disadvantage that a single stranded region in a dataset of unstranded regions can fully change the result; the worst situation is when 1 strand + and 1 strand - region are in a big dataset of unstranded regions: in this case all unstranded regions are duplicated in + and - strand regions with artifact duplication of the results (besides their arbitrary stranded attribution).

In Summary, the best is if you can find a solution that improves the current logic, particularly avoiding the influence of single regions on whole genome results, with not much overhead, particularly in the most usual case in which only unstranded, or only stranded, regions are in the dataset.

Please @pp86 let me know which is the final implementation, so that I can correctly describe it in the document.

pp86 commented 6 years ago

@marcomass why did you remove the "enhancement" tag? Didn't we decide to postpone this?

marcomass commented 6 years ago

It is not an enhancement; it is a bug that can provide wrong results in some case, as you highlighted.

marcomass commented 6 years ago

We decide to modify the cover implementation as follow:

In summary: When regions are stranded, COVER is separately applied to positive and negative strands, unless also unstranded regions are present; in this case, all regions are considered unstranded only.

marcomass commented 6 years ago

@pp86 @acanakoglu Please upload here the dataset used for testing the issue fix

marcomass commented 6 years ago

@pp86 @acanakoglu Can you please provide the dataset used for testing this issue fix?

acanakoglu commented 6 years ago

You can test any stranded DS by changing one of its line to unstranded. You can use the one below

files2.zip

This query can be used for testing on the above DS: IN = SELECT() testCOVER; IN2 = SELECT(region: NOT(chr == chr9)) testCOVER; OUT = COVER(1,ANY) IN; OUT2 = COVER(1,ANY) IN2; MATERIALIZE OUT INTO OUT; MATERIALIZE OUT2 INTO OUT2;