Genometric predicates completion and fixing

marcomass commented 7 years ago

Fix the implementation of the genometric predicates, based on what Andrea highlighted, including the cases with negative distance values (e.g. DLE(-1) or DIST > -3), and complete the set of metric operations that can be performed, as competing tools (e.g. START) do.

marcomass commented 7 years ago

This also includes defining predicates to identify regions that coincide, regions where one contains or is contained in another, and regions where one is a prefix or suffix of another (i.e., is contained in and adjacent to the end of another region).

marcomass commented 7 years ago

The issue regarding DLE(-1) or DIST > -3 has been fixed (as confirmed by andreagulino) by correcting the definition in the language definition document.

What remains is to complete the metric operations that can be performed, as competing tools (e.g. START) do. This also includes defining predicates to identify regions that coincide, regions where one contains or is contained in another, and regions where one is a prefix or suffix of another (i.e., is contained in and adjacent to the end of another region).

andreagulino commented 7 years ago

@marcomass it is not easy for me to implement the new features (metric operations / predicates) in a short time, both because I am not familiar with changing the implementation of operators and because we need to check whether the implementation of each feature you suggested is compatible with our binning theory. @akaitoua should know more about what can be done on the current implementation to include these new features.

marcomass commented 7 years ago

akaitoua pointed out that the implementation of the condition DLE(-n) is incomplete for different (negative) values of n. @akaitoua, please complete the implementation so that DLE(-n) and DL(-n) are handled correctly for all possible values of n, according to the distance definition for overlapping regions.

The genomic distance is defined as the number of base pairs (i.e., nucleotides) between the closest opposite ends of two regions belonging to the same chromosome, measured from the right end of the region whose left end has the lower coordinate. DL(0) selects experiment regions overlapping an anchor region, regardless of the amount of overlap. DLE(n) and DL(n) with n < 0 search for experiment regions which overlap the anchor region and have a genomic distance (as defined above) of at most (or less than) n bp from the anchor region.

The algorithm should probably be as follows:

In the case of DL(n) and DLE(n) with n < 0, for any pair of anchor and experiment regions: if the region distance is < 0 (i.e. they overlap) and the absolute value of the distance is less than (or equal to) the absolute value of n, then select the regions as a valid region pair.

While implementing this, please also fix DG(n) and DGE(n) with n < 0 analogously, so that we can then allow this at the compiler level as well.
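The rule proposed above for DL(n) and DLE(n) with n < 0 can be sketched in Python as follows. This is only an illustration, not the GMQL implementation: the function names, the (start, stop) tuple representation, and the assumption that touching regions have distance 0 are all hypothetical.

```python
def signed_distance(a, b):
    """Distance between two regions (start, stop) on the same chromosome,
    measured from the right end of the region with the lower left end.
    Negative when the regions overlap (assumed coordinate convention)."""
    first, second = sorted([a, b])  # region with the lower left end comes first
    return second[0] - first[1]


def matches_negative_threshold(anchor, experiment, n, inclusive):
    """Proposed rule for DLE(n) (inclusive=True) and DL(n) (inclusive=False)
    with n < 0: the pair is valid only if the regions overlap and the
    absolute distance is within the absolute value of n."""
    if n >= 0:
        raise ValueError("this sketch only covers n < 0")
    d = signed_distance(anchor, experiment)
    if d >= 0:  # regions do not overlap
        return False
    return abs(d) <= abs(n) if inclusive else abs(d) < abs(n)


# Hypothetical regions: the overlap gives a distance of -10.
print(matches_negative_threshold((100, 200), (190, 300), -30, inclusive=True))
```

Under this rule a more negative n admits more deeply overlapping pairs, which is the opposite reading of the plain numeric comparison discussed later in the thread.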

marcomass commented 6 years ago

Unfortunately this is not fixed; for n < 0, issues still remain, so I am reopening it. Please use the following query and datasets for testing, where RES3 should include a sample with 2 single-base regions and another sample with 1 region of 21 bases.

D1 = SELECT(region: chr == chr2) Example_Dataset_1t;
D2 = SELECT(region: chr == chr2) Example_Dataset_2t;
RES3 = JOIN(DLE(-30); output: INT; joinby: cell_karyotype) D1 D2;
MATERIALIZE RES3 INTO join_9;

Example_Dataset_2t.zip Example_Dataset_1t.zip

pp86 commented 6 years ago

We ( @lucananni93 , @sunbrn , @acanakoglu , @andreagulino ) discussed this and wrote the following definition. The distance between two regions r1 and r2 is defined as follows: the genomic distance is the number of base pairs (i.e., nucleotides) between the closest opposite ends of two regions belonging to the same chromosome (note: this quantity is always positive). If the two regions overlap, the distance is returned as the negative of the quantity defined above.

Then, DLE(n) is evaluated in the same way for both negative and positive values of n.

In the example you reported, the result consists of only one region:

chr2 199 240 * GMQL Region 9 . . 22 19.6922 -1 -1 GMQL Region 12 . . 20 -1 -1 -1

for which the input regions are:

chr2 129 350
chr2 199 240

and therefore the distance is -111 = -min(240 - 129 = 111, 350 - 199 = 151).

To get regions with relative distance in the range (-30, 0), one should use: DLE(0), DGE(-30)
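A minimal Python sketch of this definition, reproducing the -111 computed above. The function names and the (start, stop) tuple representation are hypothetical, and treating touching regions as non-overlapping with distance 0 is an assumption, not something stated in the thread.

```python
def genomic_distance(a, b):
    """Distance between two regions (start, stop) on the same chromosome.
    Non-overlapping: gap between the closest opposite ends.
    Overlapping: negative of the smaller span between opposite ends."""
    a_start, a_stop = a
    b_start, b_stop = b
    if a_start < b_stop and b_start < a_stop:  # regions overlap
        return -min(b_stop - a_start, a_stop - b_start)
    return max(b_start - a_stop, a_start - b_stop)


def dle(a, b, n):
    """DLE(n) as a plain comparison, evaluated the same way for any sign of n."""
    return genomic_distance(a, b) <= n


print(genomic_distance((129, 350), (199, 240)))  # -111
print(dle((129, 350), (199, 240), -30))          # True, since -111 <= -30
```

With this reading, DLE(-30) keeps the pair because its distance is below -30, which matches the single region reported in the result.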

marcomass commented 6 years ago

@pp86 , @lucananni93 , @sunbrn , @acanakoglu , @andreagulino Ok for changing the definition and the consequent evaluation of DLE(n), if this satisfies all requirements. In so doing, is the result for the testing example I posted above the one I mentioned, i.e., does RES3 include a sample with 2 single-base regions and another sample with 1 region of 21 bases?

Regarding the way you propose to extract regions with relative distance in the range (-30, 0), did you verify that it is actually possible and provides the correct result?

As far as defined in the current documentation, DGE(n) accepts only non-negative values of n. Furthermore, even if DGE(n) accepted negative values of n, according to the definition of DGE(n) I think DGE(-30) looks for regions at a distance greater than 30 bases, i.e. not with relative distance in the range (-30, 0). What would be the correct predicate to extract overlapping regions with a distance in the range (-30, 0)?

pp86 commented 6 years ago

Yes, the two single-base regions are extracted as well:

chr2    399 400 *   GMQL    Region  0.9 .   .   15  15.1452 -1  -1  GMQL    Region  5   .   .   9   10.8387 -1  -1
chr2    539 540 *   GMQL    Region  0.9 .   .   15  15.1452 -1  -1  GMQL    Region  0.8 .   .   9   4.65281 -1  -1

We confirm that in order to extract regions in the range (-30, 0) one should do: JOIN(DLE(0), DGE(-30))

DGE(-30) will accept all the region pairs whose relative distance is at least -30.
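As a rough illustration of how the two conditions combine, the sketch below applies them to a few signed distances. The predicate table and the `satisfies` helper are hypothetical names, the DL and DG entries assume strict comparisons, and the distance values other than -111 are made up for the example.

```python
import operator

# Assumed semantics: each predicate is a plain comparison on the signed distance.
PREDICATES = {
    "DL": operator.lt,
    "DLE": operator.le,
    "DG": operator.gt,
    "DGE": operator.ge,
}


def satisfies(distance, conditions):
    """True if the distance meets every (predicate name, n) condition."""
    return all(PREDICATES[name](distance, n) for name, n in conditions)


conditions = [("DLE", 0), ("DGE", -30)]
distances = [-111, -30, -9, 0, 12]  # illustrative signed distances
selected = [d for d in distances if satisfies(d, conditions)]
print(selected)  # [-30, -9, 0]
```

Note that under these assumptions DLE and DGE include both endpoints, so the selected range is closed; excluding -30 and 0 would need the strict DL and DG forms.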

marcomass commented 6 years ago

@pp86 Ok, so does DGE(n) now also accept negative values of n?

Then, despite the new definition of relative distance, the distance of overlapping regions is considered a negative distance, so that "relative distance is at least -30" means a distance >= -30, i.e. < abs(-30), i.e. less overlapped. Is that so?

And does this also apply to DL(n) with n < 0? I.e., does DL(-20) mean distance < -20, i.e. > abs(-20), i.e. more overlapped?

@sunbrn All this must be made clear in the documentation ...

pp86 commented 6 years ago

Yes, DGE also accepts a negative parameter.

About your second point: although this is often true, it does not necessarily hold. For example, consider the anchor region 1 1000.

Then the region 5 10 has distance -9 and the region 105 110 has distance -109, but they both overlap the anchor by 5 bases.
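The counterexample can be checked with a short Python sketch that computes both the negative distance and the overlap width for each region. The helper names and the (start, stop) tuples are hypothetical, and the distance helper only covers the overlapping case of the definition given earlier in the thread.

```python
def overlap_width(a, b):
    """Number of bases shared by two regions (start, stop); 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_distance(a, b):
    """Negative distance for two overlapping regions: the negative of the
    smaller span between opposite ends."""
    return -min(b[1] - a[0], a[1] - b[0])


anchor = (1, 1000)
for region in [(5, 10), (105, 110)]:
    # Prints the region, its distance from the anchor, and the shared bases.
    print(region, overlap_distance(anchor, region), overlap_width(anchor, region))
# (5, 10) -9 5
# (105, 110) -109 5
```

The two regions share the same 5 bases with the anchor yet differ by 100 in distance, so a more negative distance does not always mean a larger overlap.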

DEIB-GECO / GMQL

Genometric predicates completion and fixing #51