marcomass closed this issue 6 years ago
The issue regarding DLE(-1) or DIST > -3 has been fixed (as confirmed by andreagulino) by correcting the definition in the language definition document.
What remains is to complete the metric operations that can be performed, as other competitors (e.g. START) do. This also includes defining predicates to identify regions that coincide, where one contains or is contained in another, or where one is a prefix or suffix of another (i.e., is contained in and adjacent to the end of another region).
@marcomass it is not easy for me to implement the new features (metric operations / predicates) in a short time, both because I am not familiar with changing the implementation of operators and because we need to check whether the implementation of each feature you suggested is compatible with our binning theory. @akaitoua should know more about what can be done with the implementation we have at the moment to include these new features.
akaitoua pointed out that the implementation of condition DLE(-n) is incomplete for different (negative) values of n. @akaitoua, please complete the implementation so that DLE(-n) and DL(-n) are correctly managed for all possible values of n, according to the distance definition for overlapping regions.
The genomic distance is defined as the number of base pairs (i.e., nucleotides) between the closest opposite ends of two regions belonging to the same chromosome, measured from the right end of the region whose left end has the lower coordinate. DL(0) selects experiment regions overlapping an anchor region, regardless of the amount of overlap. DLE(n) and DL(n) with n < 0 search for experiment regions which overlap the anchor region and have a genomic distance (as defined above) of at most (or less than) n bp from the anchor region.
The algorithm should probably be as follows:
While implementing this, please apply the analogous fix to DG(n) and DGE(n) with n < 0, so that we can then allow this at the compiler level as well.
Unfortunately this is not fixed; in case of n < 0 issues still remain. I reopen it. Please use the following query and datasets for testing, where RES3 should include a sample with 2 single base regions and another sample with 1 region of 21 bases.
D1 = SELECT(region: chr == chr2) Example_Dataset_1t;
D2 = SELECT(region: chr == chr2) Example_Dataset_2t;
RES3 = JOIN(DLE(-30); output: INT; joinby: cell_karyotype) D1 D2;
MATERIALIZE RES3 INTO join_9;
Example_Dataset_2t.zip Example_Dataset_1t.zip
We (@lucananni93, @sunbrn, @acanakoglu, @andreagulino) discussed this and wrote the following definition. The distance between regions r1 and r2 is defined as follows: the genomic distance is the number of base pairs (i.e., nucleotides) between the closest opposite ends of two regions belonging to the same chromosome (note: this quantity is always positive). If the two regions overlap, the distance is returned as the negative of that quantity.
Then, DLE(n) is evaluated in the same way for both negative and positive values of n.
In the example you reported, the result is made up of only one region:
chr2 199 240 * GMQL Region 9 . . 22 19.6922 -1 -1 GMQL Region 12 . . 20 -1 -1 -1
for which the input regions are:
chr2 129 350
chr2 199 240
and therefore the distance is -min(240 - 129, 350 - 199) = -min(111, 151) = -111
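As a rough illustration of the definition above, here is a minimal Python sketch that reproduces the -111 of this example. The `Region` type and the `genomic_distance` helper are hypothetical names, not part of GMQL, and the overlap test assumes half-open coordinates (regions that merely touch are not treated as overlapping), which the thread does not state explicitly.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    stop: int


def genomic_distance(r1: Region, r2: Region) -> int:
    """Signed distance under the revised definition (hypothetical helper)."""
    if r1.chrom != r2.chrom:
        raise ValueError("distance is defined only within one chromosome")
    # Magnitude: the smaller of the two gaps between opposite ends.
    magnitude = min(abs(r1.stop - r2.start), abs(r2.stop - r1.start))
    # Assumption: half-open coordinates, so touching regions do not overlap.
    overlapping = r1.start < r2.stop and r2.start < r1.stop
    # Overlapping regions get the negated magnitude.
    return -magnitude if overlapping else magnitude


anchor = Region("chr2", 129, 350)
experiment = Region("chr2", 199, 240)
print(genomic_distance(anchor, experiment))  # -111
```

For two disjoint regions the same function returns the positive gap between them, so a single comparison against n covers both signs.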
To get regions with relative distance in the range (-30, 0), one should use: DLE(0), DGE(-30)
@pp86, @lucananni93, @sunbrn, @acanakoglu, @andreagulino OK for changing the definition and the consequent evaluation of DLE(n), if this satisfies all requirements. With this change, is the result for the test example I posted above the one I mentioned, i.e., does RES3 include a sample with 2 single-base regions and another sample with 1 region of 21 bases?
Regarding the way you propose to extract regions with relative distance in the range (-30, 0), did you verify that it is actually possible and gives the correct result?
According to the current documentation, DGE(n) accepts only non-negative values of n. Furthermore, even if DGE(n) accepted negative values of n, according to the definition of DGE(n) I think DGE(-30) looks for regions at a distance greater than 30 bases, i.e. not with relative distance in the range (-30, 0). What would be the correct predicate to extract overlapping regions with a distance in the range (-30, 0)?
Yes, the two single-base regions are extracted as well:
chr2 399 400 * GMQL Region 0.9 . . 15 15.1452 -1 -1 GMQL Region 5 . . 9 10.8387 -1 -1
chr2 539 540 * GMQL Region 0.9 . . 15 15.1452 -1 -1 GMQL Region 0.8 . . 9 4.65281 -1 -1
We confirm that in order to extract regions in the range (-30, 0) one should do: JOIN(DLE(0), DGE(-30))
DGE(-30) will accept all the region pairs whose relative distance is at least -30.
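To make the combination concrete, the sketch below treats DLE and DGE as plain inclusive comparisons on an already computed signed distance. This is only an illustration of the semantics described here, not the GMQL implementation; `dle`, `dge`, and `satisfies_all` are hypothetical helpers, and the sample distances are made up apart from -111, which comes from the example above.

```python
def dle(n: int):
    """DLE(n): accept distances less than or equal to n."""
    return lambda d: d <= n


def dge(n: int):
    """DGE(n): accept distances greater than or equal to n."""
    return lambda d: d >= n


def satisfies_all(distance: int, *predicates) -> bool:
    # A pair is kept only if every genometric predicate accepts it.
    return all(p(distance) for p in predicates)


# Signed distances of some hypothetical region pairs.
distances = [-111, -30, -9, 0, 15]
selected = [d for d in distances if satisfies_all(d, dle(0), dge(-30))]
print(selected)  # [-30, -9, 0]
```

Since both predicates are inclusive in this sketch, the selected interval is closed at both ends; the pair at -111 passes DLE(-30) but is rejected by DGE(-30).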
@pp86 OK, so does DGE(n) now also accept negative values of n?
Then, even with the new definition of relative distance, the distance of overlapping regions is considered negative, so that "relative distance is at least -30" means a distance >= -30, i.e. <= abs(-30) in magnitude, i.e. less overlapped. Is that so?
And does this also apply to DL(n) with n < 0? I.e., does DL(-20) mean distance < -20, i.e. > abs(-20) in magnitude, i.e. more overlapped?
@sunbrn All this must be made clear in the documentation ...
Yes, DGE also accepts a negative parameter.
About your second point: although this is often true, it is not necessarily so. For example, consider the anchor region 1 1000.
Then the region 5 10 has distance -9 and the region 105 110 has distance -109, but they both overlap by 5 bases.
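The point can be checked with a small Python sketch that computes both quantities for the two regions above. `signed_distance` and `overlap_length` are hypothetical helpers written for this comment, assuming half-open (start, stop) coordinates.

```python
def signed_distance(a, b):
    """Closest opposite-end distance, negated when the regions overlap."""
    (a_start, a_stop), (b_start, b_stop) = a, b
    magnitude = min(abs(a_stop - b_start), abs(b_stop - a_start))
    overlapping = a_start < b_stop and b_start < a_stop
    return -magnitude if overlapping else magnitude


def overlap_length(a, b):
    """Number of bases shared by the two regions (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


anchor = (1, 1000)
for region in [(5, 10), (105, 110)]:
    print(region, signed_distance(anchor, region), overlap_length(anchor, region))
# (5, 10) -9 5
# (105, 110) -109 5
```

So a more negative distance says how deep inside the anchor a region lies, not how many bases it shares with it.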
Fix the implementation of genometric predicates, based on what Andrea highlighted, including the cases with negative distance values (e.g. DLE(-1) or DIST > -3), and complete the metric operations that can be performed, as other competitors (e.g. START) do.