DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Join of regions based on any attribute #52

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

Also enable joins of regions based on any generic region attribute, not only metric joins based on coordinates.

akaitoua commented 7 years ago

@marcomass, is the goal of this to select the start and stop from the values, or is it just a normal database join on a column, with a join condition?

marcomass commented 7 years ago

@akaitoua I do not completely get your question. It would be like a database join on any region attribute specified by the user, where the result selects samples with all their region attributes, including start and stop.

akaitoua commented 7 years ago

@marcomass, our genometric join is a domain-specific join and is therefore implemented with binning algorithms for better performance.

As I understand it (please correct me if I am wrong), what you are referring to is a normal join operation. Since we have Scala and Python APIs, I really do not see why I should implement such behaviour, which is not specific to genomic operations. Whoever uses the Scala or Python APIs can use a normal Spark join to perform this operation.

marcomass commented 7 years ago

@akaitoua
You are right as far as Scala is concerned. I am not sure about Python, since this would require creating and executing the GMQL DAG, then performing the normal join in Python, and then going back to the GMQL DAG for other operations. In R this would probably be even more difficult. In any case, this would be feasible only for people with a computer science background, while the goal here is to make additional functionality available also to biologists and bioinformaticians through the GMQL language via the web interface, so that the language is self-contained and can express the whole set of operations required to process genomic data and metadata.

Of course, this has to be evaluated as a trade-off between the functionality gained and the difficulty of implementing it: if it only requires normal Spark join operations that are easy to implement, it is probably worth doing as an additional feature. Please also consider that this is tagged as milestone 2.3, so it is something additional that can be discussed and designed first, since it also requires a syntax definition at the compiler level. Please let me know if, after this clarification, you consider it worth going ahead with this.

akaitoua commented 7 years ago

@marcomass, this would be a good feature, but I do not support implementing it. Our main contribution in GMQL is in the domain-specific operations such as the genometric join. Once we start changing the syntax of JOIN to support a normal join, we will lose our contribution and become like an API for Spark. I really do not support this.

akaitoua commented 7 years ago

@marcomass, these are the notes of Prof. Ceri. Do you have any input?

Abdo raises some questions about the join implementation that are of general interest. From Pietro’s email it is clear that we only support equijoin between pairs of columns of the regions. Such an equijoin can occur either with a genometric join or without it. The first case is simple: if it comes after a genometric join, it filters the regions resulting from it; the genometric join occurs on the same chromosome. I leave it as an optimization choice to decide which clause should be executed first, but the simplest solution is to do the genometric join without changes and then filter out the regions that do not satisfy the join condition.
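The "genometric join first, then filter" option can be sketched in plain Python. This is only an illustration of the order of operations: the `Region` type, the `gene` attribute, the `distance` helper, and the nested-loop join are all assumptions made for the example, and the nested loop stands in for the binned implementation used in GMQL.

```python
from typing import NamedTuple

class Region(NamedTuple):
    chrom: str
    start: int
    stop: int
    gene: str  # hypothetical region attribute used for the equijoin

def distance(a, b):
    """Gap between two regions on the same chromosome (<= 0 if they touch or overlap)."""
    return max(a.start, b.start) - min(a.stop, b.stop)

def genometric_join(left, right, max_dist):
    """Naive stand-in for the genometric join: same chromosome and close enough."""
    return [(l, r) for l in left for r in right
            if l.chrom == r.chrom and distance(l, r) <= max_dist]

def filter_on_attribute(pairs, attr):
    """Keep only the joined pairs whose regions agree on the given attribute."""
    return [(l, r) for l, r in pairs if getattr(l, attr) == getattr(r, attr)]

left = [Region("chr1", 100, 200, "TP53"), Region("chr1", 500, 600, "BRCA1")]
right = [Region("chr1", 150, 250, "TP53"),
         Region("chr1", 550, 650, "MYC"),
         Region("chr2", 100, 200, "TP53")]

pairs = genometric_join(left, right, max_dist=0)   # unchanged genometric step
joined = filter_on_attribute(pairs, "gene")        # equijoin applied as a filter
```

Here the genometric step yields two overlapping pairs, and the attribute filter keeps only the pair whose `gene` values match.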

If instead we consider a region join that has no genometric predicate (it can be something like “A.GENE=B.GENE” where A,B are two datasets having GENE in their schema) then we know how to do the join, whose schema is the concatenation of the two schemas. However, if we don’t include CHROM in the join list, then the result may have attributes LEFT.CHROM and RIGHT.CHROM with different values. We solve this problem by taking as resulting regions either the LEFT or the RIGHT projection, but no intersection or union – which are undefined when the chromosomes are distinct.
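A small Python sketch of this second case, a join with no genometric predicate. Regions are modelled as plain dictionaries and `attribute_join` is a hypothetical helper, not part of GMQL; the point is only that matching regions may lie on different chromosomes, so the output is the LEFT or the RIGHT projection.

```python
from collections import defaultdict

def attribute_join(left, right, key, output):
    """Hash equijoin on one attribute, returning the LEFT or RIGHT region of each match."""
    if output not in ("LEFT", "RIGHT"):
        raise ValueError("only LEFT or RIGHT projections are defined here")
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    result = []
    for l in left:
        for r in index.get(l[key], []):
            result.append(l if output == "LEFT" else r)
    return result

A = [{"chrom": "chr1", "start": 100, "stop": 200, "GENE": "TP53"}]
B = [{"chrom": "chr17", "start": 7000, "stop": 7500, "GENE": "TP53"},
     {"chrom": "chr1", "start": 50, "stop": 80, "GENE": "MYC"}]

left_out = attribute_join(A, B, "GENE", "LEFT")    # region taken from A
right_out = attribute_join(A, B, "GENE", "RIGHT")  # region taken from B
```

The single match pairs a chr1 region with a chr17 region, which is why an intersection or union of the two would be undefined.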

So this problem is moved to the compiler: if (a) there is no distal join (which implicitly imposes that CHROM is the same) and (b) CHROM is not an attribute of the join list and (c) the join type is intersection or concatenation, then the compiler says that this join is not legal. Pietro, please confirm. If that is the case, the problem does not occur at the implementation level. Abdo, I think this solves your question.
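The three conditions can be written as a single check. The sketch below is a hypothetical Python rendering of the rule, not the compiler's code: the function name, the parameter names, and the assumption that the INTERSECTION and CONTIG options correspond to "intersection" and "concatenation" are all mine.

```python
def is_legal_join(has_distal_predicate, join_attributes, output, chrom_name="chr"):
    """Return False only when (a), (b) and (c) from the rule above all hold."""
    # (a) and (b): is the chromosome guaranteed to be equal on both sides?
    same_chrom_guaranteed = has_distal_predicate or chrom_name in join_attributes
    # (c): does the requested output combine coordinates from both regions?
    combines_coordinates = output in ("INTERSECTION", "CONTIG")
    return same_chrom_guaranteed or not combines_coordinates

checks = [
    is_legal_join(False, ["GENE"], "INTERSECTION"),         # illegal
    is_legal_join(False, ["chr", "GENE"], "INTERSECTION"),  # legal: chr in join list
    is_legal_join(True, ["GENE"], "CONTIG"),                # legal: distal predicate
    is_legal_join(False, ["GENE"], "LEFT"),                 # legal: projection only
]
```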

In general we are left with the problem of duplicates, e.g. those resulting from all regions that are identical in the left (or right) but come from different tuples before projection (with or without the same chromosome). Marco would like them to be removed in V2.1, as he says we already had them in V1, and I know he discussed this with Pietro.

We could say LEFT DISTINCT or RIGHT DISTINCT to remove them. This means adding DISTINCT only as an option of RIGHT and LEFT, with the semantics that LEFT DISTINCT will produce only one tuple (with the LEFT schema) out of the many tuples resulting from the join which have the same LEFT and different RIGHT.
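As an illustration of these semantics, here is a minimal Python sketch in which join results are `(left, right)` pairs of tuples. The `left_distinct` helper is hypothetical; it deduplicates by value, so identical left regions coming from different joined tuples collapse into one.

```python
def left_distinct(pairs):
    """Project joined pairs on the left region, keeping each distinct region once."""
    seen = set()
    result = []
    for left_region, _right_region in pairs:
        if left_region not in seen:
            seen.add(left_region)
            result.append(left_region)
    return result

l1 = ("chr1", 100, 200, "TP53")
l2 = ("chr1", 500, 600, "MYC")
pairs = [
    (l1, ("chr1", 150, 250, "TP53")),
    (l1, ("chr17", 7000, 7500, "TP53")),  # same LEFT, different RIGHT
    (l2, ("chr8", 900, 950, "MYC")),
]

plain_left = [l for l, _ in pairs]    # LEFT: one output per joined pair
distinct_left = left_distinct(pairs)  # LEFT DISTINCT: duplicates removed
```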

akaitoua commented 7 years ago

@marcomass, the issue is solved. In addition to LEFT, RIGHT, INTERSECTION, and CONTIG, I added LEFT_DISTINCT, RIGHT_DISTINCT, BOTH_LEFT, BOTH_LEFT_DISTINCT, BOTH_RIGHT, and BOTH_RIGHT_DISTINCT, where:

BOTH_LEFT keeps both regions: the left region is in the coordinates position and the right region's coordinates are kept as attribute values.

BOTH_RIGHT keeps both regions: the right region is in the coordinates position and the left region's coordinates are kept as attribute values.
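To make the BOTH_LEFT and BOTH_RIGHT layouts concrete, here is a rough Python sketch with regions as dictionaries. The `right.` and `left.` prefixes and the choice to carry over every field of the secondary region are assumptions for illustration; the thread does not specify how the extra attributes are named.

```python
def _both(primary, secondary, prefix):
    """Primary region keeps the coordinate position; secondary fields become attributes."""
    out = dict(primary)
    for name, value in secondary.items():
        out[f"{prefix}.{name}"] = value  # hypothetical naming scheme
    return out

def both_left(left, right):
    return _both(left, right, "right")

def both_right(left, right):
    return _both(right, left, "left")

l = {"chr": "chr1", "start": 100, "stop": 200, "strand": "+", "score": 5}
r = {"chr": "chr2", "start": 900, "stop": 950, "strand": "-", "score": 5}

bl = both_left(l, r)   # coordinates from l, r's coordinates stored as attributes
br = both_right(l, r)  # coordinates from r, l's coordinates stored as attributes
```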

API, example:

ds1.JOIN(None, List[JoinQuadruple](), RegionBuilder.BOTH_LEFT_DISTINCT, join_on_attributes = Some(List((0, 0))), right_dataset = ds2)

pp86 commented 7 years ago

I committed the changes to allow this in the compiler.

The syntax is: B = JOIN(output: right; on_attributes: chr,score) A A;

Notice that the attributes must have the same name in both input datasets. If this is not the case, the attributes have to be renamed by a preliminary PROJECT.

pp86 commented 7 years ago

@akaitoua please notice that the query:

A = SELECT(parser: bedscoreparser) /Users/pietro/Desktop/test_gmql/input3;
B = JOIN(output: right; on_attributes: chr,score) A A;

gives an error at runtime. If only score (no chr) is specified, the execution succeeds.

Input dataset: only one sample

chr1    12000   33000   0
chr2    12000   33000   0
chr1    12000   33000   1

akaitoua commented 7 years ago

Thanks, @pp86, for the note. I did not consider chr, start, or stop. I will include them in the join.
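For reference, a plain equijoin over the three-region sample above shows what each attribute list would be expected to match once coordinates are allowed in the key. This is an ordinary Python self-join written for illustration (the `FIELDS` mapping and `equi_join` helper are hypothetical); it does not reproduce the runtime error or the GMQL engine.

```python
from itertools import product

# The single-sample input dataset from the report: chr, start, stop, score.
regions = [
    ("chr1", 12000, 33000, 0),
    ("chr2", 12000, 33000, 0),
    ("chr1", 12000, 33000, 1),
]
FIELDS = {"chr": 0, "start": 1, "stop": 2, "score": 3}

def equi_join(left, right, on):
    """Return all (left, right) pairs that agree on every field named in `on`."""
    positions = [FIELDS[name] for name in on]

    def key(region):
        return tuple(region[i] for i in positions)

    return [(l, r) for l, r in product(left, right) if key(l) == key(r)]

on_score = equi_join(regions, regions, ["score"])
on_chr_score = equi_join(regions, regions, ["chr", "score"])
print(len(on_score), len(on_chr_score))  # 5 3
```

Joining on score alone matches the two score-0 regions with each other, giving five pairs; adding chr to the key leaves each region matching only itself.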

marcomass commented 7 years ago

@akaitoua I think all attributes and coordinates should be included, including strand.

marcomass commented 7 years ago

@akaitoua Please update your comment above, specifying only which output options are implemented.
Is BOTH implemented? It is not recognized at the compiler level.

akaitoua commented 7 years ago

@marcomass, the final implementation, after a series of discussions, contains LEFT_DISTINCT, RIGHT_DISTINCT, and BOTH. @pp86, would you add BOTH to the compiler? All attributes and coordinates, including the strand, are now considered in the implementation of the equijoin.

marcomass commented 7 years ago

@pp86 I am reopening this issue, as you asked, to add BOTH to the compiler.

pp86 commented 7 years ago

@marcomass thanks