DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Enable use of metadata attributes in predicates on region attributes #15

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

Enable the use of metadata attributes in predicates on region attributes, e.g. B = SELECT(region: AccIndex == maxCount) A;
- In the case of a multivalued metadata attribute, we’ll pick one value at random (we are aware that this makes the query non-deterministic).
- The implementation requires changes at all levels (compiler, DAG, execution).
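
A minimal Scala sketch of the SELECT semantics described above, for illustration only. The names Region, Metadata, pickOne and selectEq are hypothetical stand-ins, not GMQL classes. Treating a missing or non-numeric metadata value as "no region matches" is also an assumption.

```scala
import scala.util.{Random, Try}

object MetaPredicateSketch {
  // Hypothetical stand-ins for a region and for the metadata of one sample.
  final case class Region(attributes: Map[String, Double])
  type Metadata = Map[String, List[String]]

  // Pick one value of a possibly multivalued metadata attribute.
  // With more than one value the result depends on the random generator,
  // which is the source of the non-determinism.
  def pickOne(meta: Metadata, name: String, rnd: Random): Option[String] =
    meta.get(name).filter(_.nonEmpty).map(vs => vs(rnd.nextInt(vs.length)))

  // Keep the regions whose attribute equals the picked metadata value,
  // mimicking SELECT(region: regionAttr == metaAttr).
  def selectEq(regions: List[Region], meta: Metadata,
               regionAttr: String, metaAttr: String, rnd: Random): List[Region] = {
    val picked = pickOne(meta, metaAttr, rnd).flatMap(s => Try(s.toDouble).toOption)
    picked match {
      case Some(v) => regions.filter(_.attributes.get(regionAttr).contains(v))
      case None    => Nil // assumption: nothing matches if the value is unusable
    }
  }
}
```

With a single-valued attribute the result is deterministic; with several values, two runs with different seeds may select different regions.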

marcomass commented 7 years ago

The following case should also be supported: S2 = PROJECT(region_update: new_region_attribute AS g; metadata_update: new_metadata_attribute AS g) S1; where g is a metadata attribute or a function of metadata attributes.

pp86 commented 7 years ago

@akaitoua @marcomass I checked, and we have a metadata accessor for the SELECT (it.polimi.genomics.core.DataStructures.RegionCondition.MetaAccessor) but not for the PROJECT.

Furthermore, the DAG operator for the region projection does not take the metadata as input; therefore, for now, it is not possible to implement this at compile time.

akaitoua commented 7 years ago

@marcomass, since Pietro mentioned that "it is not possible to implement this at compile time", should we close this? @pp86, I would implement it from the DAG down, if that does not cause a problem for the compiler. The user would then get a run-time error if the attribute "g" (as in the example above) is not found in the metadata.

marcomass commented 7 years ago

@pp86 @akaitoua this issue does not explicitly refer to the compiler or to the implementation, but just to the functionality. I would close it when the functionality is available.

I do not get the meaning of your second comment (about g). I could be wrong, but I think that if g is not found, then in the case "metadata_update: new_metadata_attribute AS g" nothing would happen in the metadata, whereas in the case "region_update: new_region_attribute AS g" the new_region_attribute is created with a null value. In either case no run-time error is generated (these usually create problems), just a warning in the log. What do you think?
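
The behaviour proposed above (null plus a log warning instead of a run-time error) could be sketched as follows. This is only an illustration in Scala: Value, Str, Null, regionValueFrom and metadataUpdate are hypothetical names, and the warning is returned to the caller rather than written to a real logger.

```scala
object MissingMetaSketch {
  // Hypothetical value model, not the GMQL GValue hierarchy.
  sealed trait Value
  final case class Str(v: String) extends Value
  case object Null extends Value

  // region_update: new_region_attribute AS g
  // If g is missing, the attribute is still created, with a null value.
  def regionValueFrom(meta: Map[String, String], g: String): (Value, Option[String]) =
    meta.get(g) match {
      case Some(v) => (Str(v), None)
      case None    => (Null, Some(s"metadata attribute '$g' not found; region attribute set to null"))
    }

  // metadata_update: new_metadata_attribute AS g
  // If g is missing, the metadata is left untouched.
  def metadataUpdate(meta: Map[String, String], newAttr: String, g: String): (Map[String, String], Option[String]) =
    meta.get(g) match {
      case Some(v) => (meta + (newAttr -> v), None)
      case None    => (meta, Some(s"metadata attribute '$g' not found; metadata left unchanged"))
    }
}
```

In both cases the caller receives a result and an optional warning message, so nothing is thrown.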

akaitoua commented 7 years ago

@marcomass, I agree on adding warnings to the log and null values to the data. @pp86, if you agree we can start implementing it. I can do as we did for the last COVER issue: I will create a branch, fix the problem on my side, and then provide you with the signatures for the compiler. What do you think?

pp86 commented 7 years ago

@akaitoua, I agree with what you proposed. I think the correct way to do that is to first implement the DAG and the engine(s) and then plug in the compiler.

As for the signature, please choose whatever works best on your side.

akaitoua commented 7 years ago

I implemented projection on regions that extends the columns with values taken from the metadata (directly from the metadata, based on IDs) or computed by an aggregation function on region columns that includes one or more metadata attributes.

I updated the Scala API and the documentation.

If the value in the metadata can be cast to Double, it is cast; otherwise it is treated as a String. In case of a missing value I added NULL.

example:

    val fun = new RegionExtension {
      // Sum region column 0 and the "score" metadata value; return GNull unless both are GDouble.
      override val fun: (Array[GValue]) => GValue = { x =>
        if (x(0).isInstanceOf[GDouble] && x(1).isInstanceOf[GDouble])
          GDouble(x(0).asInstanceOf[GDouble].v + x(1).asInstanceOf[GDouble].v)
        else GNull()
      }
      override val inputIndexes: List[Any] = List(0, MetaAccessor("score"))
    }
    dataAsTheyAre.PROJECT(None, extended_values = Some(List(fun)))

pp86 commented 7 years ago

@akaitoua it is better to also set NULL in the cases where the value cannot be cast; otherwise we could end up with a dataset in which some samples contain a double and others a string.
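
To make the difference between the two policies concrete, here is a small Scala sketch. The names Value, Dbl, Str, Null, lenient and strict are hypothetical and do not correspond to the GMQL GValue classes.

```scala
import scala.util.Try

object CastPolicySketch {
  // Hypothetical value model, not the GMQL GValue hierarchy.
  sealed trait Value
  final case class Dbl(v: Double) extends Value
  final case class Str(v: String) extends Value
  case object Null extends Value

  // Policy as implemented: Double when the text parses, String otherwise,
  // Null when the metadata value is missing.
  def lenient(raw: Option[String]): Value = raw match {
    case None => Null
    case Some(s) =>
      Try(s.trim.toDouble).toOption match {
        case Some(d) => Dbl(d)
        case None    => Str(s)
      }
  }

  // Policy suggested in this comment: Null whenever the text does not parse,
  // so that the resulting column holds a single type.
  def strict(raw: Option[String]): Value = raw match {
    case None => Null
    case Some(s) =>
      Try(s.trim.toDouble).toOption match {
        case Some(d) => Dbl(d)
        case None    => Null
      }
  }
}
```

Applied to two samples whose metadata values are "3.5" and "n/a", lenient yields one double and one string for the same column, while strict yields a double and a null.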

Did you (or will you) push to master? I will work on the compiler side.

akaitoua commented 7 years ago

I pushed to master. If I treat everything as Double, then Marco cannot copy a metadata attribute into a region column unless it is a double. I do not think Marco wants this, right @marcomass?

marcomass commented 7 years ago

@akaitoua @pp86 Sorry, I'm not able to follow everything you wrote. I think the answer is right. Anyway, the examples are at the beginning of this thread. There are two cases, one for SELECT and one for PROJECT:

1- SELECT: e.g. B = SELECT(region: region_attribute_name OP metadata_attribute_name) A; where OP can be, as usual, <, <=, ==, >=, >, also depending on the type (string or double) of region_attribute_name. Thus, I think that here the cast is performed or not depending on the type of region_attribute_name.

2- PROJECT: e.g. S2 = PROJECT(region_update: new_region_attribute AS g) S1; where g is a metadata attribute or a function of metadata attributes. Here, if I'm not wrong, if g is a function (of metadata attributes), the function itself defines the type of new_region_attribute. If g is a metadata attribute, there could be the issue mentioned by Pietro. I do not see a unique solution to this case, other than perhaps making string the default type of new_region_attribute unless specified otherwise in the GMQL statement, e.g. S2 = PROJECT(region_update: new_region_attribute[double] AS g) S1; What do you think?

What is our standard for null? I think that in the UNION with different schemas implemented by Olya it is null (lower case), not NULL. If this has not been done yet, can a constant be defined for the null label and used consistently in all implementation classes, instead of a label redefined in each class? By the way, it should be used only for double attributes, while for strings it should be an empty string, right?

Here I didn't mention the operation on metadata, i.e., S2 = PROJECT(metadata_update: new_metadata_attribute AS g) S1; since it applies to the other issue, right?

OlgaGorlova commented 7 years ago

Hi @pp86 , I created the DefaultMetaExtensionFactory and committed it to master. You can use it now. Note: please use the MetaExtension trait instead of MetaAggregateStruct if needed. Let me know if you have any issues.

pp86 commented 7 years ago

@marcomass

Point 1 is already possible with the following syntax, e.g.:

T = SELECT(region: score > META(avg_score)) GRCh38_ENCODE_BROAD_MAY_2017;

marcomass commented 7 years ago

@pp86 Please also support point 2, i.e. the case: S2 = PROJECT(region_update: new_region_attribute AS g; metadata_update: new_metadata_attribute AS g) S1; where g is a metadata attribute or a function of metadata attributes.

At the compiler level, the specification of the type of the new region attribute must be included (default: string if g is a constant value; if g is a function of metadata attributes, the type returned by that function),
e.g. S2 = PROJECT(region_update: new_region_attribute[double] AS g) S1;

If g is null (constant or function output), output should be set accordingly (null for double, empty string otherwise).
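
The typing and null rules requested above can be sketched in Scala as follows. AttrType, declaredType and render are hypothetical names used only for illustration, and falling back to string for an unrecognised annotation is an assumption.

```scala
object TypedNullSketch {
  // Hypothetical attribute types for a new region attribute.
  sealed trait AttrType
  case object DoubleType extends AttrType
  case object StringType extends AttrType

  // Type taken from an optional annotation such as new_region_attribute[double].
  // Without an annotation the default is string.
  def declaredType(annotation: Option[String]): AttrType =
    annotation.map(_.toLowerCase) match {
      case Some("double") => DoubleType
      case _              => StringType // assumption: anything else is treated as string
    }

  // Textual output of a value: a missing value becomes "null" for a double
  // attribute and an empty string otherwise.
  def render(value: Option[String], tpe: AttrType): String = (value, tpe) match {
    case (Some(v), _)       => v
    case (None, DoubleType) => "null"
    case (None, StringType) => ""
  }
}
```

For example, a missing value rendered for new_region_attribute[double] gives "null", while the same missing value for an attribute without annotation gives an empty string.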

pp86 commented 7 years ago

Reading a metadata attribute into a region field.

The syntax for doing so is: new_field as META(attribute, TYPE), e.g.: age2 as META(age, INTEGER)

NB: the META accessor cannot be combined with operators, i.e.

age2 as META(age, INTEGER) + 10
age2 as META(age, INTEGER) + score
age2 as META(age, INTEGER) * META(factor, DOUBLE)

are NOT VALID. Note that all of the above can be obtained with two consecutive PROJECTs: the first loads the metadata value(s) into region field(s), the second applies the arithmetic function.
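
The two-step workaround can be modelled with a short Scala sketch. It does not use the GMQL API: Region, loadMeta and addConstant are hypothetical, all numbers are modelled as Double, and leaving regions unchanged when the metadata value is missing or non-numeric is a simplifying assumption.

```scala
import scala.util.Try

object TwoStepProjectSketch {
  // Hypothetical region model: attribute name -> numeric value.
  type Region = Map[String, Double]

  // Step 1: copy a metadata value into a new region field,
  // in the spirit of: age_tmp AS META(age, INTEGER).
  def loadMeta(regions: List[Region], meta: Map[String, String],
               metaAttr: String, field: String): List[Region] =
    meta.get(metaAttr).flatMap(s => Try(s.toDouble).toOption) match {
      case Some(v) => regions.map(r => r + (field -> v))
      case None    => regions // assumption: nothing is added if the value is unusable
    }

  // Step 2: arithmetic on region fields only,
  // in the spirit of: age2 AS age_tmp + 10.
  def addConstant(regions: List[Region], source: String,
                  target: String, k: Double): List[Region] =
    regions.map(r => r.get(source).fold(r)(v => r + (target -> (v + k))))
}
```

Chaining loadMeta and addConstant reproduces the effect of the invalid single expression age2 as META(age, INTEGER) + 10.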

marcomass commented 7 years ago

@akaitoua Unfortunately I have to reopen this issue: everything seems to work fine, except that the generated XML schema included in the output dataset sets all newly created region attributes to DOUBLE, including those created as STRING (see attributes cell and mytext in the example query below).

I suppose TYPE can assume the following values only, as in issue4 :

As a test query, you can use the following:

DATA = SELECT(cell == "Urothelia" AND ID == "40") HG19_ENCODE_NARROW;
D0 = PROJECT(metadata_update: value AS 4.7, mytext AS "Hello") DATA;
D = PROJECT(region_update: cell AS META(cell, STRING), mytext AS META(mytext, STRING), cell_tier AS META(cell_tier, INTEGER), cell_tier2 AS META(cell_tier, DOUBLE), value AS META(value, DOUBLE)) D0;
MATERIALIZE D into D;
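
To spell out the schema one would expect for this query, the following Scala sketch extracts the declared TYPE of each new region attribute from the region_update text. It is not the GMQL compiler: Field, MetaItem and schemaFields are hypothetical, and the only TYPE names used are those that appear in the query above.

```scala
object SchemaTypeSketch {
  // Hypothetical schema entry: attribute name and declared type.
  final case class Field(name: String, tpe: String)

  // Matches items of the form: name AS META(attribute, TYPE)
  private val MetaItem = """(\w+)\s+AS\s+META\(\s*\w+\s*,\s*(\w+)\s*\)""".r

  // Collect one schema field per META item, keeping the declared type
  // instead of defaulting every new attribute to DOUBLE.
  def schemaFields(regionUpdate: String): List[Field] =
    MetaItem.findAllMatchIn(regionUpdate)
      .map(m => Field(m.group(1), m.group(2).toUpperCase))
      .toList
}
```

For the region_update clause of the test query, this yields cell and mytext as STRING, cell_tier as INTEGER, and cell_tier2 and value as DOUBLE.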

akaitoua commented 7 years ago

The schema is set at compile time. @pp86, would you take a look? Thanks.