DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Math functions (basic) both for metadata and region attribute values #58

Closed marcomass closed 7 years ago

marcomass commented 7 years ago

Enable basic math functions, such as sqrt (and exponential), over metadata (and possibly region) attributes.

akaitoua commented 7 years ago

@marcomass, I do not see this as within the scope of our engine. Our engine is designed for performing GenoMetric queries that focus on region operations on big data engines. These functions can easily be performed with other tools; we cannot create a Swiss Army knife, as it would make our system too complicated even to learn.

marcomass commented 7 years ago

@akaitoua Actually, the scope of this issue is fully in line with what you mention, i.e., to fully support operations on big data genomic regions, beyond the trivial processing we were used to in V1. Here is a simple example, just to give the idea. Suppose you compute a histogram on the regions of several ChIP-seq samples (after merging them), calculating the accumulation index of the original regions in each of the regions identified by the histogram function. Then you want to select only the new histogram regions whose accumulation index is greater than the median accumulation index + 2 standard deviations of that index (this is the usual way to select interesting outliers). Currently you can calculate the median of the accumulation index and store it in the metadata, but you cannot calculate its standard deviation (and thus the threshold median+2*std). If basic math functions (sqrt, at least) were available over the metadata, this would be possible; you could then select the regions with outlier accumulation and use them as a mask to select specific genomic regions in other experiments. By the way, this is part of the processing to identify dense genomic regions, a project we have been trying to develop with the IIT people for a couple of years, as you may remember. Are you more convinced now?
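
As a rough illustration of the threshold described above (plain Python rather than GMQL, with made-up accumulation values), the standard deviation is the square root of the variance, which is why sqrt is the missing piece. The function name, the sample data, and the choice of the population (rather than sample) standard deviation are all assumptions made for this sketch.

```python
import math
import statistics


def outlier_threshold(values, k=2.0):
    """Return median + k * population standard deviation of `values`.

    Hypothetical helper: mirrors the "median + 2*std" rule from the comment.
    """
    mean = sum(values) / len(values)
    # Variance is the mean squared deviation; std is its square root.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return statistics.median(values) + k * std


# Hypothetical accumulation indexes, one per histogram region.
accumulation = [2, 4, 4, 4, 5, 5, 7, 9]

threshold = outlier_threshold(accumulation)
# Keep only the regions whose accumulation exceeds the threshold.
outliers = [v for v in accumulation if v > threshold]
print(threshold, outliers)
```

With these sample values the variance is 4, the standard deviation is 2, and the median is 4.5, so the threshold is 8.5 and only the region with accumulation 9 is kept.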

marcomass commented 7 years ago

Ok, so let's implement the sqrt function for use on metadata attributes (which can be cast to numeric). If possible/easy, let's also make this sqrt function applicable to region attributes. Pietro will add it at the compiler level.

akaitoua commented 7 years ago

SQRT is added for both Meta and regions.

I added: RESQRT()

Compiler needs to be updated to consider SQRT calculations.

marcomass commented 7 years ago

@pp86 Unfortunately I have to reopen this issue; please fix both of the following errors:

1) the SQRT(attribute_name) function is not recognized when applied on a metadata attribute (see the following query):

DATA = SELECT(cell == "Urothelia" AND ID == "40") HG19_ENCODE_NARROW;
RES = PROJECT(metadata_update: _value AS 9; region_update: myValue AS 9) DATA;
RES2 = PROJECT(metadata_update: _valueSQRT AS SQRT(_value)) RES;
MATERIALIZE RES2 into res2;

2) it generates the following DS_CREATION_FAILED runtime error when applied on a region attribute as follows (no problem when applied on a constant, e.g., AS SQRT(9)):

DATA = SELECT(cell == "Urothelia" AND ID == "40") HG19_ENCODE_NARROW;
RES1 = PROJECT(region_update: myValueSQRT AS SQRT(signal)) RES;
MATERIALIZE RES1 into res1;

2017-09-07 19:04:25,058 ERROR [Executor] Exception in task 0.0 in stage 15.0 (TID 213)
scala.MatchError: REFieldNameOrPosition(FieldName(signal)) (of class it.polimi.genomics.compiler.REFieldNameOrPosition)
    at it.polimi.genomics.GMQLServer.DefaultRegionExtensionFactory$.make_fun(DefaultRegionExtensionFactory.scala:63)
    at it.polimi.genomics.GMQLServer.DefaultRegionExtensionFactory$$anonfun$make_fun$19.apply(DefaultRegionExtensionFactory.scala:168)
    at it.polimi.genomics.GMQLServer.DefaultRegionExtensionFactory$$anonfun$make_fun$19.apply(DefaultRegionExtensionFactory.scala:168)
    at it.polimi.genomics.spark.implementation.RegionsOperators.ProjectRD$.computeFunction(ProjectRD.scala:54)
    ....

pp86 commented 7 years ago

@marcomass @akaitoua

I fixed point (1); point (2) is not under my control. I am removing my assignment here.

akaitoua commented 7 years ago

@pp86, I checked (2), and it looks like the compiler is sending the wrong arguments to the region extension factory when a column name is used.

pp86 commented 7 years ago

@akaitoua could you provide further details?

marcomass commented 7 years ago

@pp86 In the latest version available on the web, point (1) (the SQRT(metadata_attribute_name) function is not recognized when applied on a metadata attribute, e.g., PROJECT(metadata_update: _valueSQRT AS SQRT(_value)) RES;) still does not seem to be fixed; did you push the fix to the main branch compiled by Arif?

akaitoua commented 7 years ago

@pp86, the compiler is sending types inside RESQRT(REFieldNameOrPosition()) that are defined only in the Compiler and are not recognised by the core types (REFieldNameOrPosition is only defined in the Compiler). I expect to have one of the types in here
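
The failure mode described above can be sketched in Python as an analogy; this is not the actual Scala code. A factory that dispatches only on core node types has no case for a compiler-only node, so it fails as soon as a field name is wrapped in RESQRT, while a constant works. The names RESQRT, REFieldNameOrPosition and make_fun come from the stack trace; CoreConstant, CorePosition and resolve are hypothetical, and resolving names to positions on the compiler side is only one assumed way of fixing it.

```python
import math
from dataclasses import dataclass


@dataclass
class CoreConstant:            # hypothetical core-side node: numeric literal
    value: float


@dataclass
class CorePosition:            # hypothetical core-side node: column index
    index: int


@dataclass
class REFieldNameOrPosition:   # compiler-only node named in the stack trace
    name: str


@dataclass
class RESQRT:                  # square-root node wrapping another node
    arg: object


def make_fun(node):
    """Core side: build a function row -> float; knows only core node types."""
    if isinstance(node, CoreConstant):
        return lambda row: node.value
    if isinstance(node, CorePosition):
        return lambda row: row[node.index]
    if isinstance(node, RESQRT):
        inner = make_fun(node.arg)
        return lambda row: math.sqrt(inner(row))
    # Python analogue of scala.MatchError: no case matches this node type.
    raise TypeError(f"unhandled node type: {type(node).__name__}")


def resolve(node, schema):
    """Compiler side (assumed fix): replace field names with core positions."""
    if isinstance(node, REFieldNameOrPosition):
        return CorePosition(schema.index(node.name))
    if isinstance(node, RESQRT):
        return RESQRT(resolve(node.arg, schema))
    return node


schema = ["signal"]            # hypothetical region schema
row = [16.0]

constant_result = make_fun(RESQRT(CoreConstant(9.0)))(row)    # constant works
try:
    make_fun(RESQRT(REFieldNameOrPosition("signal")))         # field name fails
    unresolved_failed = False
except TypeError:
    unresolved_failed = True
resolved_result = make_fun(resolve(RESQRT(REFieldNameOrPosition("signal")), schema))(row)
```

This reproduces the reported asymmetry: SQRT over a constant evaluates, SQRT over an unresolved field name is rejected, and it evaluates once the name has been translated into a type the core understands.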

pp86 commented 7 years ago

@akaitoua ok, tnx. Fixed it.

marcomass commented 7 years ago

@pp86 Unfortunately the SQRT(...) function is still not recognized when used in metadata_update, whether applied on a metadata attribute or on a constant. Can we close this soon, since a student needs it?

Please see the testing query below and make it compilable:

DATA = SELECT(cell == "Urothelia" AND ID == "40") HG19_ENCODE_NARROW;
RES = PROJECT(metadata_update: _value AS 9; region_update: myValue AS 9) DATA;
RES2 = PROJECT(metadata_update: _valueSQRT AS SQRT(_value)) RES;
RES3 = PROJECT(metadata_update: MYcell_tier AS SQRT(9)) RES2;
MATERIALIZE RES3 into res3;

lucananni93 commented 7 years ago

added support also in the PythonAPI