DEIB-GECO / GMQL

GMQL - GenoMetric Query Language
http://www.bioinformatics.deib.polimi.it/geco/
Apache License 2.0

Percentage threshold of a region attribute values #38

Closed (marcomass closed 7 years ago)

marcomass commented 7 years ago

I noticed that for EXTEND, besides the usual aggregate functions (MIN, MAX, ...), the aggregate functions Q1, Q2 and Q3 are also implemented, providing respectively the quartiles of the values of the specified region attribute. This is very good! Even better would be a general function (e.g. named perc()) that returns the value of a specified region attribute corresponding to a given percentage of the values of that attribute. For example, suppose a sample has 100 regions with incremental integer values for the attribute score (i.e. 1, 2, 3, ..., 99, 100); the function perc(5, score) would return the value 5, corresponding to the fifth region of the sample when the regions are ordered by their score value. Can this additional aggregate function be implemented for EXTEND?
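To make the request concrete, here is a minimal Python sketch of what perc() could compute. It is not GMQL code and not the project's implementation. It assumes the nearest-rank definition of a percentile (the issue does not say whether values should be interpolated), and it takes a plain list of attribute values instead of a region attribute name.

```python
import math


def perc(k, values):
    """Return the nearest-rank k-th percentile of `values`.

    Assumption: nearest-rank definition, i.e. the value at 1-based
    position ceil(k * n / 100) once the values are sorted ascending.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < k <= 100:
        raise ValueError("k must be in the range (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing so that exact multiples of 100 stay exact.
    rank = math.ceil(k * len(ordered) / 100)
    return ordered[rank - 1]


# The example from the issue: 100 regions with scores 1, 2, ..., 100.
scores = list(range(1, 101))
threshold = perc(5, scores)
print(threshold)  # 5
```

Under this definition, perc(25, ...), perc(50, ...) and perc(75, ...) play the role of Q1, Q2 and Q3, although the existing quartile aggregates may use a different rounding or interpolation rule.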

akaitoua commented 7 years ago

@marcomass, if I understand this correctly, this can be done by performing a select on the score column with a value equal to 5 before performing the extend. I would prefer keeping it like this.

marcomass commented 7 years ago

@akaitoua performing the select is needed anyway, as you mention. This issue is about calculating the threshold (in the example, 5) to be used in the predicate of the select. In general, the values of the region attribute (in the example, score) can follow any distribution, and we want to calculate which of these values corresponds to the region in position k% when the regions are ordered by that attribute value. So the function perc(k, score) [with k = 5 in the example] calculates this value and stores it in the metadata, so that it can be used in the SELECT predicate (once the issue currently being handled by Olya is closed). Is it clearer now?
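The two-step workflow described here (compute the threshold and store it in the metadata, then use it in a select) can be sketched in Python as follows. This is only an illustration: the dictionary layout of a sample, the helper names extend_with_perc and select_by_meta_threshold, the metadata name score_p80 and the `>=` predicate are all assumptions, not GMQL syntax or the project's API. The percentile uses the nearest-rank rule, which is also an assumption.

```python
import math


def extend_with_perc(sample, k, attribute, meta_name):
    """Hypothetical EXTEND-like step: store the k-th percentile of a
    region attribute as a new metadata entry of the sample."""
    values = sorted(r[attribute] for r in sample["regions"])
    if not values:
        raise ValueError("sample has no regions")
    rank = math.ceil(k * len(values) / 100)  # nearest-rank, 1-based
    meta = dict(sample["meta"])
    meta[meta_name] = values[rank - 1]
    return {"meta": meta, "regions": sample["regions"]}


def select_by_meta_threshold(sample, attribute, meta_name):
    """Hypothetical SELECT-like step: keep the regions whose attribute
    is at least the threshold stored in the metadata (assumed predicate)."""
    threshold = sample["meta"][meta_name]
    kept = [r for r in sample["regions"] if r[attribute] >= threshold]
    return {"meta": sample["meta"], "regions": kept}


# A sample whose scores follow no particular distribution.
raw_scores = [0.2, 7.5, 3.1, 9.9, 0.7, 4.4, 1.8, 6.0, 2.6, 8.3]
sample = {
    "meta": {"id": "S1"},
    "regions": [
        {"chr": "chr1", "start": i * 100, "stop": i * 100 + 50, "score": s}
        for i, s in enumerate(raw_scores)
    ],
}

extended = extend_with_perc(sample, 80, "score", "score_p80")
selected = select_by_meta_threshold(extended, "score", "score_p80")
print(extended["meta"]["score_p80"])               # 7.5
print([r["score"] for r in selected["regions"]])   # [7.5, 9.9, 8.3]
```

The point of the sketch is that the threshold (7.5 here) is not known in advance: it depends on the sample's own values, which is why it has to be computed per sample before the select can use it.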