forcedotcom / phoenix


Add built-in function to bucketize a BigDecimal range #220

Open eldokk opened 11 years ago

eldokk commented 11 years ago

I have a SELECT query that applies aggregation for specific groups and for different key sets.

The best approach, I think, is to let Phoenix do the aggregation (it's fast and I guess it makes use of coprocessors), but I am missing the option of UNION in the SELECT query so that I can find an aggregate count for a selected key range.

Example: select item,volume,sum(price) from Order where k1 < value1 and k1 > value2 group by item,volume

Suppose I am plotting a chart with different ranges (you could say multiple bin sizes); I need to split the above query into:

select item,volume,sum(price) from Order where k1 < x.1 and k1 > x.100 group by item,volume
UNION ALL
select item,volume,sum(price) from Order where k1 < x.100 and k1 > x.200 group by item,volume

Does Phoenix allow this, or do I need to pull all the data to the client side and do the aggregation and binning there?

jtaylor-sfdc commented 11 years ago

For your example, just use OR: select item,volume,sum(price) from Order where (k1 < x.1 and k1 > x.100) or (k1 < x.100 and k1 > x.200) group by item,volume

This will be fast - it'll take advantage of our new skip scan.

Union is not supported yet.

eldokk commented 11 years ago

Thanks for your reply, jtaylor.

I think the skip scan will give considerable improvements for certain use cases. I have a table with 20 million rows for testing, and each of the queries described above returns about 25 percent of the total rows after the GROUP BY. I think the better solution would be to segment the rows by key range and do the GROUP BY per segment. If one segment has 10K rows, I can do a GROUP BY on each such segment (call it a bin) and get an aggregation for the specified values.

Suppose there are 1 million rows that match the key range and my segment size is 10K (k > x.1 and k < x.10000, k > x.10000 and k < x.20000 ... k > 1m-10k and k < 1m); the output will then total 100 rows after the GROUP BY.

Here is the point: pulling all 1 million rows to the client side and doing the aggregation logic there (maybe the same segment logic, a GROUP BY over multiple ranges, which is a requirement) will not perform well. I would lose the advantage of server-side coprocessing.

I am really looking for a GROUP BY over multiple ranges, or a UNION clause in SELECT (I think this would internally be multiple SELECT queries), to be included in Phoenix.

If you could guide me on this, that would be great.

Thanks -Eldo

jtaylor-sfdc commented 11 years ago

Phoenix does the filter and the group by on the server. You don't need union. Can you post a schema and query that is slower than expected and describe your particular use case?

eldokk commented 11 years ago

Hi James, please find the schema and use cases below.

CREATE TABLE IF NOT EXISTS well_20_million (
  well_depth VARCHAR NOT NULL,
  producttype VARCHAR NOT NULL,
  runno VARCHAR NOT NULL,
  sampledate DATE,
  count INTEGER,
  purity DECIMAL,
  volume DECIMAL,
  temperature INTEGER,
  operatorname VARCHAR,
  carbon_content DECIMAL,
  sampleId INTEGER,
  timepoint INTEGER,
  volumeanalyzed DECIMAL,
  CONSTRAINT PK PRIMARY KEY (well_depth, producttype, runno)
);

There are multiple runs (production) for each well and there are many producttype associated with each well.

well_depth    producttype    runno      sampledate    count    purity ...
1.25          shallow        1          01/01/2012    123      124.567
1.50          shallow        1          01/01/2012    150      140.257
...
100.25        dry            1          01/01/2012    075      124.888
...
0.025         deep           1          02/01/2012    234      200.456
0.050         deep           1          02/01/2012    123      200.234
...
depth-n       type-n         runno-n    date-n        count-n  purity-n

One of the use cases is to plot a chart of well_depth against average purity on the y axis, for many ranges of well_depth called bins. Since the well depth varies by different amounts and there are millions of such rows, we bin the well depth into specific ranges, that is 1 to 10, 10 to 20, etc. (depending on the min depth and max depth; the maximum bin size is 100), find the average purity for these bins, and plot the chart. There are many filters to apply, like product type, temperature, etc.

It's better to do the aggregation on the server side with the specific filters and do the grouping for each bin. As an example, rather than getting one aggregated value for each product type, there are multiple aggregations (averages) for each producttype according to the given ranges.

SELECT producttype, runno, count, sum(purity) from tblPrudction where well_depth > 1 and well_depth < 2.5 and producttype in ('shallow','deep') and temperature = 35 group by producttype, runno, count
UNION ALL
SELECT producttype, runno, count, sum(purity) from tblPrudction where well_depth > 2.5 and well_depth < 5 and producttype in ('shallow','deep') and temperature = 35 group by producttype, runno, count
UNION ALL
SELECT producttype, runno, count, sum(purity) from tblPrudction where well_depth > 5 and well_depth < 7.5 and producttype in ('shallow','deep') and temperature = 35 group by producttype, runno, count
... and so on.

The chart will have x-axis values of 1, 2.5, 5, 7.5, 10 ... and y-axis values (average(purity)) of 120, 130, 250, 100, 135 ... The result of the query will have 100 rows if there are 100 bins.

Using the skip scan (well_depth > 1 and well_depth < 2.5 or well_depth > 2.5 and well_depth < 5 or ...) will scan all those rows but yield only a single average for each producttype. My use case actually needs multiple aggregations for a single type, that is, multiple scan ranges and an aggregation for each such scan.

-Eldo

jtaylor-sfdc commented 11 years ago

Wow, thanks for all the detail! Interesting use case - I see now why you need separate ranges.

I think a good way to model this query would be to GROUP BY a well_depth "bucket". It'll be super efficient that way, and will give you one row per well_depth range, like you need. Can you get away with defining a uniform "step amount" for your well_depth, like every 2.5? You'll need a tweaked version of our TRUNC function that accepts DECIMAL types, something like BUCKET(<column>, <lower_bound>, <step>). Have you seen my blog post on how to add your own built-in function? Take a look at TruncFunction and RoundFunction.

Here's what your query would look like:

SELECT producttype, runno, count, sum(purity) from tblPrudction WHERE well_depth > 1 and well_depth < 10 and producttype in ('shallow','deep') and temperature = 35 GROUP BY bucket(well_depth, 0.0, 2.5), producttype, runno, count

Another advantage of this approach is that you'd be grouping by the row key columns, so the aggregation can take place without requiring additional memory during the region scan.
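As a rough sketch of the arithmetic such a BUCKET function would need (the method name, argument order, and floor-based rounding below are just placeholders to illustrate the idea, not existing Phoenix code):

import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration only: maps a DECIMAL value onto the lower bound of its bucket,
// so that GROUP BY collapses all rows in the same well_depth range into one group.
final class BucketSketch {

    // Returns the lower bound of the bucket containing value,
    // i.e. lower + floor((value - lower) / step) * step.
    static BigDecimal bucketStart(BigDecimal value, BigDecimal lower, BigDecimal step) {
        BigDecimal index = value.subtract(lower).divide(step, 0, RoundingMode.FLOOR);
        return lower.add(index.multiply(step));
    }

    public static void main(String[] args) {
        // bucketStart(3.7, 0.0, 2.5) -> 2.5, so depths 2.5 <= d < 5.0 group together
        System.out.println(bucketStart(new BigDecimal("3.7"),
                new BigDecimal("0.0"), new BigDecimal("2.5")));
    }
}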

What do you think? Would this work? If you're up for creating this built-in function, I'll pull it into the distribution.

eldokk commented 11 years ago

Hi James, I think this model will solve the issue; nice approach. I have another plan as well: holding a Phoenix table with pre-aggregated data so that we don't require a UNION and the data set is fetched on the fly.

I will be looking into this approach after some time, so that it could be used as an alternative or even to address some other use cases.

Thanks -Eldo

jtaylor-sfdc commented 11 years ago

Great! Let me know if you need any help writing the built-in function. I think maybe I'd call it BUCKETIZE, and I'd have its return type be a PDataType.BIGINT. The argument could be DECIMAL.
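Roughly, the DECIMAL-to-BIGINT mapping could be something like the following sketch (just the math; the real built-in would be wired up the same way TruncFunction and RoundFunction are, and the names here are only placeholders):

import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration only: computes the BIGINT bucket index a BUCKETIZE(value, lower, step)
// call might return, i.e. floor((value - lower) / step) as a long.
final class BucketizeSketch {

    static long bucketIndex(BigDecimal value, BigDecimal lower, BigDecimal step) {
        return value.subtract(lower)
                .divide(step, 0, RoundingMode.FLOOR)
                .longValueExact();
    }

    public static void main(String[] args) {
        // Depths 0.0 <= d < 2.5 map to index 0, 2.5 <= d < 5.0 to index 1, and so on.
        System.out.println(bucketIndex(new BigDecimal("3.7"),
                new BigDecimal("0.0"), new BigDecimal("2.5"))); // prints 1
    }
}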

jtaylor-sfdc commented 11 years ago

Hello @eldokk - are you still planning on pursuing this? Let me know if I can help.