forcedotcom / phoenix

BSD 3-Clause "New" or "Revised" License
558 stars 227 forks source link

Take advantage of region boundaries for min/max queries #55

Open ryang-sfdc opened 11 years ago

ryang-sfdc commented 11 years ago

Rowkey is led by date non null.

explain select max(date) from foobar where date < to_date('2013-02-17 00:00:00')

[["PLAN"], ["CLIENT PARALLEL 7-WAY RANGE SCAN OVER FOOBAR TO (2013-02-17 00:00:00) EXCLUSIVE"], [" SERVER FILTER BY FirstKeyOnlyFilter"], [" SERVER AGGREGATE INTO SINGLE ROW"]]

Does this plan imply we scan from the beginning of the table up to the specified endpoint?

jtaylor-sfdc commented 11 years ago

Yes, from beginning of table to specified end point of 2013-02-17. Were you expecting something else?

ryang-sfdc commented 11 years ago

This is probably too speculative, but could region boundaries be employed to provide a lower bound?

jtaylor-sfdc commented 11 years ago

I don't see how region boundaries help you in this case if date is not the leading part of the key. For example, if you have HOST CHAR(2), EVENT_DATE DATE as your schema, then the MAX(date) could be in any region and in any row, since the order depends on the HOST first. For example: AA date AB date - 1 AC date + 1 BA date - 2 ZZ date + 3

In the above case, the date could be anything, and the rows would still sort in that order. Stats only help in this case to balance the load on parallelization. We couldn't really rely on the max/min in the stats being the answer to a query, because it'll be updated asynchronously at some configurable time interval. Something may have snuck in as min/max after the last stats gathering was done.

jtaylor-sfdc commented 11 years ago

There is one case I can think of where region boundaries may be helpful. If the HOST value is more like an enum with a few limited values, then the same HOST value might repeat for multiple region boundaries. In that case, for something like MAX(event_date), you could skip regions where HOST repeats until the last one. For example, say the region boundaries are: NA 11111111 NA 22222222 NA 33333333 TX 11111111 TX 22222222 ZZ 11111111

You could skip the first two NA regions, since you'd know the MAX(event_date) would be in them. Then you could skip the first TX region too. etc.

ryang-sfdc commented 11 years ago

In this case, the rowkey is led by the date column. So ideally, only one region needs to be scanned, not the entire table up to the specified ceiling.

The only questionable part for me of depending on region boundaries is how to handle retrying speculation that's been made invalid by splits and whatnot.

jtaylor-sfdc commented 11 years ago

If date is leading the row key, we can definitely do a better job, but it would really only apply if max is the only thing being selected. Right now for your original query, we'd scan every region up to to_date('2013-02-17 00:00:00'). We really only need to look at the region containing to_date('2013-02-17 00:00:00'). Would be good to generalize this. I'll morph this issue into that.