forcedotcom / phoenix

BSD 3-Clause "New" or "Revised" License

Gather and maintain stats for HBase tables in a designated HBase table #64

Open jtaylor-sfdc opened 11 years ago

jtaylor-sfdc commented 11 years ago

Our current stats gathering is way too simplistic - it's only keeping a cache per client connection to a cluster for the min and max key for a table. Instead, we should:

  1. have a system table that stores the stats
  2. create a coprocessor that updates the stats during compaction (i.e. using the preCompactSelection, postCompactSelection, preCompact, postCompact methods)
  3. keep a kind of histogram - the key boundary of every N bytes within a region. Perhaps we can do a delta update on minor compaction and a complete update on major compaction.
  4. keep the min key/max key of a table in the stats table too
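Setting aside the coprocessor plumbing, the histogram in steps 3 and 4 amounts to a small collector that sees key/values in sorted order (as a compaction scanner would) and emits a boundary key every ~N bytes, plus the min/max key. A minimal sketch, with hypothetical names that are not Phoenix or HBase API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector illustrating steps 3-4; not actual Phoenix code.
public class GuidepostCollector {
    private final long targetBytes;          // emit a boundary key every ~targetBytes
    private long bytesSinceLast = 0;
    private final List<byte[]> guideposts = new ArrayList<>();
    private byte[] minKey, maxKey;

    public GuidepostCollector(long targetBytes) {
        this.targetBytes = targetBytes;
    }

    // Called once per key/value, in sorted row-key order.
    public void addKeyValue(byte[] rowKey, int valueSize) {
        if (minKey == null) minKey = rowKey;  // first key seen is the min
        maxKey = rowKey;                      // last key seen is the max
        bytesSinceLast += rowKey.length + valueSize;
        if (bytesSinceLast >= targetBytes) {
            guideposts.add(rowKey);           // key boundary for the histogram
            bytesSinceLast = 0;
        }
    }

    public List<byte[]> getGuideposts() { return guideposts; }
    public byte[] getMinKey() { return minKey; }
    public byte[] getMaxKey() { return maxKey; }
}
```

On major compaction the collected boundaries would replace the stats-table rows for the region wholesale; a minor compaction would only patch the affected key range.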
tonyhuang commented 11 years ago

Hi Jesse, could you let me know when there's an RC for this ticket?

Thanks, Tony

testn commented 11 years ago

Do you think we could optimize queries better if we had cardinality information in the table? If so, HyperLogLog might be a good choice.
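For reference, the core of HyperLogLog fits in a few dozen lines: hash each value, use the top bits to pick a register, and keep the maximum leading-zero rank per register. This is a generic sketch for illustration, not tied to any Phoenix code; `hash64` is a splitmix64-style mixer chosen arbitrarily:

```java
// Minimal HyperLogLog sketch (illustrative; not production code).
public class HyperLogLog {
    private final int p;          // index bits
    private final int m;          // register count = 2^p
    private final byte[] registers;

    public HyperLogLog(int p) {
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // splitmix64-style finalizer, used here as a 64-bit hash.
    public static long hash64(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public void add(long hash) {
        int idx = (int) (hash >>> (64 - p));            // top p bits pick a register
        long rest = hash << p;                          // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1; // position of first 1-bit
        if (rank > 64 - p + 1) rank = 64 - p + 1;       // cap when rest == 0
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public double estimate() {
        double alpha = 0.7213 / (1 + 1.079 / m);
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) zeros++;
        }
        double e = alpha * m * m / sum;
        if (e <= 2.5 * m && zeros > 0) {
            e = m * Math.log((double) m / zeros);        // small-range correction
        }
        return e;
    }
}
```

With 2^p registers the standard error is about 1.04 / sqrt(2^p), and the sketch for the whole table is just the element-wise max of the per-region sketches, which fits the compaction-driven update model above.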

jtaylor-sfdc commented 11 years ago

Wow, that HyperLogLog is pretty interesting - thanks for the pointer. For stats, we're calculating them at major compaction, where a full pass is made through the data anyway, so I don't think it'll help there. But for COUNT DISTINCT and SELECT DISTINCT, it could definitely be useful.

testn commented 11 years ago

It only gives the cardinality, not the unique values themselves. I'm wondering whether we could combine a HyperLogLog with a BloomFilter on the column values to determine the strategy for aggregating the data. If so, that would be awesome.
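The BloomFilter half of that combination is equally compact: set k bit positions per value, and report "might contain" only if all k are set. A minimal sketch for illustration (double hashing derives the k probe positions from one 64-bit hash; `mix64` is an arbitrary splitmix64-style mixer, not any Phoenix or HBase API):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative; not production code).
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // splitmix64-style finalizer, used here as a 64-bit hash.
    public static long mix64(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Double hashing: k probe positions from the two 32-bit halves of one hash.
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(long hash) {
        for (int i = 0; i < hashes; i++) bits.set(index(hash, i));
    }

    public boolean mightContain(long hash) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(hash, i))) return false; // definitely absent
        }
        return true; // present, up to the false-positive rate
    }
}
```

A possible pairing along the lines suggested above: the HyperLogLog estimates how many distinct values an aggregation would produce, while the Bloom filter cheaply rules out values that definitely don't occur, and the planner picks a strategy from the two.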