Huawei-Hadoop / hindex

Secondary Index for HBase
Apache License 2.0

Indexing arbitrary column qualifier #53

Closed by anoopsjohn 10 years ago

anoopsjohn commented 10 years ago

This is a request from alan.wu@oracle.com. I am raising this issue after an offline discussion with him.

HBase allows arbitrary column names, and users might be relying on that. When there are condition-based queries on these columns (cf:q), an index can help.

How can we do this?

We can allow a special index in which the column type is marked as BINARY and the column name is a special, reserved name. When storing a column's value in the index table, we also prepend the actual qualifier name to the value. I believe we have already removed the padding mechanism. When selecting an index at read time, if there is no index defined explicitly on the column with that name, we can fall back to this new special index.
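To make the idea concrete, here is a minimal sketch of how the qualifier name could be prepended to the value, and how the fallback at read time could look. The byte layout (a 2-byte length prefix), the class name `ArbitraryIndexValue`, and the methods `encode` and `chooseIndex` are all assumptions for illustration only; this is not hindex's actual index format or API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class ArbitraryIndexValue {

    // Hypothetical encoding, not hindex's actual format:
    // [2-byte qualifier length][qualifier bytes][column value bytes].
    public static byte[] encode(byte[] qualifier, byte[] value) {
        if (qualifier.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException("qualifier too long for a 2-byte length prefix");
        }
        ByteBuffer buf = ByteBuffer.allocate(2 + qualifier.length + value.length);
        // The length prefix keeps ("ab", "c") distinct from ("a", "bc").
        buf.putShort((short) qualifier.length);
        buf.put(qualifier);
        buf.put(value);
        return buf.array();
    }

    // Read-time selection: prefer an index defined on the exact column,
    // otherwise fall back to the special arbitrary index.
    public static String chooseIndex(Map<String, String> indexByColumn,
                                     String column,
                                     String arbitraryIndexName) {
        String explicit = indexByColumn.get(column);
        return explicit != null ? explicit : arbitraryIndexName;
    }

    public static void main(String[] args) {
        byte[] encoded = encode("q1".getBytes(StandardCharsets.UTF_8),
                                "v".getBytes(StandardCharsets.UTF_8));
        System.out.println("encoded length = " + encoded.length);
    }
}
```

One consequence of this particular layout is that entries sort by qualifier length first, which is fine for equality lookups but would need rethinking for range conditions.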

Is anyone up for the impl? I can help with it. Otherwise, I can work on this in Q1 2015.

anoopsjohn commented 10 years ago

Started working on this.

anoopsjohn commented 10 years ago

A basic impl is ready. I will commit it once all the existing tests have been run and pass with the change. Waiting for your response, @chrajeshbabu.

chrajeshbabu commented 10 years ago

@anoopsjohn The patch you sent does not apply. Can you please rebase, regenerate the patch, and send it to me again? Thanks.

anoopsjohn commented 10 years ago

A basic impl is done and pushed to the hbase-98 branch. A single arbitrary index can be created on a table. The user passes all the column families (cfs) of the table that need to be indexed. During writes, we index each and every arbitrary qualifier (Q) in those cfs; each cf:q is indexed with one entry in the index table. During a Scan with a condition, the arbitrary index is also used.

There are still some more TODOs:

- Index entries are added only during puts. Deletion of index data when table data is deleted (for the arbitrary index) is not done in this version. (This won't produce any incorrect results on Scan.)
- Only a SingleColumnValueFilter (SCVF) condition with an equals comparison is supported on the arbitrary index for now. Range condition support is yet to be added.
- Some clearer validations are required on table create/modify.

One more limitation: when a table has an arbitrary index on it, we don't allow creating any other index on the same table. This can be supported in later versions. I believe the use case might not really exist anyway, because the Q names on the table are arbitrary.
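For readers who want a feel for the write and scan behaviour described in this comment, here is a rough in-memory sketch. It does not use the HBase or hindex APIs: the class `ArbitraryIndexSketch`, its methods, the string key layout, and the `\u0000` separator are all hypothetical, and a sorted set stands in for the index table. It only models "one index entry per cf:q on put" and an equals lookup; deletes and range conditions are left out, matching the TODOs above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ArbitraryIndexSketch {
    // Assumed separator; this sketch assumes it never occurs in names or values.
    private static final char SEP = '\u0000';

    // Column families covered by the single arbitrary index.
    private final Set<String> indexedFamilies;
    // Stand-in for the index table: sorted keys of the form
    // family SEP qualifier SEP value SEP rowKey.
    private final TreeSet<String> indexTable = new TreeSet<>();

    public ArbitraryIndexSketch(Set<String> indexedFamilies) {
        this.indexedFamilies = new HashSet<>(indexedFamilies);
    }

    // On every put, add one index entry for the cf:q if its family is indexed.
    public void put(String rowKey, String family, String qualifier, String value) {
        if (indexedFamilies.contains(family)) {
            indexTable.add(prefix(family, qualifier, value) + rowKey);
        }
    }

    // Equals condition only: scan the contiguous run of keys sharing the prefix.
    public List<String> rowsWhereEquals(String family, String qualifier, String value) {
        String prefix = prefix(family, qualifier, value);
        List<String> rows = new ArrayList<>();
        for (String key : indexTable.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // left the run of matching entries
            }
            rows.add(key.substring(prefix.length()));
        }
        return rows;
    }

    private static String prefix(String family, String qualifier, String value) {
        return family + SEP + qualifier + SEP + value + SEP;
    }

    public static void main(String[] args) {
        ArbitraryIndexSketch index = new ArbitraryIndexSketch(new HashSet<>(Arrays.asList("cf")));
        index.put("row1", "cf", "color", "red");
        index.put("row2", "cf", "color", "red");
        index.put("row3", "cf", "color", "blue");
        System.out.println(index.rowsWhereEquals("cf", "color", "red"));
    }
}
```

Because every qualifier in an indexed family lands in the same sorted structure, a lookup needs no prior knowledge of which qualifier names exist, which is the point of the arbitrary index.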