What's the timeline for this?
Can you say a bit more about how this functionality could be implemented? From what I read at http://hadoop-hbase.blogspot.com/2012/10/musings-on-secondary-indexes.html, it looks like it is not easy to maintain a globally consistent index. I noticed that Jesse did some work with Culvert. Is that what you planned to incorporate?
It'll be a phased approach: Phase 1 will include:
We've got some immutable data use cases where Map/Reduce jobs generate a Phoenix-compatible HFile and then the HFile is handed off to HBase. The above works fine for this scenario, since there are no incremental updates (i.e. the guaranteed-consistency issue doesn't matter). We'd just generate one primary table HFile plus one HFile per index.
Phase 2 would include guaranteed consistent index maintenance. Jesse, Lars, and I have been talking about how to do this. We did some perf testing of their initial idea, but the batched gets needed to retrieve the data rows limit its usefulness (requiring a maximum selectivity of maybe 2-3%). Instead, Jesse and Lars have come up with a way of guaranteeing consistency at write time so that the reads aren't required to join back to the data table. They're still hashing it out a bit, but it seems to be gelling. Maybe we can get them to blog about it? @jyates @lhofhansl
Now, our indexing approach works as follows. The index row consists of four parts:
We use the index metadata API to create and delete indexes. We use the postPut method of the region observer to build the index, the preDelete method of the region observer to delete index data, and the postSplit hook of the region observer to ensure the index table and the data table are split at the same time and to update the index data. We use the postBalance method of the master observer to keep each data region and its related index region on the same region server.
When a batch of data is to be put or deleted, we use MapReduce to build or delete the indexes from the client side.
We use a coprocessor endpoint to query data with the index. Based on the query, we scan the index table to get the row keys of the data table; then we batch gets on those rows to produce the query result.
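For concreteness, here is a rough sketch (not our actual code) of what the postPut-driven index maintenance described above could look like against the HBase 0.94 coprocessor API. The table, column family, and qualifier names are invented, and delete handling, split handling, and region balancing are omitted:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleIndexObserver extends BaseRegionObserver {
    // All names here are hypothetical, chosen only for illustration.
    private static final byte[] INDEX_TABLE = Bytes.toBytes("IDX_EMP_emp_name");
    private static final byte[] FAMILY = Bytes.toBytes("cf");
    private static final byte[] INDEXED_QUALIFIER = Bytes.toBytes("emp_name");
    private static final byte[] SEPARATOR = new byte[] { 0 };
    private static final byte[] EMPTY = new byte[0];

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx, Put put,
            WALEdit edit, boolean writeToWAL) throws IOException {
        List<KeyValue> kvs = put.get(FAMILY, INDEXED_QUALIFIER);
        if (kvs.isEmpty()) {
            return; // this Put didn't touch the indexed column
        }
        KeyValue kv = kvs.get(0);
        // Index rowkey = indexed value + separator + data rowkey, so that
        // duplicate indexed values still yield unique index rows.
        byte[] indexRow = Bytes.add(kv.getValue(), SEPARATOR, put.getRow());
        Put indexPut = new Put(indexRow);
        indexPut.add(FAMILY, EMPTY, kv.getTimestamp(), EMPTY); // marker cell only
        HTableInterface indexTable = ctx.getEnvironment().getTable(INDEX_TABLE);
        try {
            indexTable.put(indexPut);
        } finally {
            indexTable.close();
        }
    }
}
```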
May we join in the secondary index development? Thanks!
Thank you, @HaisenLee, for your offer to help with secondary indexing. We're already pretty far into the implementation of phase 1, and phase 2 requires HBase core functionality to be added, which Jesse and Lars are working on. Would you be interested in helping with any of the other open issues?
@jtaylor-sfdc Sure, I'd be pleased to help with the open issues.
Hi @jtaylor-sfdc I'm a little curious why the index table's rowkey needs to contain the primary rowkey. Let's say there is a table called "EMP"; this table has columns such as emp_name, age, dept_no ... and we want to create an index on emp_name. Assume the index table is called IDX_EMP_emp_name. I think the mapping relation from EMP {e001, Tom, 31, d001} to IDX_EMP_emp_name {Tom, e001} should be pretty clear. I know there could be more than one "Tom" in "EMP"; because of that, we'd need to save the primary rowkey as a qualifier name instead of as a value.
Any thoughts?
Here are the client-side changes to add secondary index support:
CREATE INDEX <index_name>
ON <table_name> (<column_ref> [ASC|DESC], ...)
INCLUDE (<column_ref>...) fam_properties
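For instance, a purely hypothetical statement conforming to this grammar (schema, table, and column names made up), issued through the Phoenix JDBC driver, might look like:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        // index_name = idx_emp_name, table_name = my_schema.emp,
        // indexed columns = (emp_name DESC, dept_no), covered column = age;
        // fam_properties for the HBase table/column families could be appended at the end.
        stmt.execute(
            "CREATE INDEX idx_emp_name ON my_schema.emp (emp_name DESC, dept_no) INCLUDE (age)");
        stmt.close();
        conn.close();
    }
}
```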
Make sure you reuse the rules we already have to parse table_name (optionally including the schema name), column_ref (optionally including the column family name), and fam_properties (to pass through properties for the HBase table and column families). This should take care of the parsing, compiling, and caching of the index metadata. For the usage of it, we'll need to do the following:
This should take care of the usage part of things. For the index maintenance, talk with @jyates. He's got the plumbing all worked out. Basically, there's an interface you need to implement where, given the Put/Delete list from a data table, you return the Put/Delete list for an index table. You'll also need to send over a bit of metadata in the Put/Delete operation to indicate which column qualifiers on the data table must be retrieved to build the Put/Delete list for the index. You'll likely need to send over the RowKeySchema too. Make sure that you delegate to a separate Phoenix class to figure out the list of mutations given a Put/Delete on the main table. The reason is that I'd like to provide a method in PhoenixRuntime, similar to getUncommittedData, that gets the List<KeyValue> for each index table. This will provide a way of generating an HFile for map/reduce jobs for the index tables that'll be consistent with the data table.
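To make the shape of that contract concrete, here is a rough sketch of what such an interface might look like. This is not the actual interface from Jesse's work; the name and signature are hypothetical:

```java
import java.util.List;

import org.apache.hadoop.hbase.client.Mutation;

public interface ExampleIndexBuilder {
    /**
     * Given the mutations (Puts/Deletes) applied to the data table, return the
     * mutations that must be applied to the index table to keep it in sync.
     * Implementations may need extra metadata carried on the data-table mutation
     * (e.g. which data-table qualifiers to read back, or the RowKeySchema) in
     * order to reconstruct the index rows.
     */
    List<Mutation> getIndexUpdates(List<Mutation> dataTableMutations);
}
```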
Seems great, @jtaylor-sfdc - is it in a branch/fork already? Looking forward to seeing how it was implemented.
For the maintenance side, I'm working on getting a patch into HBase (HBASE-8636) so indexing can support a compressed WAL (should be in 0.94.9). After that goes in, I'll send up a pull request to Phoenix.
It's nothing too fancy though - there is no magic. It's just hacking the WAL entries to get durability, but otherwise it only has a passing adherence to ACID - it only meets the HBase expectations. Coming soon - promise!
Request to @jyates - for the compressed WAL patch, can you make your index stuff not have a hard dependency on that? I'd like folks to be able to use the indexing with 0.94.4 and above. We can detect if it's pre-0.94.9 and throw if an attempt is made to create an index on an HBase table that has compressed WAL enabled.
Yeah, I think we can do that - shouldn't be too hard.
A couple more additions, @tonyhuang. Instead of using a VARBINARY, just use the data row PK columns as they are. That way you can just remove any that are already in the index, and you won't have any duplication.
Got you.
One other consideration when deciding on the "right" query plan to choose: you'll want to consider the ORDER BY clause as well. If we're ordering by the leading indexed columns, we'll definitely want to use that index, even if there's no start/stop row key formed for that index.
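To make the ORDER BY point concrete, here is a small hypothetical example (all names invented, assuming an index on (emp_name, dept_no)). The query forms no start/stop row key for the index, but because its ORDER BY matches the leading indexed column, serving it from the index would avoid a separate sort step:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrderByPlanExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        Statement stmt = conn.createStatement();
        // Both selected columns are covered by the hypothetical index, and the
        // ORDER BY matches its leading column, so the index is the natural plan
        // even though there is no WHERE clause to form a start/stop key.
        ResultSet rs = stmt.executeQuery(
            "SELECT emp_name, dept_no FROM my_schema.emp ORDER BY emp_name");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```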
@mujtabachohan, @jyates, @simontoens, @elilevine, @anoopsjohn, @ramkrish86, @maryannxue, @ivarley @lhofhansl @ryang-sfdc @srau Phase 1 of secondary indexing is ready for testing. This basically includes:
If you want a little example, take a look at our tests here and here.
I think at a minimum we need more testing prior to releasing (#321). Nice to have would be #281, #337, and #336, but I'd be ok with a release that just hardens what's there now, since it helps in our initial use cases. Also, we have a lot of other great stuff that I'd like to get out there in a release as well.
Thoughts?
Hey James - that's awesome. We will give this a try!
@jtaylor-sfdc Looks like a sensor network simulation would be a good test case for what's in place now.
Do you guys have a framework for larger-scale integration tests? Or any thoughts on one? I've done some recent work with benchpress. For testing secondary indexing and other features on clusters with a meaningful amount of test data and realistic application simulations, please consider opening issues to brainstorm ideas, and I will as well.
Support for secondary indexes over tables with immutable rows is in the 2.0.0 release. The incremental maintenance piece coming shortly will be tracked by #336, so I'm closing this issue.
Allow users to create indexes through a new CREATE INDEX DDL command and then behind the scenes build multiple projections of the table (i.e. a copy of the table using re-ordered or different row key columns). Phoenix will take care of maintaining the indexes when DML commands are issued and will choose the best table to use at query time.
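As a hedged end-to-end sketch of what that flow could look like from a client through the Phoenix JDBC driver (table, column, and index names are invented for illustration; the index maintenance and plan selection described above happen behind the scenes):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SecondaryIndexFlow {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        conn.setAutoCommit(false);
        Statement stmt = conn.createStatement();

        stmt.execute("CREATE TABLE IF NOT EXISTS emp (" +
                     "  emp_id VARCHAR NOT NULL PRIMARY KEY," +
                     "  emp_name VARCHAR, age INTEGER, dept_no VARCHAR)");
        // Behind the scenes this builds a projection of EMP keyed by emp_name.
        stmt.execute("CREATE INDEX idx_emp_name ON emp (emp_name) INCLUDE (dept_no)");

        // DML goes against the data table; Phoenix keeps the index in step.
        PreparedStatement upsert = conn.prepareStatement(
            "UPSERT INTO emp VALUES (?, ?, ?, ?)");
        upsert.setString(1, "e001");
        upsert.setString(2, "Tom");
        upsert.setInt(3, 31);
        upsert.setString(4, "d001");
        upsert.execute();
        conn.commit();

        // This query only touches columns covered by the index, so the optimizer
        // is free to answer it from idx_emp_name instead of the data table.
        ResultSet rs = stmt.executeQuery(
            "SELECT emp_name, dept_no FROM emp WHERE emp_name = 'Tom'");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        upsert.close();
        stmt.close();
        conn.close();
    }
}
```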