Huawei-Hadoop / hindex

Secondary Index for HBase
Apache License 2.0

Steps to implement hindex to my hbase cluster #50

Open kunkumar opened 10 years ago

kunkumar commented 10 years ago

I was able to build the project and run the MapReduce bulk insert and the incremental file load:

hbase org.apache.hadoop.hbase.index.mapreduce.IndexImportTsv
hbase org.apache.hadoop.hbase.index.mapreduce.IndexLoadIncrementalHFile

But something strange is happening: after the process completed for just 6 GB of data, the size of the HBase table kept increasing until it reached 200 GB, at which point I had to shut down the cluster.

Please suggest what is going wrong here.

Thanks

hy2014 commented 10 years ago

Maybe your rowkey is too long, I think.

kunkumar commented 10 years ago

I have created an HBase table and an index table with the hindex framework, but when we upload more data into the same table, only the index table keeps growing and no actual data appears in the HBase table. In this case my input data is 80 GB, the index table has grown to 200+ GB, and no new data is appearing in the main table.

Can rowkey size be the reason for such a huge table size?

hy2014 commented 10 years ago

The index table rowkey contains the indexed column value and the user table rowkey. As you said, your user table's data size has not changed, so it is the index table that accounts for the growth in data size.
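To make the rowkey point concrete, here is a back-of-the-envelope sketch in Python. It assumes only what is stated above, namely that each index row key carries the indexed value plus the user table rowkey. The function name, the `per_cell_overhead` parameter, and all the numbers are hypothetical and not taken from this thread; the estimate ignores compression, replication, and HBase's own per-cell storage overhead.

```python
def estimate_index_bytes(rows, index_value_len, user_rowkey_len,
                         num_indexes=1, per_cell_overhead=0):
    """Rough lower bound on the raw index key bytes, under the stated assumptions.

    rows              -- number of rows in the user table
    index_value_len   -- bytes of the indexed column value stored in the index key
    user_rowkey_len   -- bytes of the user table rowkey repeated in the index key
    num_indexes       -- how many indexes are defined on the table
    per_cell_overhead -- hypothetical extra bytes per index row (default 0)
    """
    per_row = index_value_len + user_rowkey_len + per_cell_overhead
    return rows * num_indexes * per_row


# Illustrative numbers only: 100 million rows, 50-byte indexed values,
# 100-byte user rowkeys, two indexes.
total = estimate_index_bytes(100_000_000, 50, 100, num_indexes=2)
print(total / 1e9, "GB of raw index key data")
```

With long rowkeys and several indexes, the index key data alone can approach or exceed the size of the input, which is why checking the rowkey length is a reasonable first step.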

SilentMing commented 9 years ago

Is there a detailed description of how to implement hindex in an existing cluster?

abhi-kr commented 9 years ago

For an existing cluster, if all the required HBase secondary index configurations are already in place on your cluster machines (HMaster and RegionServers; otherwise, make the configuration changes and restart your cluster), you can use the class "org.apache.hadoop.hbase.index.mapreduce.TableIndexer" to create an index on existing user tables:

./hbase org.apache.hadoop.hbase.index.mapreduce.TableIndexer -Dtablename.to.index= -Dtable.columns.index='IDX1=>cf1:[q1->datatype&length];cf2:[q1->datatype&length],[q2->datatype&length],[q3->datatype&length]#IDX2=>cf1:q5,q5'

Here:
tablename.to.index: Name of the table to create the index on.
table.columns.index: Table columns on which the index is to be created.

The format used here is:
IDX1 - Name of the index, given by the user
cf1 - Column family name of the user table
q1 - Qualifier name
datatype - Datatype of the column values of "cf1:q1" [Int, String, Double, Float]
length - Maximum length of the values of "cf1:q1"

"#" is used to separate the details of two indexes.
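Since the index specification string is easy to get wrong by hand, a small helper can assemble it from a structured description. This is only a sketch in Python that follows the format described above ("=>" after the index name, ";" between column families, "," between qualifiers, "#" between indexes). The helper name `build_index_spec` and the sample table layout are hypothetical, and the output has not been checked against TableIndexer's actual parser.

```python
def build_index_spec(indexes):
    """Build a value for -Dtable.columns.index.

    indexes: {index_name: {column_family: [(qualifier, datatype, max_length), ...]}}
    """
    index_parts = []
    for index_name, families in indexes.items():
        family_parts = []
        for family, columns in families.items():
            # Each qualifier is written as [qualifier->datatype&length]
            cols = ",".join(
                f"[{qualifier}->{datatype}&{length}]"
                for qualifier, datatype, length in columns
            )
            family_parts.append(f"{family}:{cols}")
        # Column families within one index are separated by ";"
        index_parts.append(f"{index_name}=>" + ";".join(family_parts))
    # Separate indexes are joined with "#"
    return "#".join(index_parts)


# Hypothetical table layout, for illustration only.
spec = build_index_spec({
    "IDX1": {
        "cf1": [("q1", "String", 10)],
        "cf2": [("q1", "Int", 4), ("q2", "Double", 8)],
    },
    "IDX2": {"cf1": [("q5", "String", 20)]},
})
print(spec)
```

The resulting string would be passed, in single quotes, as the value of -Dtable.columns.index on the TableIndexer command line.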

SilentMing commented 9 years ago

Thanks for your kind answer, abhi-kr. I did it successfully.