Huawei-Hadoop / hindex

Secondary Index for HBase
Apache License 2.0

How to use hindex for scanning data? #49

Closed xuxc closed 10 years ago

xuxc commented 10 years ago

I have deployed Hadoop and hindex successfully, created a table, and inserted data; the index table also exists. So how do I scan for a specific qualifier that has an index? Something like: `get 'test','rowkey','Family:Qualifier','value'`?

hy2014 commented 10 years ago

Just like before: if you add a filter that includes the indexed column, hindex will use the index. No need to add any code in the client.

chrajeshbabu commented 10 years ago

No client changes are needed to make use of the index. Internally we have a filter evaluator that checks whether to use the index or not.

xuxc commented 9 years ago

Now I have used 3 filters to filter data, and all three columns have indexes, but for 2 million rows it takes almost 40 s. I'd like to know whether the indexes worked. @chrajeshbabu thank you.

anoopsjohn commented 9 years ago

Can you be a bit clearer about your table schema and query? How many rows of data do you have in total?

xuxc commented 9 years ago

The table has one CF, "info", with 17 columns under it, and I created an index on every column when I created the table. I use filters like this:

```java
List<Filter> filters = new ArrayList<Filter>();
Filter filter1 = new SingleColumnValueFilter(Bytes.toBytes("info"),
        Bytes.toBytes("style_No"), CompareOp.EQUAL, Bytes.toBytes("4674"));
filters.add(filter1);
Filter filter2 = new SingleColumnValueFilter(Bytes.toBytes("info"),
        Bytes.toBytes("country_No"), CompareOp.EQUAL, Bytes.toBytes("3871"));
filters.add(filter2);
FilterList filterList1 = new FilterList(filters);
Scan scan = new Scan();
scan.setFilter(filterList1);
```

hbase-site.xml is right:

```xml
<property><name>hbase.rootdir</name><value>hdfs://namenode:9000/hbase</value></property>
<property><name>hbase.cluster.distributed</name><value>true</value></property>
<property><name>hbase.master</name><value>hdfs://namenode:60000</value></property>
<property><name>hbase.tmp.dir</name><value>/home/hadoop/tmp/data</value></property>
<property><name>hbase.zookeeper.quorum</name><value>namenode,datanode1,datanode2</value></property>
<property><name>hbase.zookeeper.property.dataDir</name><value>${hbase.tmp.dir}/zookeeper</value></property>
<property><name>hbase.use.secondary.index</name><value>true</value></property>
<property><name>hbase.coprocessor.master.classes</name><value>org.apache.hadoop.hbase.index.coprocessor.master.IndexMasterObserver</value></property>
<property><name>hbase.coprocessor.region.classes</name><value>org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver</value></property>
<property><name>hbase.coprocessor.wal.classes</name><value>org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver</value></property>
```

It takes almost 40 seconds to get the result over 2 million rows. I'm afraid the indexes aren't working...

anoopsjohn commented 9 years ago

So the total data the scan covered is 2 million rows and the data satisfying the condition is less than that? Or is the actually fetched data 2 million rows? Just trying to understand the data size. How big is the cluster? How many regions in total? I assume you are using the default HFile block size, i.e. 64 KB; you can try reducing that.

xuxc commented 9 years ago

The cluster has just 3 nodes. Does hindex's 2nd filter select from the result set produced by the 1st filter, or does every filter do a full scan on the index table?

anoopsjohn commented 9 years ago

Neither filter does a full scan on the index table. Your indexed columns are of type String and you use equals conditions, so for the index table scan we create a start and stop row. As this query covers 2 indexes, we have 2 index scanners retrieving data (on the server side) simultaneously, and we AND them to find the matching data row keys. If there were a single index on both of these columns, that would be better anyway, just saying. Do you have any idea how long the above query takes when no index is used?
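The start/stop-row idea for an equality condition can be sketched as below. This is a conceptual illustration only, not hindex's actual index-table key layout (which also encodes region start keys, separators, and fixed-width value padding); the helper names are hypothetical. HBase compares row keys as unsigned bytes, so the sketch uses `Arrays.compareUnsigned`:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class IndexRangeSketch {
    // Conceptual index-row prefix: index name + separator + column value.
    static byte[] startRow(String indexName, String value) {
        return (indexName + "\u0000" + value).getBytes(StandardCharsets.UTF_8);
    }

    // Stop row: the start prefix with a trailing 0xFF, so the scan covers
    // every index row whose key begins with indexName + value.
    static byte[] stopRow(String indexName, String value) {
        byte[] start = startRow(indexName, value);
        byte[] stop = Arrays.copyOf(start, start.length + 1);
        stop[start.length] = (byte) 0xFF;
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = startRow("idx_country_No", "3871");
        byte[] stop = stopRow("idx_country_No", "3871");
        // An index row for a matching data row sorts inside [start, stop).
        byte[] indexRow = ("idx_country_No\u00003871\u0000row0001")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.compareUnsigned(start, indexRow) <= 0
                && Arrays.compareUnsigned(indexRow, stop) < 0); // prints "true"
    }
}
```

With one such range per index, the two scanners each touch only a narrow key range, and the server intersects (ANDs) the row keys they return.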

xuxc commented 9 years ago

Sorry for putting the question so unclearly >.< What I want to say is:

1. I first create 17 indexes, one on each column, and create the table.
2. I load 2 million rows into the table, and the indexes are built.
3. I query with filters through the Java API (as shown above).
4. I found it needs 40 seconds to get the results.

May I have a reference .java file? Maybe my code has something wrong...

anoopsjohn commented 9 years ago

So your total row count is 2 million. Can you tell me how many rows satisfy the above condition (col1=? AND col2=?)? Also, do you have any idea what time a normal full table scan (no index declared) takes?

xuxc commented 9 years ago

Fewer than 10 rows satisfy the above condition, and it spends 40 seconds getting results. All columns have indexes...

xuxc commented 9 years ago

`describe` output for the two tables (both ENABLED):

```
'qx', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',
BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
BLOCKCACHE => 'true'}

'qx_idx', {METHOD => 'table_att', MAX_FILESIZE => '9223372036854775807',
CONFIG => {'SPLIT_POLICY' =>
'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}},
{NAME => 'd', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false',
BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}
```

anoopsjohn commented 9 years ago

Only 10 rows and taking 40 secs seems too much! How many regions in total across these 3 nodes? I doubt whether the index is getting used...

Is your data distributed like this: many rows satisfy the col1 condition alone, many rows satisfy the col2 condition alone, and both together match at most 10?

xuxc commented 9 years ago

Yeah, I think so. Probably many rows satisfy one of the 2 conditions, but both together match at most 10. And there are 5 regions across the 3 nodes. So I want to know if the index works~

Using the filters:

```java
HTablePool pool = new HTablePool(configuration, 1000);
List<Filter> filters = new ArrayList<Filter>();
Filter filter1 = new SingleColumnValueFilter(Bytes.toBytes("info"),
        Bytes.toBytes("style_No"), CompareOp.EQUAL, Bytes.toBytes("4674"));
filters.add(filter1);
Filter filter2 = new SingleColumnValueFilter(Bytes.toBytes("info"),
        Bytes.toBytes("country_No"), CompareOp.EQUAL, Bytes.toBytes("3871"));
filters.add(filter2);
FilterList filterList1 = new FilterList(filters);
Scan scan = new Scan();
scan.setFilter(filterList1);
// ResultScanner rs = table.getScanner(scan); // hindex filter
ResultScanner rs = pool.getTable(tableName).getScanner(scan);
// {....code...}
```

May I have your email so I can send you some pics? @hy2014 @chrajeshbabu @anoopsjohn

anoopsjohn commented 9 years ago

anoop.hbase@gmail.com

xuxc commented 9 years ago

I got the point: the index name is on "contry_No", but when I loaded the data into HBase, the column name was "country_No"... BTW, I found an interesting thing. I created the filter:

```java
new SingleColumnValueFilter(Bytes.toBytes("info"),
        Bytes.toBytes("contry_No"), CompareOp.EQUAL, Bytes.toBytes("8600"));
```

while the actual column is "info:country", but after a full scan of the table it still got the results correctly!! Thank you, @anoopsjohn
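The root cause generalizes: HBase matches qualifiers byte-for-byte, not by "close enough" names, so an index created on a misspelled qualifier silently never applies to the data being written. A stdlib-only illustration (the helper is hypothetical, not part of the hindex API):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class QualifierCheck {
    // Qualifiers are compared as raw bytes; a one-letter difference
    // makes them entirely distinct columns.
    static boolean sameQualifier(String indexed, String written) {
        return Arrays.equals(indexed.getBytes(StandardCharsets.UTF_8),
                             written.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(sameQualifier("country_No", "country_No")); // prints "true"
        System.out.println(sameQualifier("contry_No", "country_No"));  // prints "false"
    }
}
```

So a worthwhile sanity check after creating an indexed table is to compare the qualifiers in the index definition against the qualifiers actually used in the Puts.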

anoopsjohn commented 9 years ago

So after correcting the name, how long does the query take with the index in use? I hope it will be much, much lower than 40 sec.

xuxc commented 9 years ago

Within 1 sec. hindex is so fast!! One more question: I get the ResultScanner rs in Dao.java and return it, and I want to show the data from rs in another page. But in Action.java, rs is not null, yet there is no Result r in it. Can a ResultScanner not be returned? Code as follows:

Dao.java:

```java
// ...
rs = pool.getTable(tableName).getScanner(scan);
return rs;
```

Action.java:

```java
QueryHbaseService qhsi = new QueryHbaseServiceImpl();
rs = qhsi.queryHbase(condMap); // calls Dao.java and returns rs
for (Result r : rs) {
    // code in the for clause is never executed
    for (KeyValue keyValue : r.raw()) {
        // ...
    }
}
```
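A likely cause (an assumption; the thread doesn't confirm it) is resource lifetime: a ResultScanner is a live cursor against the RegionServer, so if the DAO returns the pooled table or closes resources before the caller iterates, the scanner yields nothing. A safer DAO pattern is to materialize the rows before releasing the table. The sketch below models the pitfall with a minimal stand-in cursor class, not the real HBase API:

```java
import java.util.ArrayList;
import java.util.List;

public class ScannerLifetimeSketch {
    // Minimal stand-in for a server-backed cursor such as ResultScanner:
    // once closed, it stops yielding rows instead of failing loudly.
    static class Cursor {
        private final List<String> rows;
        private int pos = 0;
        private boolean closed = false;
        Cursor(List<String> rows) { this.rows = rows; }
        String next() {
            if (closed || pos >= rows.size()) return null;
            return rows.get(pos++);
        }
        void close() { closed = true; }
    }

    // Buggy DAO pattern: releases the cursor's resources, then returns it.
    static Cursor buggyDao(Cursor c) {
        c.close(); // analogous to closing/returning the table in the DAO
        return c;
    }

    // Safer DAO pattern: copy the rows out, then release resources.
    static List<String> safeDao(Cursor c) {
        List<String> out = new ArrayList<>();
        for (String r = c.next(); r != null; r = c.next()) out.add(r);
        c.close();
        return out;
    }

    public static void main(String[] args) {
        List<String> data = List.of("row1", "row2");
        System.out.println(buggyDao(new Cursor(data)).next()); // prints "null"
        System.out.println(safeDao(new Cursor(data)).size());  // prints "2"
    }
}
```

In real HBase code the equivalent fix is to iterate the ResultScanner inside the DAO, copy each Result into a `List<Result>` (or a DTO list), and return that list instead of the scanner.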

anoopsjohn commented 9 years ago

There should be no problem using the ResultScanner in another class it was passed to. Not sure what problem you are facing. Can you check the logs on the client and RegionServer side?

xuxc commented 9 years ago

hindex is perfect for indexing data in HBase. Further tests will be done. Thank you very much!