elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Multiple indexes setting for 'es.resource' #289

Closed mungeol closed 8 years ago

mungeol commented 10 years ago

Setting 'es.resource' = 'apache-2014.09.*/apache-access' or 'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access' does not work correctly for the HiveQL query 'select count(*) from test': the count it returns is wrong. The 'es.resource' setting should support multiple indexes, or at least produce an error message.

costin commented 10 years ago

Could you explain what the expectation is and what the actual result is? es-hadoop does minimal interpretation of the index/type and feeds the information directly to Elasticsearch.

mungeol commented 10 years ago

The Hive table I created looks like this:

CREATE EXTERNAL TABLE test (
  date timestamp,
  clientip string,
  request string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'apache-2014.09.29/apache-access',
  -- or --
  'es.resource' = 'apache-2014.09.30/apache-access',
  'es.mapping.names' = 'date:@timestamp'
);

I then ran 'select count(*) from test;', a Hive query that counts the total number of rows in the table. With a single index, the result matches the ES count: 1454536 for the apache-2014.09.29 index and 215564 for the apache-2014.09.30 index.

Next, I changed 'es.resource' = 'apache-2014.09.29/apache-access' to 'es.resource' = 'apache-2014.09.*/apache-access' or 'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access' to include multiple indexes, and ran 'select count(*) from test;' again to count the documents across both indexes. This time the result differs from the ES count: Hive returns 2919161, but it should be 1670100 (1454536 + 215564).
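The expected total is simply the sum of the per-index counts. A minimal Python sketch of that sanity check, using only the numbers reported in this comment (the function and variable names are illustrative and not part of es-hadoop):

```python
# Per-index document counts as reported above (taken from Elasticsearch).
per_index_counts = {
    "apache-2014.09.29": 1454536,
    "apache-2014.09.30": 215564,
}


def expected_total(counts):
    """Sum of the per-index counts: what a multi-index count should return."""
    return sum(counts.values())


def surplus(observed, counts):
    """Rows returned beyond the expected total (0 means the counts agree)."""
    return observed - expected_total(counts)


# Result of 'select count(*) from test' with both indexes in 'es.resource'.
hive_count = 2919161

print(expected_total(per_index_counts))        # 1670100
print(surplus(hive_count, per_index_counts))   # 1249061
```

A non-zero surplus means Hive saw more rows than Elasticsearch holds, which points at documents being read more than once rather than at missing data.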


environmental information

costin commented 10 years ago

Hi,

Sorry for the delay in picking this up. I've tried reproducing this but can't - maybe it has something to do with the dataset or potentially the way the counting is done. I'm not sure where that number (2919161) is coming from - clearly some data is being returned but not properly processed. Can you please confirm the following:

Thanks!

mungeol commented 10 years ago

Hi,

I created new indexes with a small amount of data to make testing easier. I indexed the same three documents into each of the cars-01 and cars-02 indexes, using the same type name, transactions.

POST /cars-01/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }

POST /cars-02/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }

'GET cars-01/_search?search_type=count' returns 3 hits
'GET cars-02/_search?search_type=count' returns 3 hits
'GET cars-*/_search?search_type=count' returns 6 hits

Then I created a table like the one below:

CREATE EXTERNAL TABLE cars (
  price bigint,
  color string,
  make string,
  sold timestamp
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'cars-*/transactions'
);

'select * from cars' and 'select count(*) from cars' return a different result every time.

example 1 (9 rows)

10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00

example 2 (10 rows)

10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
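Since each document exists exactly twice (once in cars-01, once in cars-02), every row should appear exactly twice in the Hive output. A small Python sketch that tallies the two example outputs makes the surplus copies visible; the helper name `excess_rows` is made up for illustration and the rows are transcribed from the examples above:

```python
from collections import Counter

# The three distinct documents indexed into each of cars-01 and cars-02.
A = ("10000", "red", "honda", "2014-10-28 00:00:00")
B = ("20000", "red", "honda", "2014-11-05 00:00:00")
C = ("30000", "green", "ford", "2014-05-18 00:00:00")

# Rows in the order Hive returned them in the two examples.
example_1 = [A, B, C, B, C, A, A, B, C]
example_2 = [A, B, C, A, A, A, B, C, B, C]


def excess_rows(rows, copies_expected=2):
    """Map each row to how many copies it has beyond the expected number."""
    return {
        row: seen - copies_expected
        for row, seen in Counter(rows).items()
        if seen != copies_expected
    }


for name, rows in (("example 1", example_1), ("example 2", example_2)):
    extra = excess_rows(rows)
    print(name, len(rows), "rows,", sum(extra.values()), "more than expected")
```

In example 1 every document shows up three times; in example 2 the first document shows up four times and the other two three times, so neither run matches the 6 hits Elasticsearch reports.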

I uploaded two logs:

es_hadoop.log contains the output after I added 'log4j.category.org.elasticsearch.hadoop.hive=TRACE' to the log4j.properties file.
hive.log contains the output after I changed 'hadoop.root.logger=INFO' to 'hadoop.root.logger=TRACE'.

Here are the links to both logs:
https://raw.githubusercontent.com/mungeol/log/master/es_hadoop.log
https://raw.githubusercontent.com/mungeol/log/master/hive.log

I hope this helps.

Thanks.

romanmar1 commented 9 years ago

I see this issue as well, but from Spark. I have two identical indices and a query that returns exactly 4 documents. If I set es.resource to "index1/mydoc", everything works as expected:

  1. The count of the RDD is 4.
  2. Collecting the partitions reveals a single partition of size 4.

If I set es.resource to "index1,index2/mydoc", things get a bit weird:

  1. The count is 16 (instead of the expected 8).
  2. There are two partitions, each of size 8, with the same documents.

Moving forward, if I add a third index, "index1,index2,index3/mydoc", the count becomes 36, with 3 identical partitions of size 12 each.
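The numbers above grow quadratically with the number of indices. One reading that fits them, offered here only as a model of the report and not as a confirmed diagnosis of es-hadoop internals, is that one partition is created per index but each partition reads every index listed in es.resource. A short Python sketch under that assumption (the function names are hypothetical):

```python
def expected_count(num_indices, hits_per_index=4):
    """Each matching document counted once per index it lives in."""
    return num_indices * hits_per_index


def reported_count(num_indices, hits_per_index=4):
    """Count under the assumed behaviour: one partition per index,
    with every partition scanning all indices in es.resource."""
    partitions = num_indices
    partition_size = num_indices * hits_per_index
    return partitions * partition_size


for n in (1, 2, 3):
    print(n, expected_count(n), reported_count(n))
```

For 1, 2 and 3 indices the model gives 4, 16 and 36, matching the counts and partition sizes reported here, against expected totals of 4, 8 and 12.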

costin commented 9 years ago

Folks, can you try the latest Beta (4) and see whether it addressed your issue? There have been several updates on this front.

Thanks,

mungeol commented 9 years ago

It is working now. Thanks,

costin commented 8 years ago

Closing the issue...