Closed mungeol closed 8 years ago
Could you explain what the expectation is and what is the actual result? es-hadoop does minimal interpretation of the index/type and feeds the information directly to Elasticsearch.
The Hive table I created looks like this:
CREATE EXTERNAL TABLE test (
    date timestamp,
    clientip string,
    request string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
    'es.resource' = 'apache-2014.09.29/apache-access',
    -- or -- 'es.resource' = 'apache-2014.09.30/apache-access',
    'es.mapping.names' = 'date:@timestamp'
);
I then ran the Hive query 'select count(*) from test;' to count the rows in the table. The result matches the Elasticsearch count: 1454536 for the apache-2014.09.29 index and 215564 for the apache-2014.09.30 index.
Next, I changed 'es.resource' = 'apache-2014.09.29/apache-access' to 'es.resource' = 'apache-2014.09.*/apache-access' or 'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access' to include multiple indexes. I ran 'select count(*) from test;' again to count the documents across both indexes, but the result differs from the Elasticsearch count: Hive returns 2919161, whereas it should be 1670100 (1454536 + 215564).
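The arithmetic behind the mismatch can be checked with a short sketch. It only uses the counts quoted in this report; the variable names are illustrative and not part of es-hadoop or Hive.

```python
# Per-index counts reported in this thread (Hive and Elasticsearch agree on these).
per_index_counts = {
    "apache-2014.09.29": 1454536,
    "apache-2014.09.30": 215564,
}

# Count that Hive returned once 'es.resource' pointed at both indexes.
reported_multi_index_count = 2919161

# What a correct multi-index read should return: the sum of the per-index counts.
expected_total = sum(per_index_counts.values())

# Rows returned beyond the expected total.
surplus = reported_multi_index_count - expected_total

print(f"expected total:   {expected_total}")
print(f"reported by Hive: {reported_multi_index_count}")
print(f"surplus rows:     {surplus}")
```

The surplus is not a whole multiple of either per-index count, so the numbers alone do not say which documents were read more than once.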
environmental information
Hi,
Sorry for the delay in picking this up. I've tried reproducing this but can't - maybe it has something to do with the dataset or potentially the way the counting is done.
I'm not sure where that number (2919161) is coming from - clearly some data is being returned but not properly processed.
Can you please confirm the following:
org.elasticsearch.hadoop package (as mentioned in the docs). There will be a lot of data so please be patient - archive the results and let me know where you have uploaded them.
Thanks!
Hi,
I created new indexes with a small amount of data to make testing easier. I indexed the same three documents into both the cars-01 and cars-02 indexes, using the same type name, transactions:
POST /cars-01/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }

POST /cars-02/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
'GET cars-01/_search?search_type=count' returns 3 hits
'GET cars-02/_search?search_type=count' returns 3 hits
'GET cars-*/_search?search_type=count' returns 6 hits
Then I created a table like this:
CREATE EXTERNAL TABLE cars (
    price bigint,
    color string,
    make string,
    sold timestamp
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
    'es.resource' = 'cars-*/transactions'
);
'select * from cars' and 'select count(*) from cars' return a different result every time.
example 1 (9 rows)
10000  red    honda  2014-10-28 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
10000  red    honda  2014-10-28 00:00:00
10000  red    honda  2014-10-28 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
example 2 (10 rows)
10000  red    honda  2014-10-28 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
10000  red    honda  2014-10-28 00:00:00
10000  red    honda  2014-10-28 00:00:00
10000  red    honda  2014-10-28 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
20000  red    honda  2014-11-05 00:00:00
30000  green  ford   2014-05-18 00:00:00
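To make the duplication in the two runs easier to see, here is a small sketch that tallies how often each row appears. The rows are copied from the two examples; since every document was indexed once into cars-01 and once into cars-02, a correct read should yield exactly two copies of each row (6 rows in total). The helper name `extra_copies` is made up for this illustration.

```python
from collections import Counter

# The three distinct documents, as (price, color, make, sold).
HONDA_10K = (10000, "red", "honda", "2014-10-28")
HONDA_20K = (20000, "red", "honda", "2014-11-05")
FORD_30K = (30000, "green", "ford", "2014-05-18")

# Rows in the order Hive returned them in each run.
example_1 = [HONDA_10K, HONDA_20K, FORD_30K, HONDA_20K, FORD_30K,
             HONDA_10K, HONDA_10K, HONDA_20K, FORD_30K]
example_2 = [HONDA_10K, HONDA_20K, FORD_30K, HONDA_10K, HONDA_10K,
             HONDA_10K, HONDA_20K, FORD_30K, HONDA_20K, FORD_30K]

# Each document lives in two indexes, so two copies per row are expected.
COPIES_EXPECTED = 2


def extra_copies(rows):
    """Return, per distinct row, how many copies exceed the expected number."""
    counts = Counter(rows)
    return {row: seen - COPIES_EXPECTED for row, seen in counts.items()}


for name, rows in (("example 1", example_1), ("example 2", example_2)):
    surplus = extra_copies(rows)
    print(name, "rows:", len(rows), "extra rows:", sum(surplus.values()))
    for row, extra in surplus.items():
        print("   ", row, "extra copies:", extra)
```

Both runs return every document at least once too often, and which rows are over-represented changes between runs.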
I uploaded two logs:
es_hadoop.log contains the output after I added 'log4j.category.org.elasticsearch.hadoop.hive=TRACE' to the log4j.properties file.
hive.log contains the output after I changed 'hadoop.root.logger=INFO' to 'hadoop.root.logger=TRACE'.
Here are the links to both logs:
https://raw.githubusercontent.com/mungeol/log/master/es_hadoop.log
https://raw.githubusercontent.com/mungeol/log/master/hive.log
I hope this helps.
Thanks.
I get this issue as well, but from a Spark perspective. I have two exactly identical indices. I use a query that returns exactly 4 documents. If I set es.resource to "index1/mydoc", everything works as expected:
If I set es.resource to "index1,index2/mydoc", things get a bit weird:
Moving forward, if I add a third index, "index1,index2,index3/mydoc", the count is 36, with 3 identical partitions of size 12 each.
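One reading that fits the Spark figures above is that each partition returns the matching documents from every listed index rather than only its own share. This is a hypothesis, not something confirmed in this thread, and the sketch below is a toy model rather than es-hadoop code; `predicted_count` and its parameters are hypothetical names, and "one partition per index" is an assumption taken from the "3 identical partitions" observation.

```python
def predicted_count(docs_per_index: int, num_indices: int) -> int:
    """Toy model of the over-counting.

    Assumptions (not verified against es-hadoop):
      * there is one partition per listed index;
      * every partition returns the matching documents of ALL listed indices.
    """
    partitions = num_indices
    docs_per_partition = docs_per_index * num_indices
    return partitions * docs_per_partition


# The query matches 4 documents per index, as stated in the report.
for n in (1, 3):
    print(f"{n} index(es): predicted count = {predicted_count(4, n)}")
```

For one index the model gives the correct count of 4, and for three indices it reproduces the reported 36 (3 partitions of 12). It does not explain the varying row counts in the Hive example earlier in the thread, so treat it as a rough illustration only.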
Folks, can you try the latest Beta (4) and see whether it addressed your issue? There have been several updates on this front.
Thanks,
It is working now. Thanks,
Closing the issue...
Setting 'es.resource' = 'apache-2014.09.*/apache-access' or 'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access' does not work correctly with the HiveQL query 'select count(*) from test': the count is wrong. The 'es.resource' setting should support multiple indexes or, at the very least, produce an error message.