elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Use _source instead of fields parameter to support nested objects. #130

Closed nahap closed 10 years ago

nahap commented 10 years ago

Until now, elasticsearch-hadoop has used the fields parameter in its query to Elasticsearch to select the fields backing the Hive columns (or their aliases). This worked fine up to Elasticsearch 1.0.0.Beta2, but stopped working with Elasticsearch 1.0.0.RC1 for fields that have a nested structure (I am not talking about nested mappings; I realize those are not supported yet). If categories is a map or any other kind of nested structure, the query fails with an exception (ElasticsearchIllegalArgumentException[field [categories] isn't a leaf field]). To solve this problem, it makes sense to use the _source parameter instead. I am creating a pull request that solves the problem for me (Hive only), but I am not sure it is feature complete; treat it as an example for you to look at.
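To make the two request styles concrete, here is a minimal sketch in Python, assuming the request-body form of the search API in which either a `fields` list or a `_source` filter list selects what is returned. The function name `build_search_body` is hypothetical and is not part of elasticsearch-hadoop; the sketch only shows the shape of the two requests, not how the connector builds them.

```python
import json
from typing import Any, Dict, List


def build_search_body(es_fields: List[str], use_source: bool) -> Dict[str, Any]:
    """Build a match-all search body that selects the given ES fields.

    use_source=False selects via `fields` (leaf fields only on 1.0.0.RC1,
    per the exception quoted above); use_source=True selects via `_source`,
    which can also return whole objects such as `categories`.
    """
    body: Dict[str, Any] = {"query": {"match_all": {}}}
    key = "_source" if use_source else "fields"
    body[key] = list(es_fields)
    return body


wanted = ["id", "categories"]
print(json.dumps(build_search_body(wanted, use_source=False)))
print(json.dumps(build_search_body(wanted, use_source=True)))
```

The only difference between the two bodies is the key that carries the field list, which is why switching parameters is a small change on the request side; the larger difference is in how the response values come back.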

costin commented 10 years ago

Can you post an example (Es mapping and Hive table)?

nahap commented 10 years ago

Here is a simplified form. Example document (using the default mapping; I defined no mapping myself):

 {
    "id": 1230912,
    "min_price": 10,
    "name": "bla",
    "categories": {
        "0": [ [ 1, 720, 758, 781 ], [ 2, 3, 4 ] ],
        "1": [ [ 284 ] ]
    },
    "default_variant": {
        "attributes": {
            "attributes_1": [ 49, 53 ]
        }
    }
 }

And the Hive table:

CREATE EXTERNAL TABLE external_sources.elastic_search_products(
    product_id BIGINT,
    min_price BIGINT,
    name STRING,
    virtual_categories MAP<STRING,ARRAY<ARRAY>>,
    default_variant MAP<STRING,MAP<STRING,ARRAY>>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'searching/product/_search?q=*',
              'es.nodes'='my_ip_address',
              'es.port'='9200',
              'es.nodes.discovery' = 'false',
              'es.mapping.names' = 'virtual_categories:categories,product_id:id,default_variant:default_variant.attributes.attributes_1');
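As an illustration of what the `es.mapping.names` value above expresses, the following Python sketch parses the `hive_column:es.field` pairs and resolves each dotted path against the example document. The helpers `parse_mapping_names` and `resolve_path` are hypothetical names introduced here; this is not how elasticsearch-hadoop implements the lookup, only a way to see which value each Hive column points at.

```python
from typing import Any, Dict


def parse_mapping_names(spec: str) -> Dict[str, str]:
    """Parse 'hive_col:es.field,hive_col2:es.field2' into a dict."""
    mapping: Dict[str, str] = {}
    for pair in spec.split(","):
        hive_col, es_field = pair.strip().split(":", 1)
        mapping[hive_col.strip()] = es_field.strip()
    return mapping


def resolve_path(doc: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path through nested dicts; None if any key is absent."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


doc = {
    "id": 1230912,
    "min_price": 10,
    "name": "bla",
    "categories": {"0": [[1, 720, 758, 781], [2, 3, 4]], "1": [[284]]},
    "default_variant": {"attributes": {"attributes_1": [49, 53]}},
}

mapping = parse_mapping_names(
    "virtual_categories:categories,product_id:id,"
    "default_variant:default_variant.attributes.attributes_1"
)
row = {col: resolve_path(doc, field) for col, field in mapping.items()}
print(row["product_id"])       # 1230912
print(row["default_variant"])  # [49, 53]
```

Here `virtual_categories` resolves to an object (a non-leaf value), while `default_variant` resolves to a list of integers at the end of the dotted path.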

costin commented 10 years ago

To confirm - are you using M1 or master? By the looks of it, it seems you are using M1 - I highly recommend trying out master. Since you are not creating the mapping - what's your workflow? Do you index the data through other means and are trying to read it in Hive or do you do write/read entirely through Hive?

nahap commented 10 years ago

I am using master and building it myself. In my current workflow, indexing happens through a RabbitMQ river, and we only read with Hive. We have another Elasticsearch server that has a mapping for the same data, but it uses the nested mapping, which would fail because it is not supported in elasticsearch-hadoop. So we write the documents into RabbitMQ and from there feed them to two different servers: one with a mapping (used for the live site) and one without (only for reading in Hive).

Thanks, Andy

costin commented 10 years ago

Just to clarify, the issue in your case is that when using a mapping with dot notation (hivecolumn=some.es.field), the fields are not properly retrieved? Is that correct?

nahap commented 10 years ago

There are actually two problems:

  1. When I use the dot notation (see the example default_variant.attributes.attributes_1), the fields are not retrieved.
  2. If I want to retrieve the whole map for the same field (for example default_variant), I get the error ElasticsearchIllegalArgumentException[field [default_variant] isn't a leaf field].

https://github.com/elasticsearch/elasticsearch-hadoop/pull/131 shows how I got it to work using _source instead.

nahap commented 10 years ago

This is roughly what the generated Elasticsearch queries look like for the two cases:

  1. /searching/product/_search?q=*&fields=default_variant.attributes.attributes_1
    1. /searching/product/_search?q=*&pretty=true&fields=default_variant

costin commented 10 years ago

I'm looking into what I think the issue is but as a side note, maybe it's the late hour on my end but it's hard to follow your posts; I'm not sure what you mean by your last comment:

  1. /searching/product/_search?q=*&fields=default_variant.attributes.attributes_1
    1. /searching/product/_search?q=*&pretty=true&fields=default_variant

The markdown is intended to help with formatting but it seems in this case, it has the opposite effect.

costin commented 10 years ago

Fwiw, I've found the fields issue and am looking into resolving it; thanks for reporting it. The nested field declaration, however, seems to be working fine - try using it on a 'leaf' key. And by the way, your Hive declaration is incorrect - default_variant.attributes.attributes_1 in your example points to a list (a Hive array, not a map).

nahap commented 10 years ago

I'm sorry I confused you. That last comment was an example of the GET requests to Elasticsearch with the fields parameter; it wasn't really necessary. Do you need more information from me? If you try the example document and the Hive mapping from this issue, you will see both problems occur.

nahap commented 10 years ago

Sorry, it was also late for me :) You are right, it should have been an array.

costin commented 10 years ago

Hi,

I've pushed a fix to master which fixes this issue for Cascading, Hive and Pig and works against both ES 1.0.RC1 or higher and ES 0.90. Please try it out and let me know whether it works for you as well.

Thanks!

nahap commented 10 years ago

nice! fast reaction. I will try it out tomorrow. thanks a lot.

nahap commented 10 years ago

Hi costin, unfortunately this is still not working properly. When I post the following document (default mappings):

curl -XPUT 'http://test.com:9200/test_index/product/1' -d '{
    "id": 1,
    "min_price": 3490,
    "default_variant": {
        "ean": "4049502068848",
        "id": 5130405,
        "default": false,
        "price": 3490,
        "old_price": 0,
        "images": [],
        "quantity": 6,
        "attributes": {
            "attributes_172": [ 1967 ],
            "attributes_173": [ 2295 ],
            "attributes_1": [ 15 ],
            "attributes_0": [ 846 ],
            "attributes_2": [ 42 ],
            "attributes_206": [ 2432 ]
        },
        "retail_price": 0
    },
    "brand_id": 846,
    "active": false,
    "max_price": 3490
}'

with the following Hive table:

CREATE EXTERNAL TABLE test.products(
    product_id BIGINT,
    brand_id BIGINT,
    active BOOLEAN,
    min_price BIGINT,
    max_price BIGINT,
    default_variant ARRAY<INT>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_index/product/_search?q=*',
              'es.nodes'='test.com',
              'es.port'='9200',
              'es.nodes.discovery' = 'false',
              'es.mapping.names' = 'product_id:id,default_variant:default_variant.attributes.attributes_1');

I expect default_variant in the Hive table to contain an array of ints, not a single int. This happens because you unwrap an array with a single element in JDKValueReader. But when using _source, Elasticsearch does not wrap the values in an array, as it does with fields, so the unwrapping also needs to depend on whether you are using the _source or the fields parameter.

If I post the following (notice the difference in default_variant.attributes.attributes_1, which now has two elements), I get the int array as expected, because it is not unwrapped.

curl -XPUT 'http://10.10.61.10:9200/test_index/product/1' -d '{
    "id": 1,
    "min_price": 3490,
    "default_variant": {
        "ean": "4049502068848",
        "id": 5130405,
        "default": false,
        "price": 3490,
        "old_price": 0,
        "images": [],
        "quantity": 6,
        "attributes": {
            "attributes_172": [ 1967 ],
            "attributes_173": [ 2295 ],
            "attributes_1": [ 15,16 ],
            "attributes_0": [ 846 ],
            "attributes_2": [ 42 ],
            "attributes_206": [ 2432 ]
        },
        "retail_price": 0
    },
    "brand_id": 846,
    "active": false,
    "max_price": 3490
}'

Thanks, Andy
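The behaviour described in this comment can be sketched in a few lines of Python. This is an illustration only: `unwrap_always` and `unwrap_if_fields` are hypothetical names, and the code is not taken from JDKValueReader. It assumes, as the comment states, that values returned through `fields` arrive wrapped in a list while values read from `_source` arrive as they were indexed.

```python
from typing import Any


def unwrap_always(value: Any) -> Any:
    """Unwrap any single-element list, regardless of where it came from."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value


def unwrap_if_fields(value: Any, from_fields: bool) -> Any:
    """Unwrap only values that came from a `fields` response.

    Values read from `_source` are returned untouched, so a real
    one-element array in the document stays an array.
    """
    if from_fields:
        return unwrap_always(value)
    return value


one_item = [15]       # attributes_1 in the first document
two_items = [15, 16]  # attributes_1 in the second document

print(unwrap_always(one_item))                        # 15
print(unwrap_always(two_items))                       # [15, 16]
print(unwrap_if_fields(one_item, from_fields=False))  # [15]
```

With unconditional unwrapping, the one-element array collapses to a bare integer while the two-element array survives, which matches the difference between the two documents above. The sketch does not address how a `fields` response would distinguish a scalar from a one-element array.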

nahap commented 10 years ago

Another thing: if, using the Hive table above, I post another document with default_variant missing:

curl -XPUT 'http://10.10.61.10:9200/test_index/product/2' -d '{
    "id": 2,
    "min_price": 3490,
    "brand_id": 846,
    "active": false,
    "max_price": 3490
}'

then I get a NullPointerException:

14/02/04 11:57:25 ERROR CliDriver: Failed with exception java.io.IOException:java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:551)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1472)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:190)
    at org.elasticsearch.hadoop.hive.EsSerDe.deserialize(EsSerDe.java:91)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:535)
    ... 14 more
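As a rough Python analogy for this failure (not the EsSerDe code, whose internals are not shown in this thread), the sketch below reads a mapped column from two documents with different structures and guards against the absent value before touching it. The names `column_value` and `array_length` are hypothetical.

```python
from typing import Any, Dict, Optional


def column_value(source: Dict[str, Any], es_key: str) -> Optional[Any]:
    """Return the value for es_key, or None when the document lacks it."""
    return source.get(es_key)


def array_length(value: Optional[list]) -> Optional[int]:
    """Length of an array column; None (a NULL column) when the value is absent."""
    if value is None:
        return None
    return len(value)


doc_1 = {"id": 1, "default_variant": [15]}
doc_2 = {"id": 2, "min_price": 3490}  # no default_variant at all

for doc in (doc_1, doc_2):
    value = column_value(doc, "default_variant")
    print(doc["id"], value, array_length(value))
```

The point of the guard is that a document missing a mapped field yields a NULL column for that row instead of aborting the whole query.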

costin commented 10 years ago

I'm afraid I can't reproduce your issue - take a look at HiveSearchJsonTest#loadNestedField - the document contains a nested array of integers which is returned as is. Also, I'm not clear on what exactly is failing in your tests - you describe what you think the issue is in es-hadoop but not the incorrect behaviour. If I understand correctly, if the array contains only one element you get an exception? If so, can you post what that is? Does this happen on 1.0.0.RC1+ or 0.90?

P.S. To increase the readability of your post, please see the markdown help: https://help.github.com/articles/github-flavored-markdown

nahap commented 10 years ago

Sorry for the confusing report; I guess I am overworked.

If you create an index with only this document: (type: product)

 {
    "id": 1,
    "default_variant": {
        "attributes": {
            "attributes_1": [
                15
            ]
        }
    }
 }

And the following Hive definition:

CREATE EXTERNAL TABLE test.products(
    product_id BIGINT,
    default_variant ARRAY<INT>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_index/product/_search?q=*',
              'es.nodes'='test.com',
              'es.port'='9200',
              'es.nodes.discovery' = 'false',
              'es.mapping.names' = 'product_id:id,default_variant:default_variant.attributes.attributes_1');

Then I call select * from test.products; and get the following exception:

14/02/04 13:29:11 ERROR CliDriver: Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:551)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1472)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
    at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:149)
    at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:193)
    at org.elasticsearch.hadoop.hive.EsSerDe.deserialize(EsSerDe.java:91)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:535)
    ... 14 more

I am using only 1.0.0.RC1 and a build of elasticsearch-hadoop from the current master branch.

costin commented 10 years ago

Hi,

Could you please try the latest master? It should fix the issue of wrapping/unwrapping (when dealing with arrays of only one item) but also the NPE in case you have documents with different structures under the same index.

Thanks!

nahap commented 10 years ago

Great! it works like a charm now, thanks!

costin commented 10 years ago

Excellent - can you confirm whether both the NPE and the array wrapping/unwrapping are fixed?

Cheers!

nahap commented 10 years ago

Yes, both problems are fixed.

costin commented 10 years ago

Great - thanks again for reporting this and taking the time to bear with me to get to the root of the problem!

Cheers again!

costin commented 10 years ago

@nahap to confirm - have you encountered any other issues with master? I'd like to cut a release today and want to make sure I'm not missing anything.

Cheers!

nahap commented 10 years ago

Actually yes, I do have another issue, but I won't be able to report it properly today. I was planning on reporting it tomorrow, but the gist of it is that when you define a multi-field mapping with no default field, you get a NullPointerException when selecting from the table. I will write a detailed issue tomorrow. Sorry that I can't make it today. Greets, Andy

costin commented 10 years ago

Hi,

I've been trying to reproduce the issue to no avail. Can you please file a new report, with the ES mapping (curl definitions are even better) and the Hive script + failing error?

Thanks!