Closed nahap closed 10 years ago
Can you post an example (Es mapping and Hive table)?
Here is a simplified form. Example document (using all default mappings, no mapping defined by me): { "id": 1230912, "min_price": 10, "name": "bla", "categories": { "0": [ [ 1, 720, 758, 781 ], [ 2, 3, 4 ] ], "1": [ [ 284 ] ] }, "default_variant": { "attributes": { "attributes_1": [ 49, 53 ] } } }
and the Hive table:
CREATE EXTERNAL TABLE external_sources.elastic_search_products(
product_id BIGINT,
min_price BIGINT,
name STRING,
virtual_categories MAP<STRING,ARRAY<ARRAY
To confirm - are you using M1 or master? By the looks of it, it seems you are using M1 - I highly recommend trying out master. Since you are not creating the mapping - what's your workflow? Do you index the data through other means and are trying to read it in Hive or do you do write/read entirely through Hive?
I am using master and building it myself. In my current workflow, indexing happens over a RabbitMQ river and we only read with Hive. We have another Elasticsearch server that has a mapping for the same data, but it uses the nested mapping, which would fail because it is not supported in elasticsearch-hadoop. So we write the documents into RabbitMQ and from there feed them to two different servers: one with the mapping (used for the live site) and one without a mapping (only for reading in Hive).
Thanks, Andy
Just to clarify, the issue in your case is that when using a mapping with dot notation (hivecolumn=some.es.field), the fields are not properly retrieved? Is that correct?
there are actually two problems:
This is sort of what the Elasticsearch query created looks like: for
I'm looking into what I think the issue is, but as a side note (maybe it's the late hour on my end), it's hard to follow your posts; I'm not sure what you mean by your last comment:
• /searching/product/search?q=_&fields=defaultvariant.attributes.attributes1
• /searching/product/search?q=&pretty=true&fields=defaultvariant
The markdown is intended to help with formatting, but in this case it seems to have the opposite effect.
Fwiw, I've found the fields issue and am looking into resolving it; thanks for reporting it. The nested field declaration, however, seemed to be working fine - try using it on a 'leaf' key.
And by the way, your Hive declaration is incorrect - default_variant.attributes.attributes_1 in your example points to a list (or Hive array), not a map.
I'm sorry I confused you. This last comment was an example of the GET requests to Elasticsearch with the fields parameter. It wasn't really necessary. Do you need more information from me? If you try the example document and the Hive mapping in the issue, you will see both problems occurring.
Sorry, it was also late for me :) You are right, it should have been an array.
Hi,
I've pushed a fix to master which fixes this issue for Cascading, Hive and Pig and works against both ES 1.0.RC1 or higher and ES 0.90. Please try it out and let me know whether it works for you as well.
Thanks!
nice! fast reaction. I will try it out tomorrow. thanks a lot.
Hi Costin, unfortunately this is still not working properly. When I post the following document (default mappings): curl -XPUT 'http://test.com:9200/test_index/product/1' -d '{ "id": 1, "min_price": 3490, "default_variant": { "ean": "4049502068848", "id": 5130405, "default": false, "price": 3490, "old_price": 0, "images": [], "quantity": 6, "attributes": { "attributes_172": [ 1967 ], "attributes_173": [ 2295 ], "attributes_1": [ 15 ], "attributes_0": [ 846 ], "attributes_2": [ 42 ], "attributes_206": [ 2432 ] }, "retail_price": 0 }, "brand_id": 846, "active": false, "max_price": 3490 }'
with the following Hive table:
CREATE EXTERNAL TABLE test.products(
product_id BIGINT,
brand_id BIGINT,
active BOOLEAN,
min_price BIGINT,
max_price BIGINT,
default_variant ARRAY
If I post the following (notice the difference in default_variant.attributes.attributes_1, which has two elements now), I get the int array as expected, because it is not unwrapped.
curl -XPUT 'http://10.10.61.10:9200/test_index/product/1' -d '{ "id": 1, "min_price": 3490, "default_variant": { "ean": "4049502068848", "id": 5130405, "default": false, "price": 3490, "old_price": 0, "images": [], "quantity": 6, "attributes": { "attributes_172": [ 1967 ], "attributes_173": [ 2295 ], "attributes_1": [ 15,16 ], "attributes_0": [ 846 ], "attributes_2": [ 42 ], "attributes_206": [ 2432 ] }, "retail_price": 0 }, "brand_id": 846, "active": false, "max_price": 3490 }'
Thanks Andy
Another thing: if, using the Hive table above, I post another document with "default_variant" missing: curl -XPUT 'http://10.10.61.10:9200/test_index/product/2' -d '{ "id": 2, "min_price": 3490, "brand_id": 846, "active": false, "max_price": 3490 }'
then I get a NullPointerException:
14/02/04 11:57:25 ERROR CliDriver: Failed with exception java.io.IOException:java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:551)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1472)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:190)
at org.elasticsearch.hadoop.hive.EsSerDe.deserialize(EsSerDe.java:91)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:535)
... 14 more
I'm afraid I can't reproduce your issue - take a look at HiveSearchJsonTest#loadNestedField - the document contains a nested array of integers which is returned as is.
Also, I'm not clear on what exactly is failing in your tests - you describe what you think the issue is in es-hadoop but not the incorrect behaviour.
If I understand correctly, if the array contains only one element you get an exception? If so can you post what that is? Does this happen on 1.0.0.RC1+ or 0.90?
P.S. To increase the readability of your post, please see the markdown help: https://help.github.com/articles/github-flavored-markdown
Sorry for the confusing report, I guess I am overworked.
If you create an index with only this document: (type: product)
{
"id": 1,
"default_variant": {
"attributes": {
"attributes_1": [
15
]
}
}
}
And the following Hive definition:
CREATE EXTERNAL TABLE test.products(
product_id BIGINT,
default_variant ARRAY<INT>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test_index/product/_search?q=*',
'es.nodes'='test.com',
'es.port'='9200',
'es.nodes.discovery' = 'false',
'es.mapping.names' = 'product_id:id,default_variant:default_variant.attributes.attributes_1');
Then, when I call select * from test.products; I get the following exception:
14/02/04 13:29:11 ERROR CliDriver: Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:551)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:489)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:136)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1472)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:271)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.ArrayWritable
at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:149)
at org.elasticsearch.hadoop.hive.EsSerDe.hiveFromWritable(EsSerDe.java:193)
at org.elasticsearch.hadoop.hive.EsSerDe.deserialize(EsSerDe.java:91)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:535)
... 14 more
I am using only 1.0.0.RC1 and a build of elasticsearch-hadoop from the current master branch.
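To make the expected result of the report above concrete, here is a minimal Python sketch of how the es.mapping.names value pairs each Hive column with a dotted Elasticsearch path. It is only an illustration, not the elasticsearch-hadoop implementation; parse_mapping_names and resolve_path are hypothetical helper names introduced here.

```python
def parse_mapping_names(value):
    """Split 'col:es.path,col2:es.path2' into a dict of Hive column -> ES path."""
    mapping = {}
    for pair in value.split(","):
        column, _, path = pair.strip().partition(":")
        mapping[column.strip()] = path.strip()
    return mapping


def resolve_path(document, path):
    """Walk a dotted path through nested dicts; return None if any key is missing."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# The mapping string and document are taken from the report above.
names = parse_mapping_names(
    "product_id:id,default_variant:default_variant.attributes.attributes_1"
)
doc = {"id": 1, "default_variant": {"attributes": {"attributes_1": [15]}}}
row = {column: resolve_path(doc, path) for column, path in names.items()}
print(row)
```

For this document the default_variant column resolves to the one-element list [15], which is what the ARRAY<INT> column declares. A document without default_variant resolves to None for that column in this sketch.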
Hi,
Could you please try the latest master? It should fix the issue of wrapping/unwrapping (when dealing with arrays of only one item) but also the NPE in case you have documents with different structures under the same index.
Thanks!
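The two symptoms discussed in this thread (a one-element array arriving as a bare scalar, and a missing field causing an NPE) can be illustrated with a small Python sketch. The actual fix in elasticsearch-hadoop is not shown in this thread, so this is just the general idea under simplifying assumptions; coerce_to_declared_type and its type strings are hypothetical.

```python
def coerce_to_declared_type(value, declared_type):
    """Shape a value read from a document to match the declared column type.

    declared_type is a simplified, hypothetical type string such as
    'array<int>' or 'bigint'; real Hive type handling is more involved.
    """
    if value is None:
        # Missing field in this document: keep it as NULL instead of failing.
        return None
    is_array_column = declared_type.lower().startswith("array<")
    if is_array_column and not isinstance(value, list):
        # A lone scalar arrived for an array column: wrap it in a list.
        return [value]
    return value


single = coerce_to_declared_type(15, "ARRAY<INT>")
several = coerce_to_declared_type([15, 16], "ARRAY<INT>")
missing = coerce_to_declared_type(None, "ARRAY<INT>")
scalar = coerce_to_declared_type(3490, "BIGINT")
```

The point of the sketch is that the declared column type, not the number of elements in the value, decides whether the result is a list.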
Great! it works like a charm now, thanks!
Excellent - can you confirm whether both the NPE and the array wrapping/unwrapping are fixed?
Cheers!
Yes, both problems are fixed.
Great - thanks again for reporting this and taking the time to bear with me to get to the root of the problem!
Cheers again!
@nahap to confirm - have you encountered any other issues with master? I'd like to cut a release today and want to make sure I'm not missing anything.
Cheers!
Actually yes, I do have another issue, but I won't be able to report it properly today. I was planning on reporting it tomorrow, but the gist of it is that when you define a multifield mapping with no default field, you get a NullPointerException when selecting from the table. I will write a detailed issue tomorrow. Sorry that I can't make it today. Greets, Andy
Hi,
I've been trying to reproduce the issue to no avail. Can you please file a new report, with the ES mapping (curl definitions are even better) and the Hive script + failing error?
Thanks!
Until now, elasticsearch-hadoop has used the fields parameter in the query to Elasticsearch to choose the fields used in the columns (or the aliases thereof). This worked fine up to Elasticsearch 1.0.0.Beta2, but stopped working with Elasticsearch 1.0.0.RC1 for fields that have a nested structure (I am not talking about the nested mapping; I realize that is not supported yet). It causes an exception (ElasticsearchIllegalArgumentException[field [categories] isn't a leaf field]}) if categories is a map or any other kind of nested structure. To solve this problem, it makes sense to use the _source parameter instead. I am creating a pull request that solves the problem for me (Hive only), but I am not sure it is feature complete; it is more of an example for you to look at.
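As a rough illustration of the proposal, the following Python sketch builds a search request body that selects columns through _source filtering rather than fields. It assumes the source filtering option of the Elasticsearch 1.0 search API; build_search_body is a hypothetical helper, and this is not the code from the pull request.

```python
import json


def build_search_body(paths):
    """Build a match_all search body that limits the returned _source to the given paths."""
    return {
        "query": {"match_all": {}},
        # _source filtering accepts non-leaf paths such as 'categories',
        # which is the case the fields parameter rejects in the report above.
        "_source": list(paths),
    }


body = build_search_body(["categories", "default_variant.attributes.attributes_1"])
print(json.dumps(body, indent=2))
```

Because the filtered _source keeps the original document structure, the reader still has to walk the nested objects to reach each column value.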