elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Incompatible types found in multi-mapping #1074

Closed EdyKnopfler closed 6 years ago

EdyKnopfler commented 7 years ago

What kind of issue is this?

Feature description

Once upon a time I needed to change the mapping of some fields. Luckily, I always put a date suffix in my index names (myindex-yyyy.mm.dd), so I could change the mapping from one day to the next. I had no problems querying Elasticsearch directly, but using elasticsearch-spark I got this message:

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Incompatible types found in multi-mapping: Field [offset] has conflicting types of [LONG] and [TEXT].

It occurs when I use a wildcard in the index names: myindex-*

My request is: if the library can detect conflicting types, maybe it could just emit a warning and return the field data as an Object. If the field is not important to a query (e.g. for filtering), it could come back to me as-is.
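To illustrate the failure mode, here is a minimal sketch in plain Python (the mapping dicts and function are hypothetical, not the connector's actual code): a wildcard read forces the connector to merge the per-index mappings into a single schema, and a field mapped as long in one daily index and text in another has no single resolved type.

```python
def merge_mappings(mappings):
    """Merge {field: type} mappings from several indices; raise on conflict."""
    merged = {}
    for index_name, fields in mappings.items():
        for field, es_type in fields.items():
            if field in merged and merged[field] != es_type:
                raise ValueError(
                    "Incompatible types found in multi-mapping: Field [%s] "
                    "has conflicting types of [%s] and [%s]."
                    % (field, merged[field], es_type))
            merged[field] = es_type
    return merged

# One daily index maps "offset" as long, a later one as text:
daily = {
    "myindex-2017.01.01": {"offset": "LONG", "message": "TEXT"},
    "myindex-2017.01.02": {"offset": "TEXT", "message": "TEXT"},
}
```

Calling `merge_mappings(daily)` raises a ValueError analogous to the EsHadoopIllegalArgumentException above, because "offset" was remapped between days.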

jbaiera commented 7 years ago

This is a jam that came about with our recent fix to how we merge mappings. Previously, when running against multiple indices, only ONE mapping would be selected from the available mappings (at random, no less) and applied to the data. Fields not in that mapping would be dropped, and data that did not match it would throw exceptions.

That said, we now collect all mappings and attempt to resolve them into one view, since any integration that uses strict typing (like SparkSQL, Hive, etc.) needs them resolved. Perhaps it makes sense to build out a field type promotion hierarchy: if you have a long and a text field, merge them with a warning, using text as the more general field type. Thoughts?

dmarkhas commented 6 years ago

@jbaiera I think it would be better to follow Kibana's behavior - fields with conflicting mapping should be ignored (with a logged warning).

jbaiera commented 6 years ago

@dmarkhas I feel like ignoring the field would be a good default response. I'd still like to give users a bit more control than that, though. Perhaps it makes sense to let users list the fields they expect to have conflicts and denote how to resolve each conflict.
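The behavior discussed above can be sketched in a few lines of plain Python (a hypothetical API, not the connector's actual one): conflicting fields are dropped with a warning by default, unless the user names a winning type per field.

```python
import warnings

def resolve_conflicts(merged_types, overrides=None):
    """merged_types: {field: set of types seen across indices}.

    Fields with a single type are kept as-is. Conflicting fields are
    dropped with a warning, unless the user supplied a type in `overrides`.
    """
    overrides = overrides or {}
    resolved = {}
    for field, types in merged_types.items():
        if len(types) == 1:
            resolved[field] = next(iter(types))
        elif field in overrides:
            resolved[field] = overrides[field]
        else:
            warnings.warn("Ignoring field [%s] with conflicting types %s"
                          % (field, sorted(types)))
    return resolved
```

With no overrides, `{"offset": {"LONG", "TEXT"}}` is dropped (Kibana-style); with `overrides={"offset": "TEXT"}` the field survives as text.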

StephenOTT commented 6 years ago

@jbaiera Field type promotion hierarchy would be great!

Would be great to see this at the logstash level as well.

XciD commented 6 years ago

Any update on this? Could we build the mapping while taking the excluded fields into account, in order to avoid the exception?
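As a possible workaround when the conflicting field is not needed at all, the connector exposes a read-time field filter, `es.read.field.exclude` (check whether your connector version supports it). A sketch, assuming pyspark and the index pattern from the original report:

```python
# Hypothetical workaround sketch: drop the conflicting "offset" field
# at read time so the mapping merge never sees it.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.read.field.exclude", "offset")
      .load("myindex-*"))
```

This is a configuration fragment only; it assumes an existing `spark` session configured against the cluster.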

daniel-wepunkt commented 6 years ago

I'm also facing this issue during migration from ES5 to ES6 in a production use case. Would be glad to hear about any news concerning this issue. Thanks!

Srinathc commented 6 years ago

Facing this issue with ES 5.5.1 and Spark 2.1.0. +1 to field type promotion.

mkaranasou commented 6 years ago

+1 ES 5.6.5 and Spark 2.3.0 (pyspark)

peasfarmer commented 6 years ago

+1, same problem here. Reindexing all indices is a very hard job; it takes a very, very long time.

stupidxian commented 6 years ago

Facing this issue with ES 6.0.0 and Spark 2.2.0. +1 to field type promotion.

stupidxian commented 6 years ago

I have an immature solution. Comment out the following code and it will no longer check the mapping types. You can do this when you don't need the field that has the incompatible types; the program will then execute normally.

git

Jrmyy commented 6 years ago

@starrysky-xian , I tried your solution, but when trying to get data from ES, I got as an error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ArrayIndexOutOfBoundsException: 1
stupidxian commented 6 years ago

@Jrmyy I know a little about Python. I think you may be reading data from the field that has the incompatible types. Please mail me the detailed exception.

jbaiera commented 6 years ago

This is finally done - moving forward with a type hierarchy that determines the best common type to cast each field to. As it stands, almost all numeric types can be cast to their higher-precision counterparts, and they can also be cast to string types. All textual data types can be cast to keyword types. Structural types like objects or nested fields will not be supported for casting at this time, as they are quite a bit more complex.
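The promotion rules described above can be sketched in plain Python (simplified type names, not the connector's actual implementation; the exact fallback type for numeric-vs-text conflicts is an assumption here):

```python
# Rank within the numeric family: a lower rank casts to a higher one.
NUMERIC_RANK = {"byte": 0, "short": 1, "integer": 2, "long": 3,
                "half_float": 4, "float": 5, "double": 6}
TEXTUAL = {"keyword", "text"}

def promote(a, b):
    """Return the best common type for two field types, or None when the
    conflict cannot be resolved (e.g. structural types like object/nested)."""
    if a == b:
        return a
    # Numeric types widen to their higher-precision counterpart.
    if a in NUMERIC_RANK and b in NUMERIC_RANK:
        return a if NUMERIC_RANK[a] >= NUMERIC_RANK[b] else b
    # Numeric and textual types can both be represented as keyword.
    if {a, b} <= NUMERIC_RANK.keys() | TEXTUAL:
        return "keyword"
    return None  # structural types are not promoted
```

Under this sketch the original report's long/text conflict on "offset" resolves to a string type instead of throwing, which is the behavior the fix moves toward.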