elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Dots in field names exception #853

Open jimmyjones2 opened 8 years ago

jimmyjones2 commented 8 years ago

spark-1.6.2-bin-hadoop2.6, elasticsearch-5.0.0-beta1, elasticsearch-hadoop-5.0.0-beta1

curl -XPOST localhost:9200/test4/test -d '{"b":0,"e":{"f.g":"hello"}}'
./bin/pyspark --driver-class-path=../elasticsearch-hadoop-5.0.0-beta1/dist/elasticsearch-hadoop-5.0.0-beta1.jar
>>> df1 = sqlContext.read.format("org.elasticsearch.spark.sql").load("test4/test")
>>> df1.printSchema()
root
 |-- b: long (nullable = true)
 |-- e: struct (nullable = true)
 |    |-- f: struct (nullable = true)
 |    |    |-- g: string (nullable = true)

>>> df1.show()
---8<--- snip ---8<--- 
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'e.f.g' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:45)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:14)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:94)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
jbaiera commented 8 years ago

Thanks for opening this. I was able to reproduce it, and it's definitely a bug. When the document is inserted, the mapping on the type recognizes the f.g field as a nested subfield, producing the following mapping:

"mappings" : {
  "test" : {
    "properties" : {
      "b" : {
        "type" : "long"
      },
      "e" : {
        "properties" : {
          "f" : {
            "properties" : {
              "g" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

When we discover the mapping in ES-Hadoop, we correctly parse a schema as follows:

StructType(
  StructField(b,LongType,true),
  StructField(e,StructType(StructField(f,StructType(StructField(g,StringType,true)),true)),true)
)

However, when reading the values from the scroll, the original source value is used:

{
  "_index" : "test4",
  "_type" : "test",
  "_id" : "AVdvSwQTnlURBA5E_yk9",
  "_score" : 1.0,
  "_source" : {
    "b" : 0,
    "e" : {
      "f.g" : "hello"
    }
  }
}

During document parsing, the parser gets confused because it cannot find a schema field named f.g. The reader will have to be updated to handle dots in field names by splitting each dotted name into nested map layers.
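In rough terms, the transformation needed looks like this (a Python sketch for illustration only; the actual fix would live in the connector's Scala/Java ScrollReader):

import json

# Expand dotted keys in a parsed _source document into nested objects,
# so the structure lines up with the schema discovered from the mapping.
# (Sketch only: does not handle collisions such as "f" and "f.g" both
# appearing at the same level.)
def expand_dots(doc):
    result = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = expand_dots(value)
        parts = key.split(".")
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result

print(json.dumps(expand_dots({"b": 0, "e": {"f.g": "hello"}})))
# {"b": 0, "e": {"f": {"g": "hello"}}}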

sandeep-telsiz commented 7 years ago

Is there a workaround for this issue?

jbaiera commented 7 years ago

@sandeep-telsiz unfortunately this is not a straightforward fix. The parsing code for reading scrolls needs to be completely rebuilt to support this. Rest assured that this is a big item on our radar that we hope to tackle soon.

liveangel-js commented 5 years ago

same issue

danielyahn commented 5 years ago

+1

liansghuaifan commented 4 years ago

+1

vladimirkore commented 4 years ago

+1

upupfeng commented 4 years ago

same issue

SouhailBenAli1 commented 4 years ago

+1! Any workaround for this issue?

vsethi13 commented 4 years ago

Any workaround in the meantime? @jbaiera

jbaiera commented 4 years ago

As mentioned above, this is not a straightforward change to make, as it will require a large rewrite of the document parsing code. One possible workaround is configuring the library to read data from Elasticsearch as raw JSON and performing the parsing yourself before operating on the data. Unfortunately, this workaround is only feasible in MR and Spark, where you can run arbitrary code on the data.
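For example, in PySpark something along these lines should work (a sketch only; the exact key/value classes may differ between versions, and es.output.json tells the connector to return each hit's _source as a raw JSON string instead of parsing it):

import json

conf = {
    "es.resource": "test4/test",
    "es.output.json": "true",
}
# With es.output.json=true the connector skips its own document parsing,
# so the dotted field names never reach the schema lookup that fails above.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf=conf,
)

# Parse the raw JSON yourself and normalize the dotted keys before
# handing the data to Spark SQL.
docs = rdd.map(lambda kv: json.loads(kv[1]))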

robinsonmhj commented 4 years ago

+1

arjansh commented 4 years ago

+1

QuentinAdt commented 3 years ago

+1

masseyke commented 2 years ago

As @jbaiera has said, this one is complicated. When we write {"f.g":"hello"} as a field, Elasticsearch treats f as an object field and g as a text field within it, so the mapping is the same as if you had written {"f":{"g":"hello"}}, and internally it is indexed the same way. But Elasticsearch keeps the original source, so when es-hadoop queries the data it gets back the same {"f.g":"hello"} that was written. Right now the parser blows up on that.

I believe we can fix the parser to handle this, but that means effectively translating the source from {"f.g":"hello"} to {"f":{"g":"hello"}}, because that is the structure Spark expects given what the mapping declares. It also means that if we write the data back into Elasticsearch, we write the changed source. The changed source means the same thing to Elasticsearch, but it could break an application that depends on the exact format of the source. So "fixing" this one (at least without some more thought) might actually do more harm than not fixing it.

masseyke commented 2 years ago

After discussing with the team, we decided that es-hadoop's behavior would be dangerous if it supported dots in field names, because es-hadoop would be silently rewriting _source data. So I've put up a PR to document that we do not support them, and to provide a better error message.

masseyke commented 2 years ago

Here is the ticket where dot support was added (back) to Elasticsearch -- https://github.com/elastic/elasticsearch/issues/15951.

masseyke commented 2 years ago

I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.
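For reference, a rough sketch of that approach (assuming the dot_expander ingest processor is what's meant; exact URLs and processor availability depend on your Elasticsearch version):

curl -XPUT 'localhost:9200/_ingest/pipeline/expand-dots' -H 'Content-Type: application/json' -d '{
  "description": "Rewrite dotted field names into object structures at ingest time",
  "processors": [
    { "dot_expander": { "path": "e", "field": "f.g" } }
  ]
}'
curl -XPOST 'localhost:9200/test4/test?pipeline=expand-dots' -H 'Content-Type: application/json' -d '{"b":0,"e":{"f.g":"hello"}}'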

asapegin commented 2 years ago

> I'm closing this as one that we will intentionally not fix. Unfortunately the safest option is to use something like the Dot Processor to convert your field names with dots to object structures.

Sorry, but what do you mean by "convert your field names with dots"?! Fields with dots are YOUR (Elastic's) standard field names, defined in the ECS Field Reference (https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html). 99% of the field names there contain dots.

asapegin commented 2 years ago

But then all related SIEM functionality in Elastic Security would need to be converted as well, including rules, detections, alerts, the .siem-signals index, etc.

jbaiera commented 2 years ago

I am going to go ahead and re-open this, since it seems like this "problem" of dots in field names is less a "problem" and more just where things are trending in the data integration space. It would be unwise of us to ignore this issue given recent developments across existing solutions.

That said, this issue is not an easy fix and requires adjusting invariants that we have treated very carefully over the years, most notably that _source is sacred and should only be changed judiciously. Additionally, the document update logic will likely need looking at (just try running a partial document update using normalized JSON in the request against a document containing dotted field names).
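To make that concern concrete, a hypothetical illustration (reusing the document ID from the hit above; the endpoint syntax varies by version): a partial update merges at the JSON level, so normalized JSON does not overwrite the dotted spelling:

curl -XPOST 'localhost:9200/test4/test/AVdvSwQTnlURBA5E_yk9/_update' -H 'Content-Type: application/json' -d '{
  "doc": { "e": { "f": { "g": "updated" } } }
}'
# _source may now contain {"b":0,"e":{"f.g":"hello","f":{"g":"updated"}}},
# i.e. the same mapped field spelled two ways with two different values.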

tsikerdekis commented 7 months ago

Any updates on this issue? I have ingestors that pull data off of Suricata's eve.json, which contains dotted fields used by all sorts of default and non-default Kibana dashboards. I can't just rename them, and I'm not sure how to stop PySpark from complaining about these fields. This makes elasticsearch-hadoop unusable unless someone has found a workaround.