elastic / elasticsearch-hadoop

:elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

Nested objects fail parsing in Spark SQL when empty objects present #2157

Closed jbaiera closed 10 months ago

jbaiera commented 10 months ago

We keep track of which field we are currently parsing in the org.elasticsearch.spark.sql.ScalaRowValueReader#readValue method:

https://github.com/elastic/elasticsearch-hadoop/blob/4a14860391d00716a5225804a4c71c46a5633162/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/ScalaEsRowValueReader.scala#L39-L46

When reading an array of objects, though, the current field is overwritten between row creations. We get around this in the create array method by stashing the array's row order on the call stack:

https://github.com/elastic/elasticsearch-hadoop/blob/4a14860391d00716a5225804a4c71c46a5633162/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/ScalaEsRowValueReader.scala#L76C55-L89

When we create a Row object, we check the current field. If we're in an array and the current field has no row order of its own, we use the stashed row order. If the current field does have one, we use it instead, because we're probably creating a sub-object under the array:

https://github.com/elastic/elasticsearch-hadoop/blob/4a14860391d00716a5225804a4c71c46a5633162/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/ScalaEsRowValueReader.scala#L60-L69
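The selection rule described above can be sketched in a few lines. This is a simplified Python model, not the connector's Scala code: `choose_row_order`, its parameters, and the `row_columns` schema (including the `subkey` column under `nested.object`) are hypothetical names and data made up for illustration.

```python
from typing import Dict, List


def choose_row_order(current_field: str,
                     in_array: bool,
                     stashed_row_order: List[str],
                     row_columns: Dict[str, List[str]]) -> List[str]:
    """Pick the column order for the Row that is about to be created."""
    if not in_array:
        # Outside an array the current field is trusted as-is.
        return row_columns[current_field]
    if current_field in row_columns:
        # The current field has its own row order, so it is assumed
        # to be a sub-object under the array element.
        return row_columns[current_field]
    # Otherwise fall back to the order stashed when the array was opened.
    return stashed_row_order


# Hypothetical schema: column order per object-typed field.
row_columns = {"nested": ["key", "object"], "nested.object": ["subkey"]}
stashed = row_columns["nested"]

print(choose_row_order("nested.key", True, stashed, row_columns))     # ['key', 'object']
print(choose_row_order("nested.object", True, stashed, row_columns))  # ['subkey']
```

Note that the rule relies entirely on the current field being up to date: any field name that happens to have a row order wins over the stashed one.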

Unfortunately, if we are parsing a nested document whose last field is an empty object, that field's name remains in the parser's current field variable. When the next object in the array is created, it picks up the column list of the previous empty object, which results in downstream serialization issues:

{
  "nested": [          // Current field: `nested`
    {                  // Current field: `nested` (creates map for `nested`)
      "key": "value",  // Current field: `nested.key`
      "object": {}     // Current field: `nested.object` (creates map for `nested.object`)
    },
    {                  // Current field: `nested.object` (creates map for `nested.object` but should have created map for `nested`)
      "key": "value"
    }
  ]
}

This isn't a problem if the object has fields, because the underlying fields won't have object mappings unless they too are empty objects:

{
  "nested": [             // Current field: `nested`
    {                     // Current field: `nested` (creates map for `nested`)
      "key": "value",     // Current field: `nested.key`
      "object": {         // Current field: `nested.object` (creates map for `nested.object`)
        "subkey": "value" // Current field: `nested.object.subkey`
      }
    },
    {                     // Current field: `nested.object.subkey` (creates map for `nested` using stashed row order because `nested.object.subkey` has no column order data)
      "key": "value"
    }
  ]
}
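
The two walkthroughs above can be replayed with a toy model of the parser state. This is a rough Python sketch under stated assumptions, not the connector's implementation: `ToyRowReader`, `ROW_COLUMNS`, `row_orders`, and the two sample documents are hypothetical, and the schema assumes the mapping declares a `subkey` column under `nested.object`.

```python
from typing import Any, Dict, List, Optional

# Hypothetical schema: column order for each object-typed field.
ROW_COLUMNS: Dict[str, List[str]] = {
    "nested": ["key", "object"],
    "nested.object": ["subkey"],
}


class ToyRowReader:
    def __init__(self) -> None:
        self.current_field: Optional[str] = None  # overwritten on every value read
        self.in_array = False
        self.array_row_order: List[str] = []
        self.created: List[List[str]] = []  # row order chosen for each Row

    def create_row(self) -> None:
        # Prefer the current field's own row order; inside an array,
        # fall back to the stashed order when the field has none.
        if self.in_array and self.current_field not in ROW_COLUMNS:
            self.created.append(self.array_row_order)
        else:
            self.created.append(ROW_COLUMNS[self.current_field])

    def read_value(self, field: str, value: Any) -> None:
        self.current_field = field
        if isinstance(value, dict):
            self.read_object(field, value)
        elif isinstance(value, list):
            self.read_array(field, value)

    def read_object(self, prefix: str, obj: Dict[str, Any]) -> None:
        self.create_row()
        for name, value in obj.items():
            self.read_value(f"{prefix}.{name}", value)

    def read_array(self, field: str, items: List[Any]) -> None:
        previous = (self.in_array, self.array_row_order)  # stash on the call stack
        self.in_array, self.array_row_order = True, ROW_COLUMNS.get(field, [])
        for item in items:
            if isinstance(item, dict):
                # current_field is NOT reset between array elements.
                self.read_object(field, item)
        self.in_array, self.array_row_order = previous


def row_orders(doc: Dict[str, Any]) -> List[List[str]]:
    reader = ToyRowReader()
    for name, value in doc.items():
        reader.read_value(name, value)
    return reader.created


EMPTY_DOC = {"nested": [{"key": "value", "object": {}}, {"key": "value"}]}
FILLED_DOC = {"nested": [{"key": "value", "object": {"subkey": "value"}}, {"key": "value"}]}

print(row_orders(EMPTY_DOC))   # [['key', 'object'], ['subkey'], ['subkey']]
print(row_orders(FILLED_DOC))  # [['key', 'object'], ['subkey'], ['key', 'object']]
```

In the first document the third Row (the second array element) is built with the column order of `nested.object`, which is the failure described here. In the second document the last field read is `nested.object.subkey`, which has no row order, so the stashed order for `nested` is used.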