elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Synthetic source is wrong for arrays of flattened fields containing subfields longer than ignore_above #112044

Closed lkts closed 1 month ago

lkts commented 2 months ago

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

bundled

OS Version

x

Problem Description

Synthetic source is wrong for arrays of flattened fields that contain subfield values longer than ignore_above.

Steps to Reproduce

PUT /my_index
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "n": {
        "type": "flattened",
        "ignore_above": 4
      }
    }
  }
}

PUT my_index/_doc/1
{
  "n": [
    {
      "foo": "bar"
    },
    {
      "foo": "bazzzzzzz"
    }
  ]
}

GET my_index/_doc/1

Produces:

"_source": {
    "n": {
        "foo": "bazzzzzzz"
    }
}

Logs (if relevant)

No response

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)

lkts commented 1 month ago

It looks like the flattened field type in general does not correctly represent arrays in synthetic source. Synthetic source is generated from the doc_values of the keyed field (._keyed), which encode the path to a field inside the object together with its value in a single byte array. The problem is that during parsing, all object fields are simply added to the keyed field, even when multiple objects are parsed in one document (e.g. with arrays, or with object arrays one level up in the document). When synthetic source is constructed, the doc_values are retrieved and written out, resulting in one giant object that combines the fields from all indexed flattened values.
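To make the merging concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not Elasticsearch code: the helper names (`keyed_entries`, `index_document`, `synthesize`) are hypothetical, and the `\0` separator between path and value is an assumption made for illustration.

```python
# Simplified model of the behaviour described above; not Elasticsearch code.
SEPARATOR = "\0"  # assumed separator between the key path and the value


def keyed_entries(obj, prefix=""):
    """Yield one 'path<SEP>value' string per leaf value of a flattened object."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from keyed_entries(value, path)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    yield from keyed_entries(item, path)
                else:
                    yield f"{path}{SEPARATOR}{item}"
        else:
            yield f"{path}{SEPARATOR}{value}"


def index_document(values):
    """Collect the keyed entries of every object into one sorted set.

    Nothing records which object an entry came from, so the boundaries
    between array elements are lost at this point.
    """
    if isinstance(values, dict):
        values = [values]
    doc_values = set()
    for obj in values:
        doc_values.update(keyed_entries(obj))
    return sorted(doc_values)


def synthesize(doc_values):
    """Rebuild a single object from the keyed entries."""
    result = {}
    for entry in doc_values:
        path, value = entry.split(SEPARATOR, 1)
        *parents, leaf = path.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        if leaf in node:
            existing = node[leaf]
            node[leaf] = existing + [value] if isinstance(existing, list) else [existing, value]
        else:
            node[leaf] = value
    return result


original = [{"a": "a"}, {"b": "b"}]
rebuilt = synthesize(index_document(original))
print(rebuilt)  # {'a': 'a', 'b': 'b'}
```

In this model, the two array elements come back as one object because the set of keyed entries carries no marker for where one object ends and the next begins.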

Some options:

  1. Disallow indexing multiple flattened values, as we do for other field types. That seems quite strict, but it is not clear what the real usage of that is.
  2. Always set synthetic_source_keep to arrays for flattened fields. However, that does not solve the problem of arrays being present at a higher level in the document.
  3. Always use fallback synthetic source for flattened. Not ideal for disk space, given that we already have most of the data in doc_values. If it's rare enough, it could be okay.
  4. Change the doc_values format somehow to add the missing information needed to distinguish different objects (not clear how).
  5. Document it and leave it as is, since it does not impact reindexing. Still needs a fix to synthetic source to handle ignore_above correctly.

lkts commented 1 month ago

Example (with current state of code):

Expected: 

{
    "field": [
        {
            "KOGtOnvgpw": "PHU",
            "gcvFjmPHFd": "KwbkSyLlC"
        },
        {
            "BYliNOBHKM": {
                "XVaROQmSKP": "dYfCP",
                "ZaApOr": [
                    "1074156",
                    "1129404",
                    "1204799",
                    "1348011",
                    "1723590",
                    "183559",
                    "448895"
                ]
            },
            "CnxMeelQhJ": "P",
            "kPUVedTaPY": "CWOLm"
        }
    ]
}

but: was 

{
    "field": {
        "BYliNOBHKM": {
            "XVaROQmSKP": "dYfCP",
            "ZaApOr": [
                "1074156",
                "1129404",
                "1204799",
                "1348011",
                "1723590",
                "183559",
                "448895"
            ]
        },
        "CnxMeelQhJ": "P",
        "KOGtOnvgpw": "PHU",
        "gcvFjmPHFd": "KwbkSyLlC",
        "kPUVedTaPY": "CWOLm"
    }
}

Note how KOGtOnvgpw is a field of a separate object but gets merged into the single object in synthetic source.

lkts commented 1 month ago

Or a repro:

PUT my-index
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "f": {
        "type": "flattened"
      }
    }
  }
}

GET my-index

POST my-index/_bulk?refresh
{ "create": {} }
{ "f": [ { "a": "a" }, { "b": "b" } ] }

POST my-index/_search
--------------------------
"f": {
    "a": "a",
    "b": "b"
}

salvatore-campagna commented 1 month ago

I think this is just how synthetic source handles arrays of objects: "arrays are moved to leaves". See https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source-modifications-leaf-arrays
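
As a rough illustration of what "arrays are moved to leaves" means, the following Python sketch applies that transformation to a plain dict. It is only a model of the documented effect, not the Elasticsearch implementation; the function name `move_arrays_to_leaves` is hypothetical, and the sketch assumes each list holds either only objects or only scalar values.

```python
def move_arrays_to_leaves(value):
    """Merge each list of objects into one object whose leaves hold the lists."""
    if isinstance(value, dict):
        return {key: move_arrays_to_leaves(child) for key, child in value.items()}
    if isinstance(value, list) and value and all(isinstance(item, dict) for item in value):
        # Gather the values of each key across all objects in the list.
        collected = {}
        for item in value:
            for key, child in item.items():
                collected.setdefault(key, []).append(child)
        # A key seen once stays a single value; a key seen several times becomes a list.
        return {
            key: move_arrays_to_leaves(children[0] if len(children) == 1 else children)
            for key, children in collected.items()
        }
    return value


print(move_arrays_to_leaves({"foo": [{"bar": 1}, {"bar": 2}]}))  # {'foo': {'bar': [1, 2]}}
print(move_arrays_to_leaves({"f": [{"a": "a"}, {"b": "b"}]}))    # {'f': {'a': 'a', 'b': 'b'}}
```

The second call uses the document from the repro above and yields the same shape as the reported search result.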