elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fields API does not reflect actual state of indexed documents when ignore_above changes #80495

Open timroes opened 3 years ago

timroes commented 3 years ago

While testing, I found some potentially odd behavior in the fields API, and I would like to clarify whether it is intended.

Create a simple index & index a document:

PUT ignored_above
{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      }
    }
  }
}

PUT ignored_above/_doc/1
{
  "name": "Douglas Adams"
}

If you request data from that index using the following query, the result looks as expected:

GET ignored_above/_search
{
  "fields": ["*"]
}

// Returns one hit looking like:
{
  "_index" : "ignored_above",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "name" : "Douglas Adams"
  },
  "fields" : {
    "name" : [
      "Douglas Adams"
    ]
  }
}

Now change the ignore_above setting of this field:

PUT ignored_above/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "ignore_above": 5
    }
  }
}

Executing the same _search as above will now return a different result:

{
  "_index" : "ignored_above",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "name" : "Douglas Adams"
  },
  "ignored_field_values" : {
    "name" : [
      "Douglas Adams"
    ]
  }
}

It seems that the name value for this document, being longer than 5 characters, is no longer returned in the fields section but in ignored_field_values instead, which suggests that the value was not indexed. That is not true, though: you can still search by it perfectly well, since, as far as I understand, changing ignore_above on a field does not change anything about the documents that are already indexed:

GET ignored_above/_search
{
  "fields": ["*"],
  "query": {
    "match": {
      "name": "Douglas Adams"
    }
  }
}

This returns the same document as above, even though that result suggests that the name field has no indexed value, only an ignored one. I personally found this behavior a bit confusing, since I thought one of the intentions of the fields API was to give us better insight into the actual "indexed" state of a document, which it does not do in this case.

Especially confusing is that this field, despite its value being returned under ignored_field_values rather than fields, is not listed under _ignored (since it is not actually ignored). If you index the same document a second time (after changing ignore_above, so that the value now actually is ignored at index time), you end up with two documents, where the first one has no _ignored section but still has a value under ignored_field_values (one that is not actually ignored):

{
  "hits" : [
    {
      "_index" : "ignored_above",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "name" : "Douglas Adams"
      },
      "ignored_field_values" : {
        "name" : [
          "Douglas Adams"
        ]
      }
    },
    {
      "_index" : "ignored_above",
      "_id" : "2",
      "_score" : 1.0,
      "_ignored" : [
        "name"
      ],
      "_source" : {
        "name" : "Douglas Adams"
      },
      "ignored_field_values" : {
        "name" : [
          "Douglas Adams"
        ]
      }
    }
  ]
}

Is this behavior intended and, if so, is it documented somewhere?

elasticmachine commented 3 years ago

Pinging @elastic/es-search (Team:Search)

jtibshirani commented 3 years ago

@timroes this is indeed surprising behavior: the _ignored and ignored_field_values sections can give conflicting information. I'm wondering if we should check the _ignored section when creating the ignored_field_values response: if the field is not present in _ignored, we can omit it from ignored_field_values. @markharwood do you have any thoughts on this idea?

In our team discussion, we also raised the question of whether we should really allow ignore_above to be updated. Elasticsearch's behavior around ignored values would be simpler if this parameter couldn't be changed. This would be a bigger and more long-term discussion than the idea above.

timroes commented 3 years ago

Thanks for the clarification. Just to make sure I understand the initial thought correctly: when you say that a field should not be in ignored_field_values when it is not also in _ignored, would that mean, for this specific case where ignore_above has changed, that it would still appear under fields, since the value actually is indexed?

jtibshirani commented 3 years ago

Yes, in this case the value would be included in fields instead of ignored_field_values.
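The idea discussed above can be illustrated with a small client-side sketch in Python. This is not Elasticsearch's implementation; `reconcile_ignored` is a hypothetical helper that post-processes a search hit shaped like the ones in this thread, moving any field that appears in ignored_field_values but not in _ignored back under fields.

```python
from copy import deepcopy


def reconcile_ignored(hit):
    """Return a copy of `hit` whose ignored_field_values only lists
    fields that are also named in _ignored (hypothetical helper)."""
    result = deepcopy(hit)
    ignored_names = set(result.get("_ignored", []))
    ignored_values = result.get("ignored_field_values", {})

    for field in list(ignored_values):
        if field not in ignored_names:
            # Not ignored at index time: report the values under "fields".
            values = ignored_values.pop(field)
            result.setdefault("fields", {}).setdefault(field, []).extend(values)

    # Drop the section entirely if nothing is left in it.
    if not ignored_values:
        result.pop("ignored_field_values", None)
    return result


# Document 1: indexed before ignore_above was set, so it has no _ignored entry.
hit_1 = {
    "_id": "1",
    "_source": {"name": "Douglas Adams"},
    "ignored_field_values": {"name": ["Douglas Adams"]},
}

# Document 2: indexed after ignore_above was set, so "name" is in _ignored.
hit_2 = {
    "_id": "2",
    "_ignored": ["name"],
    "_source": {"name": "Douglas Adams"},
    "ignored_field_values": {"name": ["Douglas Adams"]},
}

fixed_1 = reconcile_ignored(hit_1)
fixed_2 = reconcile_ignored(hit_2)
```

With these inputs, document 1 ends up with its value under fields, while document 2 keeps it under ignored_field_values.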

markharwood commented 3 years ago

do you have any thoughts on this idea?

TL;DR: We can only work with the current rules

We don't store enough data about historical ingest failures

Reverse-engineering why certain values weren't ingested is hard. All we have is a list of ignored field names and the JSON source, with potentially many values in arrays (some good values, some bad). Figuring out which of the values might have been rejected, and why, is difficult, especially if you are allowed to change the mapping rules after ingest. All we have to work with are the current validation rules and the source.

Fields retrieval logic runs values through the current rules.

The support for ignored_field_values simply hooked into the existing try...catch...ignore section of code in the fields API, where it retrieves and parses values from source at query time. We just picked up the values that were otherwise being silently dropped in the ignore part of the exception handling and added those bad values to the results in the new ignored_field_values section.
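A rough Python simulation of the mechanism described above may help show why the result depends only on the current mapping. It is a sketch, not the actual Elasticsearch code: `parse_keyword`, `fetch_fields`, and `IgnoredValueError` are hypothetical names, and the length check with `len()` is a simplifying assumption about how ignore_above is applied.

```python
class IgnoredValueError(ValueError):
    """Raised when a value fails the current mapping rules (hypothetical)."""


def parse_keyword(value, ignore_above=None):
    # Assumption: ignore_above is modelled as a simple character-count limit.
    if ignore_above is not None and len(value) > ignore_above:
        raise IgnoredValueError(value)
    return value


def fetch_fields(source, mapping):
    """Re-parse _source values at query time using the CURRENT mapping."""
    fields, ignored = {}, {}
    for name, props in mapping.items():
        if name not in source:
            continue
        raw = source[name]
        values = raw if isinstance(raw, list) else [raw]
        for value in values:
            try:
                parsed = parse_keyword(value, props.get("ignore_above"))
            except IgnoredValueError:
                # Previously dropped silently; now reported as ignored.
                ignored.setdefault(name, []).append(value)
            else:
                fields.setdefault(name, []).append(parsed)
    return fields, ignored


source = {"name": "Douglas Adams"}

# Mapping as it was when the document was indexed.
before = fetch_fields(source, {"name": {"type": "keyword"}})

# Mapping after the ignore_above update; the stored source is unchanged.
after = fetch_fields(source, {"name": {"type": "keyword", "ignore_above": 5}})
```

The same source yields the value under fields with the original mapping and under the ignored values with the updated one, because nothing in this path records what happened at index time.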

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)