elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Normalize the sort missing values #43338

Open fax4ever opened 5 years ago

fax4ever commented 5 years ago

Describe the feature:

Normalize the sort missing values if a normalizer is defined for the field.

Elasticsearch version (bin/elasticsearch --version):

7.0.0

Plugins installed: []

JVM version (java -version):

java version "1.8.0_171" Java(TM) SE Runtime Environment (build 1.8.0_171-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux new-host 5.0.6-100.fc28.x86_64 #1 SMP Wed Apr 3 16:14:34 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

This concerns the missing values provided in a sort clause. If a normalizer is defined on the field, those values are not normalized.

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) index creation, mappings, settings, query etc. The easier you make for us to reproduce it, the more likely that somebody will take the time to look at it.

  1. Map a keyword field on which a normalizer has been defined
  2. Add some values, some of which are null
  3. Query with a sort on that field, providing a custom missing value. You will find that the normalizer has not been applied to the provided missing value.

Provide logs (if relevant):

(1) The Mapping 
Executed Elasticsearch HTTP PUT request to path '/indexname'. 
Request body: <
{
  "settings": {
    "analysis": {
      "normalizer": {
        "DefaultAnalysisDefinitions_lowercase": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "byType_normalizedString": {
        "type": "keyword",
        "index": true,
        "norms": false,
        "doc_values": true,
        "store": false,
        "normalizer": "DefaultAnalysisDefinitions_lowercase"
      }
    },
    "dynamic": "strict"
  }
}
>

(2) The Indexing 
Executed Elasticsearch HTTP POST request to path '/_bulk'
Request body: <
{
  "index": {
    "_index": "indexname",
    "_id": "2"
  }
}
{
  "byType_normalizedString": "george"
}
{
  "index": {
    "_index": "indexname",
    "_id": "1"
  }
}
{
  "byType_normalizedString": "Cecilia"
}
{
  "index": {
    "_index": "indexname",
    "_id": "3"
  }
}
{
  "byType_normalizedString": "Stefany"
}
{
  "index": {
    "_index": "indexname",
    "_id": "empty"
  }
}
{}
>

(3) Force refresh

(4) The Querying
Executed Elasticsearch HTTP POST request to path '/indexname/_search' with query parameters {size=10000, track_total_hits=true}
Request body: <
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "byType_normalizedString": {
        "order": "asc",
        "missing": "Daniel"
      }
    }
  ]
}
>. 
Response body: <
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "hits": [
      {
        "_index": "indexname",
        "_type": "_doc",
        "_id": "empty",
        "_source": {},
        "sort": [
          "Daniel"
        ]
      },
      {
        "_index": "indexname",
        "_type": "_doc",
        "_id": "1",
        "_source": {
          "byType_normalizedString": "Cecilia"
        },
        "sort": [
          "cecilia"
        ]
      },
      {
        "_index": "indexname",
        "_type": "_doc",
        "_id": "2",
        "_source": {
          "byType_normalizedString": "george"
        },
        "sort": [
          "george"
        ]
      },
      {
        "_index": "indexname",
        "_type": "_doc",
        "_id": "3",
        "_source": {
          "byType_normalizedString": "Stefany"
        },
        "sort": [
          "stefany"
        ]
      }
    ]
  }
}
>
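
To make the ordering in the response above easier to follow, here is a minimal Python sketch that simulates it outside Elasticsearch. It assumes keyword sort values are compared byte by byte (so uppercase ASCII letters sort before lowercase ones) and uses `str.lower()` as a stand-in for the `lowercase` filter of the normalizer. The `sort_ids` helper and its `normalize_missing` flag are hypothetical names introduced only for this illustration.

```python
# Sort values as stored in the index: the lowercase normalizer was applied
# at index time. The document with id "empty" has no value for the field.
indexed = {"1": "cecilia", "2": "george", "3": "stefany", "empty": None}


def sort_ids(docs, missing, normalize_missing):
    """Return document ids in ascending order, substituting `missing`
    for absent values (hypothetical helper, not an Elasticsearch API)."""
    # Assumption: str.lower() approximates the "lowercase" token filter.
    substitute = missing.lower() if normalize_missing else missing

    def key(item):
        value = item[1] if item[1] is not None else substitute
        # Assumption: terms are compared by their UTF-8 bytes.
        return value.encode("utf-8")

    return [doc_id for doc_id, _ in sorted(docs.items(), key=key)]


# Behavior reported in this issue: "Daniel" is used as-is.
actual = sort_ids(indexed, "Daniel", normalize_missing=False)
# Behavior requested: "Daniel" is normalized to "daniel" first.
expected = sort_ids(indexed, "Daniel", normalize_missing=True)

print(actual)    # ['empty', '1', '2', '3']
print(expected)  # ['1', 'empty', '2', '3']
```

With the raw value, "Daniel" sorts before "cecilia" because of its uppercase first letter, which matches the response body above. With the normalized value, the empty document would land between "cecilia" and "george".
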
elasticmachine commented 5 years ago

Pinging @elastic/es-search

jimczi commented 5 years ago

We discussed this internally and agreed that the behavior should be consistent with the normalization that is applied to term queries targeting a keyword field. However, we're unsure whether the change should be treated as a bug fix or as a breaking change, since users might rely on the non-normalized value to define the missing value. While we agree that the current behavior is inconsistent and that the value should always be normalized, we were also wondering which use case this option serves: we expect the _first and _last missing modes to be used more widely for a keyword field than a custom value.

yrodiere commented 5 years ago

Hi,

I'm part of the same team as @fax4ever so I'll answer while he's away.

Long story short, our use case is a Java library (Hibernate Search) that exposes Elasticsearch's features through a different API, along with other, database-related features. So we don't have a specific use case; it's more that the behavior was inconsistent with what we expected, and that showed up in our integration tests.

That being said, if an actual use case is necessary, I can imagine a web UI displaying a sorted list where missing values for a keyword field are displayed as "Missing", even though they are indexed as null. If the customer requires that entries with this "Missing" label are just after "L..." and just before "N...", then using "missing": "Missing" would help. The developers could hard-code normalization in their app (use "missing": "missing"), but that's arguably not very developer-friendly.
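
For illustration, the hard-coded workaround mentioned above could look like the following Python sketch. It assumes the field's normalizer only lowercases, so `str.lower()` is used as an approximation; a normalizer with other filters would need a matching client-side function. `normalized_sort_clause` is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json


def normalized_sort_clause(field, missing, order="asc"):
    """Build a sort clause whose custom missing value is normalized
    client-side (hypothetical helper for illustration only)."""
    # Assumption: the field's normalizer is a plain "lowercase" filter,
    # approximated here with str.lower().
    return {field: {"order": order, "missing": missing.lower()}}


# Search body using the field name from the mapping in this issue.
body = {
    "query": {"match_all": {}},
    "sort": [normalized_sort_clause("byType_normalizedString", "Missing")],
}
print(json.dumps(body, indent=2))
```

The drawback is the one described above: the application has to duplicate the normalizer's logic and keep it in sync with the index settings.
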

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine commented 3 months ago

Pinging @elastic/es-search-relevance (Team:Search Relevance)