elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Sorting across the whole data set doesn't work when using a point-in-time search with slicing #101096

Open valasatava opened 11 months ago

valasatava commented 11 months ago

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

20.0.2

OS Version

5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I'm trying to pull sorted results from Elasticsearch. There can be millions of documents, and it's taking a very long time to fetch all of the results. I'm looking for ways to improve the speed.

I implemented sliced searches with a point in time (PIT), which reduces the fetch time, but the results are no longer globally sorted. They are sorted only within each slice, and I need them returned in sort order across the whole data set.

For example, this search for slice 1:

GET _search
{
  "slice": {
    "id": 1,
    "max": 5
  },
  "pit": {
    "id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA"
  },
  "_source": ["_none_"],
  "docvalue_fields": ["rcsb_id"], 
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "rcsb_id": {
        "order": "asc"
      }
    }
  ]
}

returns a first document with ID "006":

{
  "pit_id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA",
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8227,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "combined_mol_definition",
        "_id": "006-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "006"
          ]
        },
        "sort": [
          "006",
          38024
        ]
      },
      {
        "_index": "combined_mol_definition",
        "_id": "00B-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "00B"
          ]
        },
        "sort": [
          "00B",
          47562
        ]
      }

and the search for slice 2 returns a first document with ID "001":

{
  "pit_id": "tOaGBAEXY29tYmluZWRfbW9sX2RlZmluaXRpb24WaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAWTUpXdDdzb1BUYXlUd1NSS0l4THFMUQAAAAAAAAOpjBZwLUZROWxiWVJycWtUQnRFWk1iek9nAAEWaVBkVG1IZ0NUeFNvT3gybmJXUVAxdwAA",
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8257,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "combined_mol_definition",
        "_id": "001-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "001"
          ]
        },
        "sort": [
          "001",
          56945
        ]
      },
      {
        "_index": "combined_mol_definition",
        "_id": "003-CHEM_COMP",
        "_score": null,
        "_source": {},
        "fields": {
          "rcsb_id": [
            "003"
          ]
        },
        "sort": [
          "003",
          63266
        ]
      }
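
One possible client-side workaround, sketched here under assumptions rather than taken from this report: because each slice is already sorted on its own, the slices can be combined with a k-way merge. The helper names `sort_key` and `merge_sorted_slices` are hypothetical, the hits are abbreviated from the two responses above, and the sketch assumes an ascending sort on ASCII keyword values, where Python string ordering matches the order Elasticsearch returned.

```python
import heapq
from typing import Any, Dict, Iterable, Iterator


def sort_key(hit: Dict[str, Any]) -> tuple:
    # Each hit carries its sort values, e.g. ["006", 38024].
    return tuple(hit["sort"])


def merge_sorted_slices(
    slices: Iterable[Iterable[Dict[str, Any]]],
) -> Iterator[Dict[str, Any]]:
    # heapq.merge performs a lazy k-way merge. It assumes every input is
    # already sorted by the same key, which holds within a single slice.
    return heapq.merge(*slices, key=sort_key)


# Hits abbreviated from the slice 1 and slice 2 responses in this report.
slice_1 = [
    {"_id": "006-CHEM_COMP", "sort": ["006", 38024]},
    {"_id": "00B-CHEM_COMP", "sort": ["00B", 47562]},
]
slice_2 = [
    {"_id": "001-CHEM_COMP", "sort": ["001", 56945]},
    {"_id": "003-CHEM_COMP", "sort": ["003", 63266]},
]

merged_ids = [hit["_id"] for hit in merge_sorted_slices([slice_1, slice_2])]
print(merged_ids)
```

The merge only emits a hit once every slice has produced its next candidate, so it restores global order but also makes the consumer wait on the slowest slice.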

Steps to Reproduce

Step 1: mappings

{
  "mappings": {
    "properties": {
      "rcsb_id": {
        "type": "keyword",
        "eager_global_ordinals": true,
        "fields": {
          "normalized": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}

Step 2: index creation

Step 3: opening point-in-time

Step 4: requesting slices
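
Steps 2 to 4 are listed without request details. As a rough illustration of step 4 only, the sketch below builds one search body per slice, mirroring the slice 1 request shown earlier. `build_slice_requests` is a hypothetical helper, the PIT id is a placeholder, and sending the bodies to `_search` is left out. Slice ids are assumed to run from 0 to `max - 1`.

```python
from typing import Any, Dict, List


def build_slice_requests(
    pit_id: str, max_slices: int, sort_field: str
) -> List[Dict[str, Any]]:
    """Return one search body per slice (hypothetical helper, not an Elasticsearch API)."""
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "pit": {"id": pit_id},
            "_source": ["_none_"],
            "docvalue_fields": [sort_field],
            "query": {"match_all": {}},
            "sort": [{sort_field: {"order": "asc"}}],
        }
        for slice_id in range(max_slices)
    ]


# Placeholder PIT id; the real value comes from opening the point in time.
requests_for_slices = build_slice_requests("<pit id from step 3>", 5, "rcsb_id")
print([body["slice"] for body in requests_for_slices])
```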

Logs (if relevant)

No response

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-search (Team:Search)

benwtrent commented 3 months ago

This is a dumb question, but are you setting the search_after parameter in your subsequent search calls?

I would think you should use search_after with PIT and not bother with slicing.
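
A minimal sketch of what that suggestion could look like, assuming a caller-supplied `search` function that sends a request body to `_search` and returns the parsed JSON response. The names `iter_sorted_hits`, `search`, and `page_size` are hypothetical and not part of any client library. Each page resumes from the `sort` values of the last hit on the previous page, so hits arrive in global sort order.

```python
from typing import Any, Callable, Dict, Iterator

# Assumed shape: takes a search body, returns the parsed _search response.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def iter_sorted_hits(
    search: SearchFn, pit_id: str, sort_field: str, page_size: int = 1000
) -> Iterator[Dict[str, Any]]:
    body: Dict[str, Any] = {
        "size": page_size,
        "pit": {"id": pit_id},
        "query": {"match_all": {}},
        "sort": [{sort_field: {"order": "asc"}}],
    }
    while True:
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page means everything has been read
        yield from hits
        # Resume after the last hit's sort values on the next page.
        body["search_after"] = hits[-1]["sort"]
        # The PIT id can change between responses; keep the latest one.
        body["pit"] = {"id": response.get("pit_id", body["pit"]["id"])}
```

This reads pages one after another instead of in parallel, so it trades the speed-up from slicing for a single, globally ordered stream.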

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)