elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Paging support for aggregations #4915

Closed aaneja closed 7 years ago

aaneja commented 10 years ago

Terms aggregation does not support a way to page through the buckets returned. To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using an 'exclude' filter to exclude already 'seen' pages.

Will this result in better runtime performance on the ES cluster compared to getting all the buckets and doing my own paging logic client-side?
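
For concreteness, this workaround can be sketched as a helper that builds each page's request body, excluding terms already seen. This is only a sketch: the field name is illustrative, and depending on the ES version `exclude` takes either a regular expression or a list of exact terms.

```python
def build_page_request(field, seen_terms, page_size=10):
    """Build a terms-aggregation request body that skips buckets
    already returned on previous pages via an exclude list."""
    terms_agg = {
        "field": field,
        "size": page_size,
        "min_doc_count": 1,  # buckets with zero matches are never needed
    }
    if seen_terms:
        # Exclude the terms we have already paged through.
        terms_agg["exclude"] = sorted(seen_terms)
    return {"size": 0, "aggs": {"paged": {"terms": terms_agg}}}

# First page: nothing to exclude; later pages: skip terms already seen.
first = build_page_request("primaryId", set())
second = build_page_request("primaryId", {"541afdfc", "541afe08"})
```

Each response's bucket keys would be added to `seen_terms` before building the next request.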

jpountz commented 10 years ago

Paging is tricky to implement because document counts for terms aggregations are not exact when shard_size is less than the field cardinality and sorting on count desc. So weird things may happen like the first term of the 2nd page having a higher count than the last element of the first page, etc.

Regarding your question, terms aggregations run in two phases on the shard-level: first they compute counts for every possible term, and then they pick the top shard_size ones. Increasing size (or shard_size) only makes the 2nd step more costly. Given that the runtime of the first step is linear with the number of matched documents and that the runtime of the 2nd step is O(#unique_values * log(shard_size)), if you only have a limited number of unique values compared to the number of matched documents, doing the paging on client-side would be more efficient. On the other hand, on high-cardinality-fields, your first approach based on an exclude would probably be better.

As a side-note, min_doc_count has no effect on runtime performance when it is greater than or equal to 1. Only min_doc_count=0 is more costly given that it requires Elasticsearch to also fetch terms that are not contained in any match.

haschrisg commented 10 years ago

@jpountz would storing the results of an aggregation in a new index be feasible? In general, it'd be great to have a way of dealing with both aggregations with high cardinality, and nested aggregations that produce a large number (millions) of results -- even if the cost of that is that they're not sorted properly when paging.

jpountz commented 10 years ago

If it makes sense for your use-case, this is something that you could consider implementing on the client side: run these costly aggregations hourly/daily, store the result in an index, and use that index between two runs to explore the results of the aggregation.

apatrida commented 10 years ago

When sorting by term instead of count, why would paging not be possible? For example, a terms aggregation with a top_hits sub-aggregation could produce an overly large result set without paging on the terms aggregation. Not all aggregations want to sort by count.

tugberkugurlu commented 10 years ago

I can see that this may not be possible but for a top_hits aggregation, I really need this functionality. I have the below aggregation query:

POST sport/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "defense_strength": {
                  "lte": 83.43
                }
              }
            },
            {
              "range": {
                "forward_strength": {
                  "gte": 91
                }
              }
            }
          ]
        }
      }
    }
  }, 
  "aggs": {
    "top_teams": {
        "terms": {
          "field": "primaryId"
        },
        "aggs": {
          "top_team_hits": {
            "top_hits": {
              "sort": [
                {
                    "forward_strength": {
                        "order": "desc"
                    }
                }
              ],
              "_source": {
                  "include": [
                      "name"
                  ]
              },
              "from": 0,
              "size" : 1
            }
          }
        }
      }
    }
  }

This produces the below result for an insanely cheap index (with a low number of docs):

    {
         "took": 2,
         "timed_out": false,
         "_shards": {
                "total": 5,
                "successful": 5,
                "failed": 0
         },
         "hits": {
                "total": 5,
                "max_score": 0,
                "hits": []
         },
         "aggregations": {
                "top_teams": {
                     "buckets": [
                            {
                                 "key": "541afdfc532aec0f305c2c48",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "y6jZ31xoQMCXaK23rPQgjA",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Barcelona"
                                                         },
                                                         "sort": [
                                                                98.32
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe08532aec0f305c5f28",
                                 "doc_count": 2,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 2,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "hewWI0ZpTki4OgOeneLn1Q",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Arsenal"
                                                         },
                                                         "sort": [
                                                                94.3
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            },
                            {
                                 "key": "541afe09532aec0f305c5f2b",
                                 "doc_count": 1,
                                 "top_team_hits": {
                                        "hits": {
                                             "total": 1,
                                             "max_score": null,
                                             "hits": [
                                                    {
                                                         "_index": "sport",
                                                         "_type": "football_team",
                                                         "_id": "x-_YBX5jSba8qsEuB8guTQ",
                                                         "_score": null,
                                                         "_source": {
                                                                "name": "Real Madrid"
                                                         },
                                                         "sort": [
                                                                91.34
                                                         ]
                                                    }
                                             ]
                                        }
                                 }
                            }
                     ]
                }
         }
    }

What I need here is the ability to get the first 2 aggregation results in one request, and the remaining ones (in this case, only 1) in another request.

missingpixel commented 9 years ago

If paging aggregations is not possible, how do we use ES for online stores where products of different colours are grouped together? Or, what if there are five million authors in the example at: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/top-hits.html ? Aggregate them and perform pagination in-memory?

If that's not possible, what else can be done in place of grouping in Solr?

Thank you

adrienbrault commented 9 years ago

A parameter allowing the first X buckets to be hidden from the response would be nice.

adrienbrault commented 9 years ago

@clintongormley Why was this issue closed ?

bobbyhubbard commented 9 years ago

Reopen please?

mikelrob commented 9 years ago

+1 for pagination while sorted on term not doc count

android-programmer commented 9 years ago

+1

daniel1028 commented 9 years ago

Can you re-open this please?

I understand that aggregation pagination would create performance issues with larger numbers of records. But it would not affect smaller numbers of records, right?

The performance issue would only happen with more records. Why don't we have this support at least for smaller sets of records?

Why hesitate to add this support because of what happens with large amounts of data? If we had it, it would be very helpful for paginating smaller amounts of data.

Maybe we can inform users that this support is only efficient for smaller amounts of data, and that as the amount of data increases, performance will suffer.

clintongormley commented 9 years ago

We have been against adding pagination support to the terms (and related) aggregations because it hides the cost of generating the aggregations from the user. Not only that, it can produce incorrect ordering because term-based aggregations are approximate.

That said, we support pagination on search requests, which are similarly costly (although accurate).

While some users will definitely shoot themselves in the foot with pagination (eg https://github.com/elasticsearch/elasticsearch/issues/4915#issuecomment-61253054), not supporting pagination does limit some legitimate use cases.

I'll reopen this ticket for further discussion.

byronvoorbach commented 9 years ago

I would love to see this feature added to ES, but I understand the cost at which it would come. I'm currently working for a client who needed such a feature, but since it didn't exist yet, we solved it with two queries:

The first query has a terms aggregation on the field we want to group on, ordered by doc score. We set the size of the aggregation to 0 so that we get all buckets for that query. We then parse the result and take the bucket keys corresponding to the given size and offset (e.g. buckets 30-40 for page 3).

We then perform a second query, filtering all results by the keys from the first query. Alongside that query is a terms aggregation (on the same field as before), and we add a top_hits aggregation to get the results for those (10) buckets.

This way we don't have to load all 40 buckets and get the top_hits for each of them, which increases performance.

Loading all buckets with the top 10 hits per bucket took around 20 seconds for a certain query. With the above change we managed to bring it back to 100ms.

This might help someone out as a workaround till such a feature exists within Elasticsearch
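
As a rough illustration of this two-query workaround, here is a Python sketch. It assumes a `search(index=..., body=...)` callable (for example, an Elasticsearch client's search method); the field names are illustrative, and the `filtered` query syntax matches the pre-2.0 DSL used elsewhere in this thread.

```python
def paged_group_search(search, index, group_field, query, page, page_size=10):
    """Two-phase grouped pagination: fetch all bucket keys first,
    then fetch top_hits only for the keys on the requested page."""
    # Phase 1: list every bucket key ("size": 0 on a terms agg meant
    # "all terms" in the ES versions discussed in this thread).
    phase1 = search(index=index, body={
        "size": 0,
        "query": query,
        "aggs": {"groups": {"terms": {"field": group_field, "size": 0}}},
    })
    keys = [b["key"] for b in phase1["aggregations"]["groups"]["buckets"]]
    page_keys = keys[page * page_size:(page + 1) * page_size]

    # Phase 2: restrict the query to this page's keys and pull
    # top_hits for just those buckets.
    phase2 = search(index=index, body={
        "size": 0,
        "query": {"filtered": {
            "query": query,
            "filter": {"terms": {group_field: page_keys}},
        }},
        "aggs": {"groups": {
            "terms": {"field": group_field, "size": page_size},
            "aggs": {"top": {"top_hits": {"size": 10}}},
        }},
    })
    return phase2["aggregations"]["groups"]["buckets"]
```

The expensive top_hits work is only done for the one page of buckets actually requested.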

davidvgalbraith commented 9 years ago

Hey! I too would like paging for aggregations. That's all.

bauneroni commented 9 years ago

I'd also love to see this someday but I do understand the burden (haven't used that word in a long time) and costs to implement this. This feature would be quite handy for my client's application which is operating on ~250GB+ of data.

Well, yeah.. what he^ said :+1:

vinusebastian commented 9 years ago

@aaneja with respect to "Terms aggregation does not support a way to page through the buckets returned. To work around this, I've been trying to set 'min_doc_count' to limit the buckets returned and using a 'exclude' filter, to exclude already 'seen' pages.

Will this result in better running time performance on the ES cluster as compared to getting all the buckets and then doing my own paging logic client side ?"

How did you exclude already seen pages? Or how did you keep track of seen pages? Also what did you learn about performance issues with such an approach?

dakrone commented 9 years ago

We discussed this, and one potential idea is to add the ability to specify a start_term for aggregations, which would allow the aggregation to skip all of the preceding terms. The client could then implement paging by retrieving the first page of aggregations, then sending the same request with start_term set to the last term of the previous results. Otherwise the aggregation will still incur the overhead of computing the results and sub-aggregations for each of the "skipped" buckets.

To better understand this, it would be extremely useful to get more use-cases out of why people need this and how they would use it, so please do add those to this ticket.
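
To make the proposal concrete, a client paging loop might look like the sketch below. Note that `start_term` is purely hypothetical here (it is the proposed option, not an existing one), and `search` stands in for any client's search call.

```python
def iterate_term_pages(search, index, field, page_size=10):
    """Yield pages of term buckets, resuming each request after the
    last term of the previous page via the proposed start_term option."""
    start_term = None
    while True:
        terms_agg = {"field": field, "size": page_size,
                     "order": {"_term": "asc"}}
        if start_term is not None:
            terms_agg["start_term"] = start_term  # hypothetical option
        resp = search(index=index,
                      body={"size": 0, "aggs": {"page": {"terms": terms_agg}}})
        buckets = resp["aggregations"]["page"]["buckets"]
        if not buckets:
            return
        yield buckets
        start_term = buckets[-1]["key"]  # resume after the last term seen
```

Each request only pays for the buckets on its own page, rather than recomputing everything that came before.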

2e3s commented 9 years ago

+1 for that. There may be tens of thousands of unique terms by which we group, gathering statistics via sub-aggregations. The result can be sorted by any of these sub-aggregations, so it's going to be very costly anyway, but its speed and precision with ES are currently more than bearable. If we didn't have to send such big JSON payloads between servers and hold them in PHP (which isn't good at this at all for now), it would be fine. I've even thought of writing a plugin to do this simple job, but it would still require computing a sorting sub-aggregation if one is used.

a0s commented 9 years ago

+1

benneq commented 9 years ago

+1

pauleil commented 9 years ago

+1

jaynblue commented 9 years ago

+1

dragonkid commented 9 years ago

+1

genme commented 9 years ago

+1

aznamier commented 9 years ago

+1

bobbyhubbard commented 9 years ago

+1

robinmitra commented 9 years ago

+1

GregorSondermeier commented 9 years ago

+1

mfischbo commented 9 years ago

+1

khaines commented 9 years ago

+1

srijan55 commented 9 years ago

+1

nathraQ commented 9 years ago

+1

v4run commented 9 years ago

+1

rmuhzin commented 9 years ago

+1

Siarhei-Yarkavy commented 9 years ago

+1

concordiadiscors commented 9 years ago

+1

dubadub commented 9 years ago

+1

paxnoop commented 9 years ago

+1

leedohyun commented 9 years ago

+1

caJaeger commented 9 years ago

+1

alexcode-lab commented 9 years ago

+1

jpountz commented 9 years ago

I've been thinking more about this recently: I don't think we can reasonably add pagination options to the terms aggregation. However, maybe we can make it easier to implement from the client side. Here is a proposed plan:

Sorting by term

This case could be handled efficiently if we had options to only run the terms aggregation on a range of terms, similarly to the include option. For instance if you got aaa, aab and abc (size=3) on the first page, you could get the next page by running a terms aggregation on terms that are greater than abc. On server-side, this could be dealt with efficiently by finding the ordinals that match the range boundaries and only emitting ordinals between these boundaries at the fielddata level.

Sorting by count/sub-aggregation

In that case, the easiest/most efficient solution would be to provide the terms aggregation with a list of exclude terms containing the terms that occurred on previous pages. So if your page size is 10 and you want page 4, you would need to exclude the 30 terms that appeared on the first 3 pages. The functionality already exists, so this would mostly be a matter of documentation.
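
Put differently, building the request for the next page is just a matter of excluding everything seen so far. A sketch, assuming an ES version where `exclude` accepts a list of exact terms (older versions take a regular expression):

```python
def request_for_page(field, previous_pages, page_size=10):
    """Build the next page's terms-agg request by excluding every
    term that already appeared on an earlier page."""
    seen = [term for page in previous_pages for term in page]
    terms_agg = {"field": field, "size": page_size}
    if seen:
        # e.g. 30 excluded terms when asking for page 4 with page size 10
        terms_agg["exclude"] = seen
    return {"size": 0, "aggs": {"by_term": {"terms": terms_agg}}}
```

The exclude list grows linearly with the page number, which is why this only stays cheap for shallow paging.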

nekulin commented 9 years ago

Yes, for now you can create an auto-increment id and filter with id > 19, size=10, then id > 29, size=10. But all the data will still have to be grouped.

Castlejoe commented 9 years ago

Paging should work efficiently for many terms (I have 50,000+) and I don't see how excluding previous terms ("you would need to exclude the 30 terms that appeared on the first 3 pages") would work here. To exclude terms, I would need to have all terms in case the user wants to jump to the last page...

For me, and probably many other, the simplest solution would be a perfect start: do exactly what you do today but return only the requested page's data.

abeninskibede commented 9 years ago

+1

emilgerebo commented 9 years ago

+1

nile1801 commented 9 years ago

+1

Jan539 commented 9 years ago

I'm relatively new to this topic, so maybe I don't see the problem here, but my use case is the following:

I send a query with my aggregation (using size 0) and get back a bucket array containing all the results. Now I could store those buckets and pick the first X, then the next X entries. So why isn't it possible to have ES execute the aggregation with size 0, store all the results, and return the first X bucket entries together with the total number of bucket entries? With the next query I could pass the start index of the next X entries.

From the comments above I think I misunderstood something, but this would be my idea.
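
What's described here can at least be emulated client-side today: fetch all buckets once, then slice locally. A minimal sketch:

```python
def slice_buckets(buckets, page, page_size):
    """Return one page of already-fetched buckets plus the total count,
    mimicking 'return only the requested page' on the client."""
    start = page * page_size
    return {"total": len(buckets),
            "page": buckets[start:start + page_size]}

# Buckets as they would come back from a terms aggregation.
all_buckets = [{"key": "term%d" % i, "doc_count": 10 - i} for i in range(5)]
result = slice_buckets(all_buckets, page=1, page_size=2)
```

The server still computes and transfers every bucket; only the slicing moves to the client, which is exactly the cost the earlier comments are concerned about.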

nvh0412 commented 9 years ago

+1