Allow for `scan` with aggregations

elastic / elasticsearch-dsl-py

High level Python client for Elasticsearch

http://elasticsearch-dsl.readthedocs.org

Apache License 2.0

3.82k stars, 801 forks

Allow for `scan` with aggregations #580

Open honzakral opened 7 years ago

honzakral commented 7 years ago

As of 5.0, Elasticsearch allows a search request with aggregations when using scan/scroll, which we should expose.

This has been moved here from elasticsearch-py - https://github.com/elastic/elasticsearch-py/issues/530

macdjord commented 7 years ago

To be clear - this is not about using a scrolled query to fetch the aggregations bit by bit. This is about having a query with both aggregations and hits, and you want to use a scrolled query for the hits while still seeing the aggregations.

Currently, there's no way to do this: Search.execute() doesn't do scrolled queries, while Search.scan() only returns an iterator over the hits, with no way to access the aggregation results.

My proposal is to add a new method, Search.execute_scan(), which returns a Response object like Search.execute(), but the Response.hits property, instead of being a static list, is a Search.scan()-style iterator.

nguyening commented 7 years ago

I'd also like to see this happen -- if this isn't a priority right now, do you have a suggested workaround for the time being, @HonzaKral? Perhaps even with the underlying elasticsearch-py package?

honzakral commented 6 years ago

I think the proper solution is to create a custom Response class that will hide this - it will provide standard access to the aggregations, but iterating over its .hits attribute will iterate over all the documents (just like iterating over scan() currently works). Exactly as @macdjord said!

This will make it compatible with the standard response.
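A minimal sketch of that idea, not an elasticsearch-dsl API: `ScrolledResponse` is a hypothetical name, and the code assumes a low-level elasticsearch-py style client whose `search`, `scroll` and `clear_scroll` methods return dict-like responses (written against the 7.x-style `body=` keyword). It relies on the fact that only the initial response of a scrolled search carries the aggregation results.

```python
class ScrolledResponse:
    """Hypothetical wrapper: aggregations up front, hits fetched lazily via scroll."""

    def __init__(self, client, index, body, scroll="5m"):
        self._client = client
        self._scroll = scroll
        # Only this first page contains the aggregation results.
        self._first_page = client.search(index=index, body=body, scroll=scroll)
        self.aggregations = self._first_page.get("aggregations", {})
        # Single-pass iterator over every matching document.
        self.hits = self._iter_hits()

    def _iter_hits(self):
        page = self._first_page
        scroll_id = page.get("_scroll_id")
        try:
            # An empty page means the scroll is exhausted.
            while page["hits"]["hits"]:
                yield from page["hits"]["hits"]
                page = self._client.scroll(scroll_id=scroll_id, scroll=self._scroll)
                scroll_id = page.get("_scroll_id", scroll_id)
        finally:
            # Release the server-side scroll context, even on early exit.
            if scroll_id:
                self._client.clear_scroll(scroll_id=scroll_id)
```

With a real client this would be used roughly as `resp = ScrolledResponse(es, "my-index", s.to_dict())`, reading `resp.aggregations` first and then looping over `resp.hits`; the index name here is only an example.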

qiujunda commented 6 years ago

I'm not sure if this should belong here, but the problem I am facing is more of the DSL library being unable to get us more than 10 aggregation results back. Please do correct me if I am wrong, but slicing seems to work only for hits rather than aggregations.

If the "scan" for aggregations can be implemented, I am sure it would be extremely helpful. Meanwhile, for anyone else facing the problem of only 10 aggregation results in the DSL library, hopefully the workaround here proves helpful.

Perhaps we could also look at allowing pagination for aggregations?

honzakral commented 6 years ago

@qiujunda scan with aggregations still doesn't scan through the aggregations, it just runs the aggregations first and then proceeds to scan through the documents.

For most aggregations you can already set the size parameter to get back more than 10 buckets. To paginate through all possible buckets you need to use the composite aggregation, which is unfortunately still not supported in elasticsearch-dsl - https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-composite-aggregation.html
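To illustrate the size parameter, here is a small sketch that builds a plain request body for a terms aggregation. The helper name `terms_agg_body`, the aggregation name `per_value` and the field `tags` are made up for the example; only the `terms` aggregation and its `size` option come from Elasticsearch itself.

```python
def terms_agg_body(field, size=100, agg_name="per_value"):
    """Build a search body asking for up to `size` terms buckets on `field`."""
    return {
        "size": 0,  # top-level size: return no hits, only aggregation results
        "aggs": {
            # aggregation-level size: how many buckets to return (default is 10)
            agg_name: {"terms": {"field": field, "size": size}},
        },
    }


body = terms_agg_body("tags", size=500)
```

In elasticsearch-dsl the same request should be expressible as `s.aggs.bucket("per_value", "terms", field="tags", size=500)` on a `Search` object.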

mcinnes01 commented 5 years ago

@HonzaKral is there any update on being able to return aggregates with the scan response?

Skaldenmet commented 5 years ago

This is a pretty big deal tbh. I would like to see this implemented.

darshan2203 commented 5 years ago

Any update on this functionality to scan through aggregation results?

iDmple commented 5 years ago

Same here!

honzakral commented 5 years ago

Just to clarify: this is not scanning through the results of an aggregation, just returning the aggregations first and then scanning through the documents.

To "scan" over aggregations you can use the composite aggregation as shown here - https://github.com/elastic/elasticsearch-dsl-py/blob/master/examples/composite_agg.py
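For readers who want the gist without opening the linked example, the pagination loop looks roughly like the sketch below. This is not the code from that file: `iter_composite_buckets` and the default aggregation name `comp` are hypothetical, and the client is assumed to be an elasticsearch-py style object whose `search` returns a dict-like response. The `sources`, `size`, `after` and `after_key` fields are those of the composite aggregation.

```python
def iter_composite_buckets(client, index, sources, page_size=100, agg_name="comp"):
    """Yield every bucket of a composite aggregation, one page at a time."""
    after = None
    while True:
        composite = {"sources": sources, "size": page_size}
        if after is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = client.search(index=index, body=body)["aggregations"][agg_name]
        if not result["buckets"]:
            return
        yield from result["buckets"]
        after = result.get("after_key")
        if after is None:
            return
```

A call such as `iter_composite_buckets(es, "my-index", [{"tag": {"terms": {"field": "tags"}}}])` would then walk all `tags` buckets; the index and field names are only examples.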

mateoSerna commented 3 years ago

I'm not sure if the problem I'm facing is related to this, but I'm unable to get the inner_hits from a scan response. Any suggestions would be appreciated.

Regards!

Kosmonafft commented 8 months ago

Any suggestions on how to get the aggregations from the response returned by scan()?