elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support Significant Terms in transforms #51073

Open JeffBolle opened 4 years ago

JeffBolle commented 4 years ago

Adding Terms and Significant Terms Aggregations in Transforms:

Having Terms and Significant Terms aggregations supported in transforms would make them substantially more useful for creating entity-specific data structures and indexes. This would not need to be a full implementation with multiple sub-buckets, etc., but simply allowing for arrays of strings to be produced as part of a transform would add a lot of flexibility and remove the need for ugly hacks.
In our use case, we have a unique userid that is used to track activity across hundreds (or thousands) of URLs, with varying activity on each one. We use dataframe transforms to aggregate all of the data for each user into a user-centric index. As part of that data structure we would like a simple list of all of the URLs visited (or the top X URLs, or the top interesting URLs), which would be produced by a terms aggregation on the original index.

Currently we use a scripted metric aggregation to generate a map of URLs with counts, and then an ingest processor to extract the URLs into a list in another field when ingesting the output from the dataframe. While this solution currently solves part of my problem, it is cumbersome to maintain and does not let me take full advantage of all the features of a true terms aggregation, much less a significant terms aggregation. A more straightforward approach would be to let us declare the terms aggregation directly in the pivot for the transform.

I have a number of additional use cases where this would let us extract additional information from our data much more easily and remove application logic that exists only to further process and enrich the user-centric objects created from the dataframe transform. In short, this would take the dataframe from doing 50% of the necessary work to 90+% of the work needed to create our fully featured user-centric object.
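For concreteness, the workaround we run today looks roughly like the following. The index, field, and pipeline names are made up for this example and the Painless scripts are simplified, but the shape is the same: a scripted metric in the transform builds a map of URL counts, and an ingest pipeline on the destination index copies the map keys into a plain array field.

```
PUT _transform/user_urls
{
  "source": { "index": "web-activity" },
  "dest": { "index": "user-centric", "pipeline": "extract-url-list" },
  "pivot": {
    "group_by": {
      "userid": { "terms": { "field": "userid" } }
    },
    "aggregations": {
      "url_counts": {
        "scripted_metric": {
          "init_script": "state.urls = [:]",
          "map_script": "def u = doc['url.keyword'].value; state.urls[u] = state.urls.getOrDefault(u, 0L) + 1",
          "combine_script": "return state.urls",
          "reduce_script": "def merged = [:]; for (def s : states) { if (s == null) { continue; } for (def e : s.entrySet()) { merged[e.getKey()] = merged.getOrDefault(e.getKey(), 0L) + e.getValue(); } } return merged"
        }
      }
    }
  }
}
```

```
PUT _ingest/pipeline/extract-url-list
{
  "description": "Copy the keys of the url_counts map into a plain list field",
  "processors": [
    {
      "script": {
        "source": "if (ctx.url_counts != null) { ctx.urls = new ArrayList(ctx.url_counts.keySet()); }"
      }
    }
  ]
}
```

A terms (or significant terms) aggregation declared directly in the pivot would replace both the scripted metric and the ingest pipeline step.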

elasticmachine commented 4 years ago

Pinging @elastic/ml-core (:ml/Transform)

hendrikmuhs commented 4 years ago

@JeffBolle

Thank you for your feedback! We are always looking into expanding aggregation support in transforms, and use cases like yours help us prioritize what to add next, so please keep the feedback coming.

I cannot promise when we will be able to add terms and significant terms. Significant terms in particular needs some thought about how we map the output of the aggregation into the output documents of the transform destination index.
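For reference, a significant_terms bucket in an ordinary search response carries several values per term, not just a key, which is part of what makes that mapping non-obvious. Roughly (field and term names illustrative):

```json
"interesting_urls": {
  "doc_count": 1234,
  "bg_count": 1000000,
  "buckets": [
    {
      "key": "example.com/checkout",
      "doc_count": 42,
      "bg_count": 310,
      "score": 0.87
    }
  ]
}
```

We would need to decide which of these values end up in the destination documents and how they should be mapped.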

JeffBolle commented 4 years ago

@hendrikmuhs Thank you for the response. I am sure there are an overwhelming number of use cases and edge cases once you start including things like terms aggregations in transforms.

I think that maintaining the constraint that the dataframe is a two-dimensional tabular data structure is reasonable, with the extension that a field in the dataframe could be a list of strings. I understand "two-dimensional" to include lists, but not objects, though I may be wrong.

I have a number of use cases that do not require the full power of terms (or significant terms) aggregations with nested aggregations, etc. to be successful. The simplest first step could be to support creating an array in the dataframe from the keys of the returned terms (or significant terms) aggregation, as sketched below.
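To make that concrete, the kind of pivot I would like to be able to declare is roughly the following. The syntax is only a sketch of what a future implementation might accept, and the index and field names are illustrative:

```
PUT _transform/user_urls
{
  "source": { "index": "web-activity" },
  "dest": { "index": "user-centric" },
  "pivot": {
    "group_by": {
      "userid": { "terms": { "field": "userid" } }
    },
    "aggregations": {
      "top_urls": { "terms": { "field": "url.keyword", "size": 10 } }
    }
  }
}
```

with the destination documents containing just the bucket keys as an array of strings, for example:

```json
{ "userid": "u-12345", "top_urls": ["example.com/a", "example.com/b"] }
```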

I know there are at least a few other people (including some who have presented at Elastic{ON}) who use the scripted metric bucket hack to get strings into their dataframes. I think a first implementation that removes the need for that workaround would take care of a lot of low-hanging-fruit use cases.

hendrikmuhs commented 4 years ago

Support for terms and rare_terms has been added and will be available in 7.9.

I am therefore changing the title so that this issue tracks significant terms only.
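As a rough illustration of the 7.9 behavior: a terms aggregation declared in the pivot (as sketched in the comment above) writes the term keys and their document counts into the destination document as an object rather than a plain array, roughly along these lines (field names reused from the sketch above; exact mapping details may differ by version):

```json
{
  "userid": "u-12345",
  "top_urls": {
    "example.com/a": 42,
    "example.com/b": 17
  }
}
```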