elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.6k stars 24.63k forks source link

Use ValuesSourceRegistry for MultiValuesSourceAggregations #53194

Open polyfractal opened 4 years ago

polyfractal commented 4 years ago

Wire up the MultiValuesSourceAggregationBuilder sub classes to use the values source registry. This should follow a model like Composite, where single source components are loaded into the registry and can be checked out in any combination.

Original Description Below

Today we have three aggregations that use multiple values sources (e.g. multiple fields):

Matrix Stats

Uses the ArrayValuesSourceAggregationBuilder abstraction, which accepts a single array of fields, and all extra parameters (missing, format, script, etc) are presumed to apply to all of the specified fields.

I do not believe there is validation that all fields are the same type, but matrix_stats assumes they are all numerics.

"matrix_stats": {
  "fields": ["poverty", "income"]
}
Weighted Average

Uses the MultiValuesSourceAggregationBuilder abstraction, which conceptually accepts a map of single ValuesSourceAggBuilders (although doesn't actually use VSAB). Each field can define it's own extra params like missing, format, etc, and the overall agg can also have parameters like format.

WeightedAvg also assumes both fields are numerics, but this is not enforced by the builder when fields are resolved

"weighted_avg": {
  "value": {
    "field": "grade"
  },
  "weight": {
    "field": "weight"
  }
}
Top Metrics

Uses AbstractAggregationBuilder directly, because MultiValuesSourceAggregationBuilder had both too much and too little.

@nik9000 can probably elaborate more on why MVSAB wasn't a good fit.

"top_metrics": {
  "metrics": {"field": "value_field"},
  "sort": {"sort_field": "desc"}
}
Bonus round: Composite

Composite is a little different in that it operates totally differently, with it's own set of support classes. But technically it's a multi VS agg as well, and again doesn't share any infra with the others.

So, what to do?

So we basically have three (or four) aggs with three (or four) different abstractions. So we could find some commonality between them all and build something to reduce redundant code, while leaving it flexible for individual circumstances.

Or we just remove the abstractions (ArrayVSB, MultiVSB) and build directly off AbstractAB like top metrics. Right now the extra boilerplate that Array/Multi introduce is not warranted given that they are not reused anywhere.

A final option is to continue in a holding pattern until we have a better idea if more multi-field aggs are coming (e.g. if we add 2d clustering, that would need two fields and provide another example)

We've decided to ignore these aggs in the VS Refactoring for the time being, but we should probably do something about them sooner than later.

elasticmachine commented 4 years ago

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

wylieconlon commented 3 years ago

We just ran into this issue in Lens and it prevents our adoption of top_metrics. It seems like these aggs are silently ignoring multi-value sources, which is worse than throwing an error in my opinion. I think we should take the 8.0 major as an opportunity to make a breaking change to multi value fields. Specifically, we should either:

This would be much easier if clients had metadata about multi-value fields, which is related to https://github.com/elastic/elasticsearch/issues/58523 and https://github.com/elastic/elasticsearch/issues/64077.

wchaparro commented 3 years ago

@wylieconlon we missed this one since Zach has left. cc @elastic/es-analytics-geo The issue you are reporting is different from this issues. This issue is about aggs that operate on more than one field, you are taking about fields with multiple values.

Could you please open an issue for us? Thanks!

not-napoleon commented 2 years ago

We discussed this at the meeting on 2022-01-05. Summary:

I'm going to update the title and description of this issue to reflect the current thinking, as well as remove Team Discuss. I also opened https://github.com/elastic/elasticsearch/issues/82281 to address the unrelated concerns around Top Metrics over multi value fields.