elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fingerprint/composite field types #84282

Open dgieselaar opened 2 years ago

dgieselaar commented 2 years ago

In the APM app (and probably in Observability in general) we sometimes use the composite of multiple fields as "keys" for a certain timeseries. E.g., we might use a nested terms aggregation on service.name + service.environment. There are several downsides to this approach currently:

We also use the terms enum API to get a list of service names fast. However, we cannot use this for multiple fields.

One workaround would be to add an ingest processor that "fingerprints" values from multiple fields into a single keyword field, and then aggregate over that field. However, the downside is that we would have to come up with our own serialization/deserialization logic.
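To make the serialization concern concrete, here is a minimal Python sketch of one possible encoding for such a fingerprint keyword. The function names `encode_fingerprint` and `decode_fingerprint` are hypothetical, and the compact JSON array format is an assumption rather than anything Elasticsearch provides. It is chosen because a plain delimiter such as `/` becomes ambiguous as soon as a value contains that delimiter.

```python
import json


def encode_fingerprint(values):
    """Serialize an ordered list of field values into one keyword string."""
    # A compact JSON array escapes special characters inside the values,
    # so the original values can be recovered unambiguously.
    return json.dumps(list(values), separators=(",", ":"), ensure_ascii=False)


def decode_fingerprint(key, field_names):
    """Turn a fingerprint string back into a {field name: value} mapping."""
    values = json.loads(key)
    if len(values) != len(field_names):
        raise ValueError("fingerprint does not match the expected fields")
    return dict(zip(field_names, values))


fields = ["service.name", "service.environment"]
key = encode_fingerprint(["opbeans-java", "production"])
print(key)
print(decode_fingerprint(key, fields))
```

The field order has to be fixed and known to both the writer and the reader, which is exactly the kind of contract a dedicated field type could own instead of the client.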

Ideally, ES can help us here by adding a field type for this purpose - I'm using fingerprint here because a composite field type is already a thing in ES, but the name is probably not the best. The mapping could look as follows:

{
  "properties": {
    "service": {
      "properties": {
        "name": {
          "type": "keyword"
        },
        "environment": {
          "type": "keyword"
        },
        "id": {
          "type": "fingerprint",
          "fields": [
            "service.name",
            "service.environment"
          ]
        }
      }
    }
  }
}

Suppose that we run a terms aggregation on service.id:

{
  "aggs": {
    "service.id": {
      "terms": {
        "field": "service.id"
      }
    }
  }
}

Elasticsearch would return the composite values as follows:

{
  "aggs": {
    "service.id": {
      "buckets": [
        {
          "key": "opbeans-java/production",
          "key_as_value": {
            "service.name": "opbeans-java",
            "service.environment": "production"
          }
        }
      ]
    }
  }
}

Or, when we call the terms enum API (which would have to be a breaking change, I guess?):

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "terms" : [
    { "service.name": "apm-server", "service.environment": "development" },
    { "service.name": "apm-server", "service.environment": "production" }
  ],
  "complete" : true
}

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

elasticmachine commented 2 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

ywelsch commented 2 years ago

I've tagged both search and analytics teams as this touches on areas covered by both. Each team can discuss and leave their thoughts here.

imotov commented 2 years ago

Do you anticipate a need for multiple fingerprints per document? If not, this is basically what we are doing with _tsid in time series indices.

dgieselaar commented 2 years ago

@imotov yeah, I think so. E.g., for the service inventory we might only need service name + env, but then when drilling down into the service detail page, we'd like to add transaction type and maybe host name.

nik9000 commented 2 years ago

I'd be curious to see a picture of the thing you are building with the results here. We sure can build fingerprint fields if it's the right thing. But maybe the right thing is to make the multi-field terms agg faster.

dgieselaar commented 2 years ago

@nik9000 the thing that started this discussion was that we are experimenting with populating the service inventory (our landing page that has a list of all APM services) with the terms enum API to speed up perceived performance. However, one drawback there is that we'd like to filter on/group by environment, and the terms enum API will only return values for a single field. That is something that the multi terms agg cannot solve, I think, though I am all in favor of a speed boost for the multi terms agg. We do have some cases where we use a nested terms agg instead of multi terms because the former is a lot faster, even though multi terms should be the more appropriate agg in theory.
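For readers comparing the two approaches mentioned above, here is a hedged sketch of both request bodies written as Python dicts, plus a helper that flattens the nested response into name/environment rows. The aggregation names (`services`, `environments`), the helper `flatten_nested`, and the sample response are made up for illustration. The sample is hand-written, not real Elasticsearch output.

```python
# Nested terms aggregation: one terms agg per field, nested via "aggs".
nested_terms_body = {
    "size": 0,
    "aggs": {
        "services": {
            "terms": {"field": "service.name"},
            "aggs": {
                "environments": {"terms": {"field": "service.environment"}}
            },
        }
    },
}

# multi_terms aggregation: a single agg keyed by the tuple of field values.
multi_terms_body = {
    "size": 0,
    "aggs": {
        "services": {
            "multi_terms": {
                "terms": [
                    {"field": "service.name"},
                    {"field": "service.environment"},
                ]
            }
        }
    },
}


def flatten_nested(response):
    """Flatten nested terms buckets into (name, environment, doc_count) rows."""
    rows = []
    for service in response["aggregations"]["services"]["buckets"]:
        for env in service["environments"]["buckets"]:
            rows.append((service["key"], env["key"], env["doc_count"]))
    return rows


# Hand-written sample in the shape of a nested terms response.
sample_response = {
    "aggregations": {
        "services": {
            "buckets": [
                {
                    "key": "opbeans-java",
                    "doc_count": 5,
                    "environments": {
                        "buckets": [
                            {"key": "production", "doc_count": 3},
                            {"key": "development", "doc_count": 2},
                        ]
                    },
                }
            ]
        }
    }
}

for row in flatten_nested(sample_response):
    print(row)
```

The flattening step is the client-side work that the nested form requires and that multi terms, or a composite field, would make unnecessary.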

dgieselaar commented 2 years ago

Another thing I'm wondering about is: suppose we have such a field defined on three different fields, e.g. on service.name, service.environment and transaction.type, and I'd like to run a terms agg on only two of those fields. That would mean ES has to merge buckets in the reduce phase. Is something like that a reasonable thing to do with a field type like this?

Maybe that's more of a TSDB thing though.
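To illustrate what merging buckets would mean here, the following Python sketch projects three-field composite keys onto two of the fields and sums the document counts of buckets that collapse together. It is only a client-side illustration under assumed data shapes, not a description of how Elasticsearch implements its reduce phase. The function name `merge_buckets` and the sample buckets are hypothetical.

```python
from collections import Counter


def merge_buckets(buckets, all_fields, wanted_fields):
    """Project composite keys onto a subset of fields and sum doc counts."""
    # Positions of the wanted fields inside each composite key.
    positions = [all_fields.index(name) for name in wanted_fields]
    merged = Counter()
    for bucket in buckets:
        projected = tuple(bucket["key"][i] for i in positions)
        merged[projected] += bucket["doc_count"]
    # Sort by key so the result is deterministic.
    return [
        {"key": list(key), "doc_count": count}
        for key, count in sorted(merged.items())
    ]


all_fields = ["service.name", "service.environment", "transaction.type"]

# Made-up buckets keyed on all three fields.
buckets = [
    {"key": ["opbeans-java", "production", "request"], "doc_count": 10},
    {"key": ["opbeans-java", "production", "page-load"], "doc_count": 4},
    {"key": ["opbeans-java", "development", "request"], "doc_count": 1},
]

merged = merge_buckets(buckets, all_fields, ["service.name", "service.environment"])
for bucket in merged:
    print(bucket)
```

In this sample the two production buckets collapse into one bucket with a combined count, while the development bucket is carried over unchanged.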

nik9000 commented 2 years ago

ES could merge the buckets when reading, I think. I'm not sure of the exact mechanics, but it should be possible.

elasticsearchmachine commented 2 weeks ago

Pinging @elastic/es-analytical-engine (Team:Analytics)