elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.72k stars 8.13k forks source link

[APM] Avoid using _source for OTel compatibility #189947

Open gregkalapos opened 1 month ago

gregkalapos commented 1 month ago

As we work towards OTel native support, we expect data to be stored in more OTel native format in Elasticsearch. E.g. see: https://github.com/elastic/elasticsearch/pull/111091

The result of this is that the shape of the data will be different compared to what we currently have in the APM data streams.

At the same time, we also add a compatibility layer to make sure the current UI works with the new data. This layer is mainly based on aliases and passthrough fields.

The problem where this currently breaks is that the UI in some cases uses _source to accesses data. That is currently a blocker for the compatibly layer as some of these fields are not directly available under _source.

Specific example:

On the service summary page in this part the UI accesses fields from _source to populate the icons: https://github.com/elastic/kibana/blob/71670968ad6115ffac8d221fb0da400363ffae1a/x-pack/plugins/observability_solution/apm/server/routes/services/get_service_metadata_icons.ts#L75

In this example we have a field that stores agent name.

Here is how an OTel native data in Elasticsearch will look like:

{
  "@timestamp": "2024-08-05T18:31:19.828218000Z",
  "attributes": {
    "metricset.interval": "1m",
    "metricset.name": "service_transaction",
    "processor.event": "metric",
    "transaction.root": true,
    "transaction.type": "unknown"
  },
  "data_stream": {
    "dataset": "generic.otel",
    "namespace": "default",
    "type": "metrics"
  },
  "metrics": {
    "transaction.duration.histogram": {
      "counts": [
        1
      ],
      "values": [
        12500
      ]
    }
  },
  "resource": {
    "attributes": {
      "metricset.interval": "1m",
      "service.name": "sendotlp",
      "some.resource.attribute": "resource.attr",
      "telemetry.sdk.language": "go",
      "telemetry.sdk.name": "opentelemetry",
      "telemetry.sdk.version": "1.28.0",
      "agent.name": "opentelemetry/go",
      "agent.name.text": "opentelemetry/go"
    },
    "dropped_attributes_count": 0,
    "schema_url": ""
  },
  "scope": {
    "name": "otelcol/spanmetricsconnectorv2"
  }
}

See field resource.attributes.agent.name - that is how we store attributes in OTel native data. Everything under resource.attributes can be queried as a top level field, but those fields under _source are of course still under resource.attributes.*. So in practice there is an alias from agent.name to resource.attributes.agent.name.

Currently the query above does something like this:

{
               "track_total_hits": 1,
                "size": 1,
                "_source": [
                    "kubernetes",
                    "cloud.provider",
                    "container.id",
                    "agent.name",
                    "cloud.service.name"
                ],

               "query": {
                    "bool": {
                        "filter": [
                            {
                                "terms": {
                                    "processor.event": [
                                        "metric",
                                        "error",
                                        "metric"
                                    ]
                                }
                            }
                        ],
                        "must": [
                            //... rest of the query
}

Where agent.name will not be returned, because it's used from _source.

Question is: is using _source needed? If e.g. this would be rewritten to use the fields API, then this will work:

                "size": 1,
                "fields": [ //<--- here use `fields` instead of `source`
                    "kubernetes",
                    "cloud.provider",
                    "container.id",
                    "agent.name",
                    "cloud.service.name"
                ],

               "query": {
               //... rest of the query

Of course there may be other ways to do it and there may be some downside of using fields - which I don't know of.

So the 1. proposal is to check if using the fields API is acceptable and if the answer is yes, then the APM UI should move to using that instead of _source. If that's not possible, we should discuss other options.

### Non exhaustive list of \_source usages
- [ ] get_service_metadata_icons.ts
- [ ] get_trace_samples_hits
- [ ] get_error_group_main_statistics
### Sub tasks
- [ ] Step 1: Replace _source with fields queries: https://github.com/elastic/kibana/issues/192606
- [ ] Step 2a: Refactor the normalize function into per data structure serialization logic
  - [ ] Adapt all tests where needed
- [ ] Step 3: Make stacktrace and span links work for OTel data (not a blocker for 8.16)
elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

cauemarcondes commented 1 month ago

From what I could see we can change to fields.

@gregkalapos There are other places on APM where we use the _source response. Do we need to change all those places too?

gregkalapos commented 1 month ago

From what I could see we can change to fields.

Nice 🎉 Great to hear that.

@gregkalapos There are other places on APM where we use the _source response. Do we need to change all those places too?

Yes, this issue is about considering completely moving away from _source in general as it can break our OTel effort. The one above was just one specific example to help understanding, but it's a general issue.

felixbarny commented 1 month ago

From an efficiency perspective particularly in combination with synthetic _source, using fields is preferable over using _source filtering. That's because the full _source first needs to be synthesized using all fields and then only a subset of _source is returned. With synthetic _source, there's an overhead proportional to the number of fields that are fetched. Re-constructing _source needs to fetch all fields.

carsonip commented 1 month ago

get_trace_samples_hits is also using _source: https://github.com/elastic/kibana/blob/74c9570258f9c58d6d84272c5c96d5b5b2282d6e/x-pack/plugins/observability_solution/apm/server/routes/transactions/trace_samples/index.ts#L112C1-L117C9

This causes an error in the UI. image

carsonip commented 4 weeks ago

Same for get_error_group_main_statistics:

{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "error"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "term": {
                  "service.name": "sendotlp"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1724164013052,
                    "lte": 1724164913052,
                    "format": "epoch_millis"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "error_groups": {
      "terms": {
        "field": "error.grouping_key",
        "size": 500,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "sample": {
          "top_hits": {
            "size": 1,
            "_source": [
              "trace.id",
              "error.log.message",
              "error.exception.message",
              "error.exception.handled",
              "error.exception.type",
              "error.culprit",
              "error.grouping_key",
              "@timestamp"
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        }
      }
    }
  }
}

I'm going to start a tasklist in this issue to capture _source usages we've encountered

felixbarny commented 4 weeks ago

One usage of _source that we probably can't remove but need to adjust for OTel are span links. They're stored differently but only the _source will have the right ordering of the object array.

There are also other aspects of span links (in particular incoming links from other traces) that need adjustment for OTel.

bryce-b commented 3 weeks ago

I've run into a couple of queries that are a bit tricky and I'm not sure the best way to go about resolving them. For example : https://github.com/elastic/kibana/blob/0c911c8f6217cd97d20b9086f9092e74569a2e63/x-pack/plugins/observability_solution/apm/server/routes/traces/get_trace_items.ts#L209

This returns an entire transaction with a nested format e.g.:

{
   ....
   transaction : {
     id: "1234",
     type: "page-load",
     duration : { 
       us: 123456
     }
     ....
   }
   ....
}

where fields will return :

{ 
  "transaction.id" : ["1234"],
  "transaction.type" : ["page-load"],
  "transaction.duration.us" : [1234567],
}

should the new field responses be marshaled into the nested format, or should downstream dependencies be rebuilt to use the new format?

bryce-b commented 2 weeks ago

I've got an initial PR covering a few APIs so far: https://github.com/elastic/kibana/pull/191647 I went with updating the downstream dependencies to avoid data processing in the browser, with is the preference of the UI team.