elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0
373 stars 111 forks source link

Spec: enhance dependencies granularity for other backends/dependencies #646

Open SylvainJuge opened 2 years ago

SylvainJuge commented 2 years ago

This is a follow-up task of #621 that was focused on relational databases.

For other types of dependencies, we need to define the values for following fields:

The specification will likely rely on the tests/agents/json-specs/span_types.json file that defines the types and sub-types for each kind of backend service which appears in the map or dependencies.

The goal of this specification would be to fill the gaps and remove the question marks in this table :

For each row, we will have service.target.type = subtype.

type subtype destination.service.resource service.target.name related PR/issue
db cassandra casandra, casandra/${cluster} ${cluster} #654
db cosmosdb ? ? #661
db dynamodb dynamodb, dynamodb/${cloud.region.name} ${cloud.region.name} #654
db elasticsearch elasticsearch, elasticsearch/${cluster} ${cluster} #654
db graphql ? ?
db mongodb mongodb, mongodb/${db.instance} ${db.instance} #654
db memcached memcached, ? ?
db redis redis, ? ?
external dubbo ${host}:${port} ${host}:${port}
external grpc ${host}:${port} ${host}:${port}
external http ${host}:${port} ${host}:${port}
external ldap ldap ?
messaging azurequeue azurequeue, azurequeue/${queue.name} #655
messaging azureservicebus azureservicebus, azureservicebus/${queue.name} (or topic). #655
messaging jms jms, jms/${queue.name} ${queue.name} #655
messaging kafka kafka, kafka/${queue.name} ${queue.name} #655
messaging rabbitmq rabbitmq, rabbitmq/${queue.name} ${queue.name} #655
messaging sns #655
messaging sqs #655
storage azureblob
storage azurefile
storage azuretable
storage s3 ?

Given there is quite a lot to define, we might split it into smaller issues / PRs.

adrian-skybaker commented 2 years ago

Where would the right place to raise feedback about the use of ${host}:${port} for HTTP resources? This is quite an obfuscated way of describing them if you have many paths under a single host, eg an organisation wide API Gateway. Some way of allowing the path/uri-template to be introduced in a controlled way would be useful.

SylvainJuge commented 2 years ago

Hi @adrian-skybaker , here we are trying to properly describe the "destination service" of an external call, from the perspective of the caller, for example when using an HTTP client in an application that is instrumented by APM agents.

So, from the perspective of the application, the destination (or service target to fit the new fields names), there is no way to know if a given HTTP call will be routed to a specific host or service, all we know is the next destination that is described by the ${host}:${port} string.

In the case of an API gateway, using an APM agent on the gateway itself could help to show the routing to downstream services. Also, the instrumentation of downstream services should enable to display them in the map properly, albeit not with the intermediate API gateway if the latter is not instrumented.

adrian-skybaker commented 2 years ago

In the case of an API gateway, using an APM agent on the gateway itself could help to show the routing to downstream services.

This isn't possible when using hosted or appliance-type gateway solutions such as AWS API Gateway or Azure API Manager. This problem is compounded if this gateway is used as a proxy for outbound 3rd party requests, in order to secure/manage external communication outside of a secure network: all your external resources now look the same to APM.

Also, the instrumentation of downstream services should enable to display them in the map properly,

This isn't feasible for genuinely external/3rd-party services. An example of a common pattern is https://auth0.com/docs/api/management/v2 - tens or potentially hundreds of different services accessed under a common host. Nor is it feasible for internal legacy services that cannot be modified to attach agents.

I realise that by default {host}:{port} is the only practical default, however the challenge we have is that there appears to be no option in the APM HTTP instrumentation to intervene to override this default. For example, in the Java agent all of the HTTP instrumentation I can see creates a new span and assigns a new resource name in a closed fashion that cannot be controlled (see https://github.com/elastic/apm-agent-java/blob/main/apm-agent-plugins/apm-httpclient-core/src/main/java/co/elastic/apm/agent/httpclient/HttpClientHelper.java).

This means even though the surrounding HTTP client frameworks (Java examples being Feign or JAX-RS client) "know" the resource name (accounting for URI templating to avoid explosions of URIs when data appears in paths), it's still impossible to make APM aware of this.

The result is that the usefulness of our "dependencies" view in APM is quite limited.

eyalkoren commented 2 years ago

@SylvainJuge if the HTTP client provides interceptors (or similar concept) to allow observing requests, can the new related API (e.g. the Java not-yet-release setServiceTarget() API) be used for that? I am not sure whether we currently rely on the server side being aware of the service.target used by the client. If not, maybe this is a way to support that.

Otherwise, it worth considering adding some kind of wildcard matching configuration option like url_groups to allow such custom finer granularity for HTTP.

SylvainJuge commented 2 years ago

As @eyalkoren just said, for now the best option is to use interceptors and call setServiceTarget that are specific to your HTTP client to override the default values.

In the future, if we get other similar feedback on default HTTP destination description (this is the first we got so far) to rely on URL, then adding an option similar to url_groups could be relevant. Also, there could be a high number of other possibilities: HTTP headers, URL parameters, parts of the host name, ..., so we need first to validate that an extra option here would allow to cover most of the use-cases.

eyalkoren commented 2 years ago

Just to make clear how we think interceptors may be used: the way we capture HTTP spans is by instrumenting the client, start a span at the request start and end the span at the request end. During this time, we make sure that the span is made the "active" span on the request-making thread. So if your interceptor's entry is invoked after the HTTP had been started and activated, you can use the API to get it through ElasticApm.currentSpan() and then set the service target thus controlling how your service map will be assembled. Note that this is Java specific, not all agents do the same.

adrian-skybaker commented 2 years ago

So let's take my current stack as an example - targets are declared in https://github.com/OpenFeign/feign, which is delegating to Apache HTTP Client for the actual HTTP communication. And I believe Elastic APM's Java agent is automatically instrumenting Apache HTTP Client (though that's automatic + opaque to me).

By "interceptor", I believe you're proposing using something like https://hc.apache.org/httpcomponents-core-5.1.x/current/httpcore5/apidocs/org/apache/hc/core5/http/HttpRequestInterceptor.html ? I can try this, but I can already see a couple of issues.

  1. Apache HTTP client is a low level component that's unaware of the concept of a "target" as it is declared in Feign. By the time the request reaches this level, any URI templating has already been done. I can try hacks to try and communicate this value using a thread local or a custom request header to pass this information down to the http client, but I can't see it being straightforward. In Java this pattern is the norm - all of the higher level HTTP frameworks I'm aware of where high level "targets" are declared with URI templates (Jersey JAX-RS Client, Spring WebClient, Feign) all then delegate to lower level HTTP client libraries (apache http, the native Java client, etc).

  2. I have to make sure any interceptor runs after the low-level Elastic http instrumentation opens the new span. Delving in to the source, this doesn't look promising - APM instruments org.apache.http.impl.execchain.ClientExecChain, which seems to be called after any HttpRequestInterceptor. So it seems even adding an interceptor wouldn't work (and in any case even if it did, seems like it's quite fragile in relying on a very internal detail of where in the client APM decides to start the span).

It would be much easier if I could just call Elastic APM at the feign interceptor level (this would be preferable even over url_groups, as we can code a single generic interceptor to use everywhere), which is what I first tried. But at this higher level anything set on the span is discarded.

eyalkoren commented 2 years ago

It would be much easier if I could just call Elastic APM at the feign interceptor level

@jackshirazi this sounds like a perfect candidate for an external plugin. Once we have the blog posts to share about that, please do. @SylvainJuge are there OTel attributes that automatically get mapped to our relevant fields by APM Server- for service maps (destination target etc.), contextual data (status code etc.) and so forth?

jackshirazi commented 2 years ago

https://www.elastic.co/blog/create-your-own-instrumentation-with-the-java-agent-plugin

@jackshirazi this sounds like a perfect candidate for an external plugin. Once we have the blog posts to share about that, please do.

adrian-skybaker commented 2 years ago

I'm not sure how custom instrumentation would solve this problem, at least not without a substantial amount of coding and recreation of built-in instrumentation.

Without having tried it, I don't believe it would be as simple as writing instrumentation just for the higher level Feign layer, because creating an exit span at this higher level would be too early: there are still potentially substantial layers of libraries and code still to be called at this point before the request is sent. As the built-in HTTP client instrumentation does, the exit span needs to be started late, when the actual request is made, down in the low level HTTP code.

Instead it seems like we would need to implement two layers of instrumentation - one to inject the detailed HTTP service name from the higher level framework (in this case Feign). Then a forked version of whichever HTTP client library instrumentation (at least apm-apache-httpclient-plugin, though the client used can be easily changed by config, so potentially 2-3 other equivalents as well). This forked version would be able to "inherit" a service target name at the time parent.createExitSpan(); is called.

The overall challenge I see is that, as far as I can tell, when the exit spans are created by the built-in HTTP instrumentations, there is no way to inherit or inject the service target property from the parent span.

SylvainJuge commented 2 years ago

For me we can split this into two sub-issues/challenges:

  1. define an heuristic to extract the appropriate service.target.name value from an exit span, in the case of an HTTP request it could be part of the URL, some header values, ...
  2. allowing modification of the span to set the service.target.name before it is sent to APM Server.

First, for (1) we have to deal with a very high diversity of protocols, if we limit ourselves to HTTP it could be anything from headers, host name, network port, a part of the URL path or query string, as a consequence it would be quite challenging to cover everything through an API or configuration options.

For (2), some agents provide a way to "post-process" the data before it is sent, which maybe in this case would be a relevant option.

I'm not sure if it's a good idea, but if we had the capability to post-process the spans/transactions in the Java agent (which we don't yet), and the heuristic to set the service.target.name can rely only on the attributes that we already capture, then you could implement your desired behavior through post-processing logic as the original URL would be available in the span data.

In your case, does the backend services behind the gateway are also instrumented by the APM agent ? If yes, then the gateway should not be shown as part of the dependencies nor be visible at all. If those backend services aren't instrumented, then the gateway behaves like an abstraction of those and thus becomes a dependency.

adrian-skybaker commented 2 years ago

My view is that (2) is a much better category of solution, because it means you can write a one-off, generic piece of code for a higher level framework that will work in all cases (eg in Java's Feign it would involve a single feign.RequestInterceptor in library code we would then use as standard in many services). Potentially this type of instrumentation could eventually form part of the agent codebase itself.

An alternative but similar solution to post-processing might be to allow inheritance of service.target.name when parentSpan.startExitSpan(...) is called, in some standardised way. In practice even with post-processing any library code we'd write would try and get us to this state anyway: we'd write a very generic post-processor that looked for something populated by a parent span by more specific processing.

In constrast, (1) seems like more like the url_groups concept: it would likely require manually duplicating URL resource names/patterns in separate Elastic config, for every dependency, in every application.

In your case, does the backend services behind the gateway are also instrumented by the APM agent ?

Generally not, the focus for this issue for us has been dependencies which are effectively "external".

SylvainJuge commented 2 years ago

@adrian-skybaker I think for your use-case you could easily use ingest pipelines, that would allow you to use parts of the collected data (which includes the complete URLs of the HTTP client calls) and set the values you want in the service.target.name and destination.service.resource (the UI relies on the latter as of 8.3, but will rely on both in 8.4 and later).

One limitation of this approach though, is that you can only rely on the data that is already captured by the agents, but in your case that does not seem to be an issue.

AlexanderWert commented 2 years ago

See: