Store upstream paths in transactions/spans for service maps

dgieselaar commented 3 years ago

We currently walk traces (via a scripted metric aggregation) to get paths/connections between services. However, that's untenable for a couple of reasons:

We need to select traces to inspect first, and then walk the traces. This can be slow in many cases, and it's unpredictable.
A scripted metric aggregation is a foot-gun, and it might be removed from the default distribution in the future, meaning we can no longer rely on it in the APM app.
It depends on the presence of spans, meaning we can't just purely use (transaction, span) metrics to power the UI.

One solution is to store (hashed) paths in transaction or span metrics, per @axw's suggestion.

Here's how that could possibly work:

Each service propagates a hash that uniquely identifies the service + the upstream path. Meaning, the root service A propagates a hash of just service A, service B propagates a hash of service A + service B, and so forth.
This hash is propagated via the tracestate header.
These hashes are also stored on transactions/spans (or the derived metrics).

We should consider the following use cases when deciding where and how to store the hashed paths:

Global service maps
Filtered service maps (e.g., by service name or environment)
Dependency statistics (i.e., metrics for one service directly talking to another service/external dependency)

One requirement is that we should be able to resolve all connections with one or two requests, without using a scripted metric aggregation.

dgieselaar commented 3 years ago

Right now the assumption is that we should store these paths on both spans and transactions:

On spans, store the hash as it is propagated (so including its own hash). This allows us to map a point in a path to a service name, via its hash.
On transactions, store the hash that was received via the tracestate header. This allows us to build paths.

Let's suppose we the following service map:

We can describe it with the following events:

[
  { "processor.event": "transaction", "service.name": "a" },
  { "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-b:3000", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
  { "processor.event": "transaction", "service.name": "b", "transaction.upstream.hash": "hashed-service-a" },
  { "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-c:3001", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
  { "processor.event": "transaction", "service.name": "c", "transaction.upstream.hash": "hashed-service-a" },
  { "processor.event": "span", "service.name": "b", "span.destination.service.resource": "proxy:3002", "span.destination.hash": "hashed-service-a-b", "event.outcome": "success" },
  { "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-b" },
  { "processor.event": "span", "service.name": "c", "span.destination.service.resource": "service-d:3003", "span.destination.hash": "hashed-service-a-c", "event.outcome": "success" },
  { "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-c" },
  { "processor.event": "span", "service.name": "b", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-b", "event.outcome": "failure" },
  { "processor.event": "span", "service.name": "d", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-c-d", "event.outcome": "success" }
]

To get the global service map:

A composite aggregation on transaction.upstream.hash, span.destination.hash, span.destination.service.resource and service.name returns the following data:

  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : null,
      "service.name" : "a",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a",
      "service.name" : "b",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a",
      "service.name" : "c",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-b",
      "service.name" : "d",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-c",
      "service.name" : "d",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a",
      "transaction.upstream.hash" : null,
      "service.name" : "a",
      "span.destination.service.resource" : "service-b:3000"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a",
      "transaction.upstream.hash" : null,
      "service.name" : "a",
      "span.destination.service.resource" : "service-c:3001"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-b",
      "transaction.upstream.hash" : null,
      "service.name" : "b",
      "span.destination.service.resource" : "postgres:3004"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-b",
      "transaction.upstream.hash" : null,
      "service.name" : "b",
      "span.destination.service.resource" : "proxy:3002"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-c",
      "transaction.upstream.hash" : null,
      "service.name" : "c",
      "span.destination.service.resource" : "service-d:3003"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-c-d",
      "transaction.upstream.hash" : null,
      "service.name" : "d",
      "span.destination.service.resource" : "postgres:3004"
    },
    "doc_count" : 1
  }
]

We can then construct the paths by mapping transaction.upstream.hash to span.destination.hash, which will give us connections and paths between services. There are also requests to external services - these leaf nodes can be found by looking for values of span.destination.hash that don't have a corresponding bucket with the same value for transaction.upstream.hash.

For dependency metrics (e.g. request rate from service A to service B or service A to postgres), we should filter the documents on service.name to get the metrics, then an additional request to map the values of span.destination.hash and span.destination.service.resource to either a service name (via transaction.upstream.hash), or - if it's not found - to an external dependency.

dgieselaar commented 3 years ago

We should also look into how this affects the cardinality of transaction/span metrics.

axw commented 3 years ago

@dgieselaar nice!

Say we replaced D with two identical services D1 and D2, and say the proxy load-balances across them. In that case we would have a one-to-many relation from upstream hashed-service-a-b to service names "d1" and "d2". What would we do about span metrics from "b" then? Just show them for the edge between "b" and the proxy, but not from the proxy downstream?

felixbarny commented 3 years ago

Tagging @AlexanderWert who has experience building a metrics-based service map.

eyalkoren commented 3 years ago

@axw what are two identical services D1 and D2 that are interchangeable (load-balanced)? Shouldn't this be considered a wrong setup where user should be advised to set the same service name "D" for both and rely on service.node.name to distinguish between them?

Besides, in @dgieselaar's aggregation example, if they do have different service names configured, the combined key will be different, thus the count can be done separately, or did I misunderstand this?

axw commented 3 years ago

@axw what are two identical services D1 and D2 that are interchangeable (load-balanced)? Shouldn't this be considered a wrong setup where user should be advised to set the same service name "D" for both and rely on service.node.name to distinguish between them?

Sorry, I meant identical in terms of their input/output and interaction with other services, not necessarily the exact same code. They could be two implementations of a service (e.g. you're migrating from a Java to a Go implementation :trollface:), and might have slightly different service.names. Alternatively they could be two instances of exactly the same service, but running in different service.environments (not sure if that should also be included in the hash?)

Besides, in @dgieselaar's aggregation example, if they do have different service names configured, the combined key will be different, thus the count can be done separately, or did I misunderstand this?

From b's perspective: we know we have a path a -> b -> proxy
From d's perspective: we know we have a path a -> b -> d (edges may be indirect, i.e. going through a non-instrumented proxy; d doesn't know about proxy)

If we introduce d2:

From d2's perspective: we know we have a path a -> b -> d2 (d2 doesn't know about proxy)

... and b still doesn't know about either d or d2.

What would we show on the edges proxy -> d and proxy -> d2?

eyalkoren commented 3 years ago

Sorry, I meant identical in terms of their input/output and interaction with other services, not necessarily the exact same code. They could be two implementations of a service ...

Regardless, I believe that any interchangeable nodes (ones that can be load-balanced) should belong to the same service in our terminology and concepts. Any other filtering/aggregation should rely on other data like agent type, environment or node name.

From d2's perspective: we know we have a path a -> b -> d2 (d2 doesn't know about proxy)

I see. Will this be solved if b includes the proxy in the path it sends through the tracestate (meaning - a -> b -> proxy instead of only a -> b)? Alternatively, send the destination in addition to the path hash.

eyalkoren commented 3 years ago

Actually, without this, how would there even be edges proxy -> d and proxy -> d2? Based on what info?

dgieselaar commented 3 years ago

@axw:

Say we replaced D with two identical services D1 and D2, and say the proxy load-balances across them. In that case we would have a one-to-many relation from upstream hashed-service-a-b to service names "d1" and "d2". What would we do about span metrics from "b" then? Just show them for the edge between "b" and the proxy, but not from the proxy downstream?

I didn't intend for the proxy to be shown on the actual service map, my bad. We would ignore it, as we have a match for a span.destination.hash and transaction.upstream.hash, so we would consider it a direct connection between two services.

In this example, I think we could show a split edge from service C to D1/D2, and show the edge metrics once, if that makes sense.

Alternatively they could be two instances of exactly the same service, but running in different service.environments (not sure if that should also be included in the hash?)

Agree that service.environment should be included in the hash, and in the composite aggregation.

AlexanderWert commented 3 years ago

@felixbarny thank you for looping me in. I just wanted to drop in a different idea / approach to realize the service map purely on metric data, thus detaching it from the need of collecting 100% of traces / spans, etc. Feels related to this issue.

The concept is quite simple, based on the following:

As described above, each service would propagate the its own service name (or a hash, doesn't matter)
The called services reads the propagated information and enriches the existing transaction metrics with a "origin" tag.

We would get a set of metrics with the following conceptual structure (here illustrated as a table):

These metrics represent in their tags (origin-service, service) bi-leteral dependencies between services, so they can be used to reconstruct a graph / service map with corresponding metric values attached.

This is just the core idea, if it is of interest I can elaborate more on the details.

With some additional context propagation and tagging of metrics, this approach is quite powerful, and allows for the following (while it is highly scaleable in terms of data collection and query/ data processing):

filtering of the service map based on arbitrary flow characteristics (business transactions, application units, users, etc.)
aggregation of the service map on different levels (application level, service level, node/instance level, region, etc.)
calculation of edge metrics (response times, load)
handling of calls to external services in a similar manner

axw commented 3 years ago

@dgieselaar

In this example, I think we could show a split edge from service C to D1/D2, and show the edge metrics once, if that makes sense.

If I understand correctly, we would have something like (apologies, I do not have @felixbarny's ASCII art mastery):

    (edge metrics here, no indication of split)     
C ----------------------------------------------------->
                                                        |----> D1
                                                        |----> D2

I think that works well. Seeing as the edge metrics are meant to be from C's perspective, I suppose it makes sense that they're not attributed to a particular service on the edges. We can still look at transaction/node metrics for the split.

How would we know that we should remove the proxy from the graph, and that it's in between C and D? Perhaps like @eyalkoren described above, we include the destination service resource (proxy:...) in the outbound hash, and propagate that?

@AlexanderWert thanks for your input!

I just wanted to drop in a different idea / approach to realize the service map purely on metric data, thus detaching it from the need of collecting 100% of traces / spans, etc. Feels related to this issue.

We don't necessarily have to capture 100% of traces/spans. We have recently started aggregating metrics based on trace events in APM Server, and we scale them based on the configured sampling rate. The metrics are then stored and used for populating charts (currently opt-in, expected to become the default in the future.) I think it would make sense to extend these metrics as described above to power the service map.

As described above, each service would propagate the its own service name (or a hash, doesn't matter)

I'd just like to clarify one thing here. IIANM, what you illustrated in the table is a point-to-point graph representation. In that model you're right, it doesn't matter if we propagate the service name or a hash of it (disregarding possible privacy concerns). That's certainly an option, and would keep things fairly simple.

What @dgieselaar has described above is instead a path representation of a graph. This will enable the UI to filter the graph down to a subgraph that includes some node(s), and then only show metrics related to the paths through those nodes and not the excluded nodes. I'd be very interested to hear if you have experience with this approach.

dgieselaar commented 3 years ago

@axw

How would we know that we should remove the proxy from the graph, and that it's in between C and D? Perhaps like @eyalkoren described above, we include the destination service resource (proxy:...) in the outbound hash, and propagate that?

It's removed from the graph by virtue of the span on service C being connected to the transaction on service D, via the hash. I'm not sure if we can tell that there is a proxy in between, or a load balancer, or any other non-instrumented services, even if span.destination.service.resource is included in the hash. But maybe I'm missing something?

eyalkoren commented 3 years ago

I will assume "C" in the last comments was meant to be "B", even though the last one is confusing because there is a c -> d connection as well 🙂

It's removed from the graph by virtue of the span on service C being connected to the transaction on service D, via the hash.

If I read this correctly, it means that given these keys:

  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-b",
      "transaction.upstream.hash" : null,
      "service.name" : "b",
      "span.destination.service.resource" : "proxy:3002"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-b",
      "service.name" : "d",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  }

there is a transaction.upstream.hash matching a span.destination.hash (hashed-service-a-b), which means there is a a -> b -> d path, so the algorithm will ignore the proxy:3002 exit and not treat it as external service.

However, it looks the same as looking at:

  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-b",
      "transaction.upstream.hash" : null,
      "service.name" : "b",
      "span.destination.service.resource" : "postgres:3004"
    },
    "doc_count" : 1
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-b",
      "service.name" : "d",
      "span.destination.service.resource" : null
    },
    "doc_count" : 1
  }

How would you know that postgres:3004 is a real external service and proxy:3002 is a proxy to D? This is something you will be able to tell by sending the destination (or a hash of it) in addition to the path.

I'm not sure if we can tell that there is a proxy in between, or a load balancer, or any other non-instrumented services, even if span.destination.service.resource is included in the hash. But maybe I'm missing something?

I think you are right, this is not enough by itself to discover a proxy. Maybe this is something we can rely on request headers for - I think that the Host header should reflect the host and port used by the requested URI at the client side, so they should match. We have both, but not sure how reliable that is. In addition, the X-Forwarded-For header (or the like) can be used to reveal there is some mid tier.

As for load balancing (@axw's example), assuming we do send the destination, this should be easier - if multiple services (transactions) have the same upstream path AND destination, then you have enough info to add a load-balancer node to the map and have metrics for all edges - the edge to the load balancer and each edge from the load balancer to the service.

dgieselaar commented 3 years ago

Ron suggested to do a POC, perhaps we can pivot https://github.com/elastic/kibana/issues/82598 into one? That way we don't need agent support, and we need to calculate paths there anyway. Thoughts?

dgieselaar commented 3 years ago

@eyalkoren:

How would you know that postgres:3004 is a real external service and proxy:3002 is a proxy to D? This is something you will be able to tell by sending the destination (or a hash of it) in addition to the path.

I'm a little confused by postgres here - should that be something like service-d:3004 vs proxy-to-service-d:3004? I guess what we can get from this is that service B is talking to service D via different addresses. But that also might be because there are different instances of service B?

dgieselaar commented 3 years ago

After a quick call with @eyalkoren, I understand what you mean and you are right: the outgoing hash should include the perceived destination. If we don't do that, when service A is talking to service B and postgres via the same hash (hashed-service-a), we would collapse the service A -> postgres connection into the service A -> service B connection.

eyalkoren commented 3 years ago

One more thing to notice- if service B had two nodes behind a load balancer and the user chose to assign each its own unique service name, say - B1 and B2, then adding the destination helps with that as well - once you see that two services get the same upstream path (including the destination, e.g. hashed-service-a-lb:3002), it is enough for you to draw the load balancer node and have accurate metrics for all edges:

               -----> B1
              |
A ---> LB ----|
              |
               -----> B2

axw commented 3 years ago

Ron suggested to do a POC, perhaps we can pivot elastic/kibana#82598 into one? That way we don't need agent support, and we need to calculate paths there anyway. Thoughts?

Sounds like a good idea to me. Perhaps start with a small POC (e.g. using some hand-written data like above) to validate the idea generally, and then expand on that by generating some complex graph data to test the scalability.

dgieselaar commented 3 years ago

To work around the load balancer issue (which is actually happening on dev-next right now, see https://github.com/elastic/kibana/issues/83152#issuecomment-726729162), we could consider having an called service reply with a response header with its own hash. The calling service would then use this hash when storing span metrics. If the response header is not there, the calling service will hash its own hash + destination.service.resource. This would enable us to correctly map most of the calls. If the call to the load balancer fails, or the response header is not set for some other reason, we could group these metrics together and display them separately.

eyalkoren commented 3 years ago

@dgieselaar response headers are of course an option that opens even more possibilities, however it means an implementation of a new capability by all agents, including the potential added complications (e.g. such related to modifying a response). For a quick POC, why not try out what I suggested in https://github.com/elastic/apm/issues/364#issuecomment-725287027?

dgieselaar commented 3 years ago

@eyalkoren How would we correctly attribute span metrics to either B1 or B2? I thought metrics would be aggregated for A -> LB only.

eyalkoren commented 3 years ago

Let's assume we have these data:

  {
    "key" : {
      "span.destination.hash" : "hashed-service-a-lb:3004",
      "transaction.upstream.hash" : null,
      "service.name" : "a",
      "span.destination.service.resource" : "lb:3004"
    },
    "doc_count" : 278
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-lb:3004",
      "service.name" : "b1",
      "span.destination.service.resource" : null
    },
    "doc_count" : 215
  },
  {
    "key" : {
      "span.destination.hash" : null,
      "transaction.upstream.hash" : "hashed-service-a-lb:3004",
      "service.name" : "b2",
      "span.destination.service.resource" : null
    },
    "doc_count" : 63
  }

Because we append the destination to the path, the fact that two services have the same transaction.upstream.hash implies that they are being load-balanced (there is a service talking to them through the same address). The entry that contains the matching span.destination.hash specifies also the span.destination.service.resource. So, we should be able to tell that service a sent 278 requests to lb:3004, 215 of which were handled by service b1 and 63 by b2. You would use transaction metrics for the lb:3004 -> b1 and lb:3004 -> b2 edges and use the exit span metrics for the a -> lb:3004 edge. Makes sense?

dgieselaar commented 3 years ago

@eyalkoren it does, I was operating under the assumption that we'd use span metrics always. Are there any downsides to mixing those two?

eyalkoren commented 3 years ago

In this case it is actually straightforward to rely on both I think.

In other cases, there may be contradictions, so we will have to decide how we treat those. I agree it will require some thinking. Maybe it even makes sense to put metrics on both sides of edges where relevant 😲

dgieselaar commented 3 years ago

Instead of doing a composite agg on span.destination.hash and transaction.upstream.hash we could possibly use EQL as well. For instance, for joining a span and its resulting transaction, we could use the following EQL query (still using span.destination.service.resource here):

sequence
    [ span where span.destination.service.resource != null ] by span.id
    [ transaction where true ] by parent.id

alex-fedotyev commented 3 years ago

This will be such an awesome improvement! Could you please help me verifying this issue against an API Gateway example below? I've seen it very often in modern environments with our customers.

Assumptions, we instrumented service A,B,C but not D (for various reasons, maybe it is owned by a vendor or can't be instrumented by our agents due to technology limitations). We don't monitor API Gateway today, and will unlikely be able to collect performance information about traces from them. (Although I heard requests in the past to visualize API gateways and load balancers on the map).

With the solutions proposed above, would we be able to successfully draw the right map for this configuration?

Detect and show connections to B and C and corresponding metrics.
And also determine uninstrumented backend(s) (i.e. D) and show their performance metrics separately.

eyalkoren commented 3 years ago

I believe with the current discussed approach you would be able to draw B and C with proper metrics, but not D. I don't think we can rely on attributing excess exit counts from A (based on spans) to an "external service" because mismatches between span metrics and transaction metrics will be common.

In order to support that, we may add a span.destination.service.subresource field that will include the path (or part of the path), eg. /users, /account and /reports and opt in to use it in such cases.

@dgieselaar do I understand correctly that currently the idea is to do a POC with the discussed approach based on span and transaction documents and to apply that in the future to rely purely on stored metrics?

dgieselaar commented 3 years ago

I believe with the current discussed approach you would be able to draw B and C with proper metrics, but not D. I don't think we can rely on attributing excess exit counts from A (based on spans) to an "external service" because mismatches between span metrics and transaction metrics will be common.

If the service responds with its own hash, and the calling service uses that hash to store its span metrics, we would not need transaction metrics, and we would have an "other" bucket where D would fall under, but calls that fail at the gateway or network issues would also fall into that bucket.

@dgieselaar do I understand correctly that currently the idea is to do a POC with the discussed approach based on span and transaction documents and to apply that in the future to rely purely on stored metrics?

Yes, but I'm not sure if we will get to that in the 7.11 timeframe. Might need @sqren or @graphaelli here for some prioritisation. Also, there are a couple of approaches in play, I'm not sure if we decided which one is best. We can investigate some of it in a POC.

eyalkoren commented 3 years ago

but calls that fail at the gateway or network issues would also fall into that bucket.

Exactly, so in this regard it means adding complication but being left with the same limitation. Anything we can do with existing data is highly preferable and can be POC'd right away. I don't say response headers are out of the question and I recognise they have potential for additional value, but they will delay your POC quite a bit and they will probably delay GA, so if we can have something useful without them, I think it is a good start.

Moreover, one limitation to keep in mind with response headers is that they will not be able to support async communication, like messaging systems. If you think of a message bus used to create requests to multiple services, you can support that through the use of different destination resources/sub-resources (e.g. message queues/topics), but response headers are irrelevant for such use cases.

sorenlouv commented 3 years ago

Yes, but I'm not sure if we will get to that in the 7.11 timeframe. Might need @sqren or @graphaelli here for some prioritisation. Also, there are a couple of approaches in play, I'm not sure if we decided which one is best. We can investigate some of it in a POC.

We are already pretty strapped for time and since service maps is not on the roadmap goals for 7.11 any bigger improvements will have to wait until 7.12.

dgieselaar commented 3 years ago

Anything we can do with existing data is highly preferable and can be POC'd right away. I don't say response headers are out of the question and I recognise they have potential for additional value, but they will delay your POC quite a bit and they will probably delay GA, so if we can have something useful without them, I think it is a good start.

Can you elaborate why no data changes are needed for your suggested approach? AFAICT, there needs to be some kind of property on transaction metrics that identifies its upstream, and I don't think we have that yet?

eyalkoren commented 3 years ago

Can you elaborate why no data changes are needed for your suggested approach? AFAICT, there needs to be some kind of property on transaction metrics that identifies its upstream, and I don't think we have that yet?

You're right, we need to implement the addition to tracestate and the addition of these hashes to transactions/spans. Can we eliminate the need for those if we rely on response headers? If so, it certainly worth considering now. If not, the usage of the destination to solve this problem will rely on existing field/s, so what I suggest is start with that and add response headers if we decide later.

alex-fedotyev commented 3 years ago

In order to support that, we may add a span.destination.service.subresource field that will include the path (or part of the path), eg. /users, /account and /reports and opt in to use it in such cases.

Just to add here - we don't know ahead of time how the destination route is defined in terms of API gateway rules. It could be as simple as a part of the URL, but it also could be something more advanced like HTTP headers or something else.

Would it be possible to dynamically calculate traffic to D based on overall traffic to host:port minus traffic to A and B?

eyalkoren commented 3 years ago

we don't know ahead of time how the destination route is defined in terms of API gateway rules. It could be as simple as a part of the URL, but it also could be something more advanced like HTTP headers or something else.

My suggestion is to make it opt in anyway, so you might as well add a complex (meaning - non-boolean) config to define the routing factor. But I'd say this is for advance use cases.

Would it be possible to dynamically calculate traffic to D based on overall traffic to host:port minus traffic to A and B?

Probably not, especially if we are moving to rely on metrics in the future. When using metrics you must assume discrepancies between metrics reported about the same connection from two different services, even if you enforce rigid synchronization of metric collection and sending. Using counts in such metrics (as opposed to rates/percentages) may be very tricky. Even without moving to metrics, any dropped transaction will cause the creation of a false node.

eyalkoren commented 3 years ago

Another thing to keep in mind when implementing this mechanism: there are cases where the middle-node is such that we DO want to show, even if there is only one direct connection, for example - message queues. According to the current suggestion, whenever there is a single match between span.destination.hash and transaction.upstream.hash, this will result with one direct edge. However, if the span.destination.hash comes from messaging spans, we want to show the message queue between the two services.

axw commented 3 years ago

Alternatively they could be two instances of exactly the same service, but running in different service.environments (not sure if that should also be included in the hash?)

Agree that service.environment should be included in the hash, and in the composite aggregation.

@sqren @felixbarny and I have been discussing adding configuration to APM Server to set a default value for service.environment where unspecified by agents.

I don't think it makes sense to include the environment in the hash only sometimes (i.e. only when the environment is known to the agent), as that way we could end up with the same hash for multiple environments (i.e. when the default is changed); so I think we'll have to leave it out altogether. @dgieselaar do you you see any problem with that?

dgieselaar commented 3 years ago

Do you mean leaving out the environment altogether? I think we decided to show calls to a service in different environments separately (correct me if I'm wrong @sqren @formgeist). If the service environment is not included in the hash, I think we will collapse multiple environments into one, is that correct?

axw commented 3 years ago

If the service environment is not included in the hash, I think we will collapse multiple environments into one, is that correct?

Yes. Kinda. We can still filter on service.environment in the metrics docs, but that's only pertinent to the service that generated spans. What you won't get is separation of metrics by upstream service environment.

dgieselaar commented 2 years ago

I've implemented a POC using edge metrics in https://github.com/elastic/kibana/pull/114468. It roughly does as described here. It also shows message queues and load balancers.

However, it does/will not work for Otel, or older agents. My main goal here is to get rid of the scripted metric aggregation, so I want to propose the following alternative for Otel/older agents:

Get a set of trace ids to inspect by using a composite aggregation on span events on service.name and span.destination.service.resource (including null values for the latter). When calculating the service map for a specific service, this request filters for that specific service.
Use the trace ids in a second request to get span ids, by using a composite aggregation on service.name and span.destination.service.resource. This request is not additionally filtered on the possibly focused service.
Use the span ids in a third request, to get the corresponding transactions created on a downstream service.
In parallel to this request, use the service name and span destination from the second request to query span destination metrics
Build the service map by displaying all connections found in the inspected traces.

The main difference here between the hashed approach is that we'll display any service that was part of an inspected trace. So if service A is talking to service B and service C in the same trace, and we focus on service B, we display both service A => B and service A => C. Is that an acceptable tradeoff?

axw commented 2 years ago

@dgieselaar nice! Sorry it took me a while to get to it, I checked out your POC and it looks sensible.

The main difference here between the hashed approach is that we'll display any service that was part of an inspected trace. So if service A is talking to service B and service C in the same trace, and we focus on service B, we display both service A => B and service A => C. Is that an acceptable tradeoff?

I think this is fine. There are likely other data sources we can consume which will only provide point-to-point edge information (I'm thinking of Istio/service meshes), and not paths, meaning we could not filter down to only the relevant paths for a given service.

Do you have an idea of what the performance is like for your proposal for old/OTel agent data, compared to the current scripted metric agg approach? I think it sounds OK, but there will likely be extensive period where users will be running older agents.

felixbarny commented 2 years ago

I have some concerns about the general approach of relying on propagating path hashes downstream and the fact that this makes OTel traces a 2nd class citizen.

The proposal works for situations where all data is coming from up-to-date Elastic APM agents. However, in scenarios where there's a mix of OTel agents and different versions of Elastic agents, it doesn't work reliably anymore.

Seems like there a lots of different goals when it comes to service map improvements. Let me try to untangle them.

Edge metrics I don't quite understand why we can't overlay the span destination metrics on the existing service map. Is the issue that it's currently too complex to do, would it be too slow, or is it technically not possible? Couldn't we just make rendering the service map a two-step process? In the first step, we discover the services and the connections between them. We can remember which service.destination.resource equates to which downstream service. In a second step, we can asynchronously (after the initial map has been rendered) load the edge metrics for each point-to-point connection by looking up the corresponding span destination metrics.
Removing the scripted metric aggregation Do you have a sense of how your OTel/older agents proposal performs compared to the current scripted metric aggregation? If it performs similarly, could we just migrate to that for traces from both OTel and our own agents? I realize that this wouldn't allow for filtering by paths but at least we could get rid of the scripted metric aggregation. While metrics for unique paths are certainly interesting and powerful, maybe a point-to-point-based service map is good enough and easier to get working when diverse agents are used.
Path-aware service map/metrics I'm not quite sold on that benefits outweigh the complexity, metrics storage cost increase (due to the higher cardinality for the path hash), and brittleness when it comes to mixing agents.
Performance improvements Instead of detecting the services and their interactions ad-hoc when the service map is rendered, we could also pre-calculate that based on trace documents and store a service graph snapshot in a dedicated index. The big question is how, and I don't have a good answer here. Most likely, we need more building blocks from the stack here. Stream processing and partitioning the trace id would be ideal but background processing, similar to data transforms, or even an external service that pre-calculates and indexes the graph representation would work, too. Similar to what I've described above, we could overlay that graph model with the existing service destination metrics to get metric-based edge metrics. Aside from a faster service map, I think this could open up more use cases. If we have a pre-computed model about which services run on which hosts, availability zones, etc, and which services depend on one another, we can be smarter about alert grouping and root cause detection. It could also help to more smoothly go from an infrastructure-based view to an application-layer view and vice-versa. But that's definitely beyond the scope of service map improvements.
Avoid basing the service map on transaction and span documents. Instead, purely use metrics. Isn't that one implementation approach to facilitate performance improvements, rather than a top-level goal?

dgieselaar commented 2 years ago

@axw:

Do you have an idea of what the performance is like for your proposal for old/OTel agent data, compared to the current scripted metric agg approach? I think it sounds OK, but there will likely be extensive period where users will be running older agents.

Haven't measured, but it's 3 consecutive requests, so it'll be slower. A lot of it will depend on the sampling rate.

@felixbarny:

Edge metrics I don't quite understand why we can't overlay the span destination metrics on the existing service map. Is the issue that it's currently too complex to do, would it be too slow, or is it technically not possible? Couldn't we just make rendering the service map a two-step process? In the first step, we discover the services and the connections between them. We can remember which service.destination.resource equates to which downstream service. In a second step, we can asynchronously (after the initial map has been rendered) load the edge metrics for each point-to-point connection by looking up the corresponding span destination metrics.

We can overlay them, but it won't solve any performance issues.

Removing the scripted metric aggregation Do you have a sense of how your OTel/older agents proposal performs compared to the current scripted metric aggregation? If it performs similarly, could we just migrate to that for traces from both OTel and our own agents? I realize that this wouldn't allow for filtering by paths but at least we could get rid of the scripted metric aggregation. While metrics for unique paths are certainly interesting and powerful, maybe a point-to-point-based service map is good enough and easier to get working when diverse agents are used.

The scripted metric aggregation is mostly very unpredictable in terms of performance. E.g., if your traces are very long, it will use a lot of memory (and even cause an OOM). The new proposed approach might be a bit slower in some cases but it will definitely be more robust. Fwiw, I think it's reasonable to migrate to that approach first, and then figure out using metrics later.

Path-aware service map/metrics I'm not quite sold on that benefits outweigh the complexity, metrics storage cost increase (due to the higher cardinality for the path hash), and brittleness when it comes to mixing agents.

I think that's a fair concern - we might want to consider making it optional in any case.

Performance improvements Instead of detecting the services and their interactions ad-hoc when the service map is rendered, we could also pre-calculate that based on trace documents and store a service graph snapshot in a dedicated index. The big question is how, and I don't have a good answer here. Most likely, we need more building blocks from the stack here. Stream processing and partitioning the trace id would be ideal but background processing, similar to data transforms, or even an external service that pre-calculates and indexes the graph representation would work, too. Similar to what I've described above, we could overlay that graph model with the existing service destination metrics to get metric-based edge metrics. Aside from a faster service map, I think this could open up more use cases. If we have a pre-computed model about which services run on which hosts, availability zones, etc, and which services depend on one another, we can be smarter about alert grouping and root cause detection. It could also help to more smoothly go from an infrastructure-based view to an application-layer view and vice-versa. But that's definitely beyond the scope of service map improvements.

Yeah, I don't think there's a way to reliably do this today, or in the near future. Plus there are downsides to post-processing (historical data, searchable snapshots, etc).

Avoid basing the service map on transaction and span documents. Instead, purely use metrics. Isn't that one implementation approach to facilitate performance improvements, rather than a top-level goal?

It is, but the performance/reliability improvements are pretty important IMO, and I have not seen an alternative yet. Additionally, requiring transactions/spans for the service map to work means that as soon as those are dropped, it'll break, and the accuracy will decrease as the sampling rate goes down.

felixbarny commented 2 years ago

A lot of it will depend on the sampling rate.

Can we limit the number of traces we're looking at to detect connections?

The new proposed approach might be a bit slower in some cases but it will definitely be more robust. Fwiw, I think it's reasonable to migrate to that approach first, and then figure out using metrics later.

That's great!

We can overlay them, but it won't solve any performance issues.

ack

I think that's a fair concern - we might want to consider making it optional in any case.

I think we should be opinionated and have exactly one way that a service map is drawn. My main concerns are usability, maintenance and code complexity. To maintain backwards compatibility, we may be forced to maintain two implementations at the same time. If that gets multiplied by different ways to draw the service map things get even more dire.

Yeah, I don't think there's a way to reliably do this today, or in the near future. Plus there are downsides to post-processing (historical data, searchable snapshots, etc)

Agreed. But mid- to long-term it's something we could work on together with the stack team.

as soon as those are dropped, it'll break, and the accuracy will decrease as the sampling rate goes down.

That's unless we store a persistent representation of the service connections that still works when traces are deleted.

dgieselaar commented 2 years ago

Can we limit the number of traces we're looking at to detect connections?

We have to use a composite agg to get a diverse set of traces, and that means iterating over the whole data set (at least, restricted by transactions and exit spans). The amount of traces we inspect can be a constant and a relatively low amount.

I think we should be opinionated and have exactly one way that a service map is drawn. My main concerns are usability, maintenance and code complexity. To maintain backwards compatibility, we may be forced to maintain two implementations at the same time. If that gets multiplied by different ways to draw the service map things get even more dire.

I too would like to have one way - but I don't think these will result in vastly different implementations, at least not on Kibana's side. One issue I do see is that we cannot (easily) auto-detect which version to use.

Agreed. But mid- to long-term it's something we could work on together with the stack team. That's unless we store a persistent representation of the service connections that still works when traces are deleted.

IMO these options are so far out that I feel we are better off ignoring them for the sake of this conversation.

There's one more downside to the OTel/legacy approach: it can only see connections from the perspective of the caller. Let's say you have service A talking to a messaging system, and that messaging system is talking to service B and C. With Otel, we might only show Service A => messaging system => service B (or service C) due to the sampling approach. This is not an issue with the hashed-based approach.

dgieselaar commented 2 years ago

We met this morning and agreed to move forward with the Otel based approach as a first step. This achieves the goal of removing the scripted metrics aggregation and supporting OTel/legacy agents without any work needed from the agents. Once this is implemented we can evaluate performance and accuracy more appropriately and decide whether or not additional work is needed.

There are some gaps with the Otel based approach that I will try to list here. The main issue is that it is hard to get a representative set of traces without hashes. This is specifically an issue with the focused service map but there are also some scenarios where the global service map might not display some connections. I’ve wireframed some example scenarios:

In the first scenario, we are looking at a focused service map for service C. We get sample traces that include A) a transaction happening on service C, and B) an exit span that talks to service D. This means that it is not guaranteed that we’ll see traces for both A and B talking to C, and D talking to F and E. The dashed connections might not show up, which is more likely as the amount of traces to inspect go up and also when certain connections are occurring more often than others.

In the second scenario, we are looking at a global service map. We get sample traces that include A) a transaction happening on service A, B, C, and D, B) an exit span on service A that talks to an API gateway (#1).

In this scenario, we will see at least the connection between the API gateway and service D, but it is not guaranteed that we’ll see the connection to service C, for the same reason as previously listed - we might have sampled a trace to service D only, based on the exit span.

felixbarny commented 2 years ago

I wonder if using a diversified sampler aggregation on the transaction name field could improve the chances of detecting more connections. In cases where a particular transaction group is dominant, we also connect connections that come from more rare transaction groups.

dgieselaar commented 2 years ago

I wonder if using a diversified sampler aggregation on the transaction name field could improve the chances of detecting more connections.

it might. there's a perf hit that comes with it as well though.

dgieselaar commented 2 years ago

@alex-fedotyev if I look at some of the perf issues that our bigger users have the service map always comes up. I don't think we can easily improve performance with the OTEL-compliant approach without making the data it displays even more unreliable. Does it still make sense to prioritise an OTEL-compliant approach?

elastic / apm

Store upstream paths in transactions/spans for service maps #364