elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0
384 stars 114 forks source link

Simplify heuristic for ES cluster name #688

Closed SylvainJuge closed 2 years ago

SylvainJuge commented 2 years ago

/!\ Update: the current state of the proposal is summarized in this comment below.


Initial idea was to capture Elasticsearch cluster name using this strategy.

However, after doing a PoC with Java agent (https://github.com/elastic/apm-agent-java/pull/2796) and talking with @swallez (ES clients team): The cluster_name defaults to elasticsearch , but is usually set by the user for on-premises deployments, in cloud it is a random hex string like 3a29c1a34e2fba92d245020741e38977.

For (1), if there was such storage it would be added only in newer clients

For (2), the implementation is trivial, the only downside is that it's not available outside cloud.

For (3), in order to be relevant, it would assume that the application does make a call to GET / or similar, which is very unlikely on most applications.

For (4), it seems to be an acceptable compromise in most cases, but there are caveats

Proposal

Things to clarify/discuss

apmmachine commented 2 years ago

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

#### Build stats * Start Time: 2022-10-24T04:01:00.926+0000 * Duration: 3 min 23 sec

basepi commented 2 years ago

I am pretty allergic to (4). It feels very against our "don't interfere" policy to actually do something to the app. I think I would want it to be opt-in, not opt-out. With the company-wide emphasis on Cloud (which is already solved by (2)), I think opt-in is safer (and most users won't need it since they'll be on Cloud).

I think we risk compromising user trust if we're making unexpected requests against their infra.

SylvainJuge commented 2 years ago

I am pretty allergic to (4). It feels very against our "don't interfere" policy to actually do something to the app. I think I would want it to be opt-in, not opt-out. With the company-wide emphasis on Cloud (which is already solved by (2)), I think opt-in is safer (and most users won't need it since they'll be on Cloud).

I think we risk compromising user trust if we're making unexpected requests against their infra.

That's definitely a good point for making it opt-in rather than opt-out. In this case we could add that agents MAY implement this feature. Also, I don't know if we have telemetry to back this hunch but I don't think that there are a lot of applications communicating with more than one cluster, which means in most cases the extra granularity would not provide extra value for the majority of our users, thus skipping it by default would not be very problematic.

swallez commented 2 years ago

A few additional precisions regarding Cloud deployments (I was in the Cloud team for 4 years before joining Clients):

For all these reasons, IMHO it would be better to use the cluster id (i.e. cluster_name in GET /) as this is the most stable and most widely used information, with the additional benefit that it can be used to create direct links to the Cloud Console from the APM UI.

My 2 cents 😉

SylvainJuge commented 2 years ago

Proposal have been updated to "disabled + opt-in" for the extra request to ES which simplifies it to two cases:

@basepi can you approve this if it works for you so we can move this forward ?

felixbarny commented 2 years ago

The use case we have is that we want to make the service map more useful by showing which cluster(s) an application is talking to.

Right now, we just show something like

graph TD;
  MyApp --> Elasticsearch;

What we want to show is

graph TD;
  MyApp --> esp[Elasticsearch products];
  MyApp --> esu[Elasticsearch users];

While the cluster ID might be globally unique, it's not really as descriptive as the human-readable cluster name. So I think we'd prefer the human-readable name to make the service map more meaningful.

However, this would mean that we don't have a way to get the cluster name without actually calling the cluster settings endpoint. I get Colton's point that doing an extra call might be considered quite intrusive, especially if that means the application responds to user requests slower. But what if we get the cluster name in the background and make sure we only do it once? We could then either buffer the spans until we have the cluster name or just send the first ones without the cluster name.

basepi commented 2 years ago

But what if we get the cluster name in the background and make sure we only do it once? We could then either buffer the spans until we have the cluster name or just send the first ones without the cluster name.

I'm not arguing that we're going to slow down the target infrastructure. The vast majority of our users would likely never notice the additional call. But our policy has always been to not affect the target app, and this would be a change to that policy.

Maybe it's worthwhile. But I don't like it. Does that mean we shouldn't do it? I don't know.

The kicker for me was that this is solved for cloud instances (via the x-found-handling-cluster header). With our continuing focus on cloud it seems like we can be prudent with an opt-in for on-prem, while still getting a lot better data for cloud customers.

I'm open to being overridden, I just wanted to make my opinion known. :)

felixbarny commented 2 years ago

The kicker for me was that this is solved for cloud instances (via the x-found-handling-cluster header)

What I'm questioning is whether the value from this header is descriptive enough. It returns a unique alphanumeric identifier for the cluster (such as d66e0702fa9b48dfbcd629434c4beb83). Only the cluster settings endpoint would give us a more descriptive display name of the cluster (such as release-oblt (4667fb)). So if we want a descriptive identifier, I see no other way than to do this extra call.

We have precedent for potentially introducing extra calls to the DB in the Java agent: when we get the metadata for SQL database connection, some drivers implement connection.getMetaData() in a way that retrieves the meta data from the DB on demand. Therefore, the Java agent has a cache for this.

felixbarny commented 2 years ago

I just had a good discussion with @Mpdreamz on this. The TL;DR outcome was to use x-found-handling-cluster and to try to get Elasticsearch to include the cluster.name in a response header by default. These would be the only two strategies to determine the cluster name.

Some more details:

SylvainJuge commented 2 years ago

In order to make some progress here, I suggest the following next steps:

The opt-in allows to cover existing clusters for the users that need extra granularity. Implementation in agents will be optional and is likely to be added when someone asks for it (with the exception of the Java agent where the PR is already almost ready). Also, making this optional allows to change anything in the agents that already implement the cloud header part.

felixbarny commented 2 years ago

add an optional configuration elasticsearch_fetch_cluster_name that allows to opt-in on explicit cluster name retrieval

I'm not sure if it's worth adding that option, at least for now. It's not really a great user experience and it would be hard for users to discover. At least intuitively, I wouldn't expect that setting to get a lot of usage. It also doesn't always work as the user might be missing ES cluster level permissions. Also, it adds complexity to agents and because it's a rare code path, there's a good chance that it contains undetected edge cases or bugs.

I think I would remove it from the Java agent PR and if it turns out that this is a frequently requested feature, we can think again on how to enable it uniformly across all agents.

SylvainJuge commented 2 years ago

Thanks for your input @felixbarny , I agree with you it's simpler if we don't include this extra option for now, as you said, if needed we can always add it later. I'll update this PR accordingly so we can move forward.

Also, the implementation in the Java agent was quite tricky (and thus potentially brittle) because agents have to reuse the request header credentials and use them after the initial response is available.

When fetching the cluster name (unlike the cloud "display name" which is the user-visible name of the cloud deployment but requires some extra permissions), we just make an extra query to / , which does not require any extra permission as far as I know (this was also the request used for the "product check" in late 7.x clients).


Thus the updated plan is now even simpler now: