Closed SylvainJuge closed 2 years ago
I am pretty allergic to (4). It feels very against our "don't interfere" policy to actually do something to the app. I think I would want it to be opt-in, not opt-out. With the company-wide emphasis on Cloud (which is already solved by (2)), I think opt-in is safer (and most users won't need it since they'll be on Cloud).
I think we risk compromising user trust if we're making unexpected requests against their infra.
I am pretty allergic to (4). It feels very against our "don't interfere" policy to actually do something to the app. I think I would want it to be opt-in, not opt-out. With the company-wide emphasis on Cloud (which is already solved by (2)), I think opt-in is safer (and most users won't need it since they'll be on Cloud).
I think we risk compromising user trust if we're making unexpected requests against their infra.
That's definitely a good point for making it opt-in rather than opt-out. In this case we could add that agents MAY implement this feature. Also, I don't know if we have telemetry to back this hunch but I don't think that there are a lot of applications communicating with more than one cluster, which means in most cases the extra granularity would not provide extra value for the majority of our users, thus skipping it by default would not be very problematic.
A few additional precisions regarding Cloud deployments (I was in the Cloud team for 4 years before joining Clients):
the cluster_name
in GET /
is the cluster id, with the following characteristics:
x-found-handling-cluster
https://cloud.elastic.co/deployments/<cluster-id>
the display_name
in GET /_cluster/settings
on the other hand:
For all these reasons, IMHO it would be better to use the cluster id (i.e. cluster_name
in GET /
) as this is the most stable and most widely used information, with the additional benefit that it can be used to create direct links to the Cloud Console from the APM UI.
My 2 cents 😉
Proposal have been updated to "disabled + opt-in" for the extra request to ES which simplifies it to two cases:
elasticsearch_fetch_cluster_name
option is set to true
(defaults to false
).@basepi can you approve this if it works for you so we can move this forward ?
The use case we have is that we want to make the service map more useful by showing which cluster(s) an application is talking to.
Right now, we just show something like
graph TD;
MyApp --> Elasticsearch;
What we want to show is
graph TD;
MyApp --> esp[Elasticsearch products];
MyApp --> esu[Elasticsearch users];
While the cluster ID might be globally unique, it's not really as descriptive as the human-readable cluster name. So I think we'd prefer the human-readable name to make the service map more meaningful.
However, this would mean that we don't have a way to get the cluster name without actually calling the cluster settings endpoint. I get Colton's point that doing an extra call might be considered quite intrusive, especially if that means the application responds to user requests slower. But what if we get the cluster name in the background and make sure we only do it once? We could then either buffer the spans until we have the cluster name or just send the first ones without the cluster name.
But what if we get the cluster name in the background and make sure we only do it once? We could then either buffer the spans until we have the cluster name or just send the first ones without the cluster name.
I'm not arguing that we're going to slow down the target infrastructure. The vast majority of our users would likely never notice the additional call. But our policy has always been to not affect the target app, and this would be a change to that policy.
Maybe it's worthwhile. But I don't like it. Does that mean we shouldn't do it? I don't know.
The kicker for me was that this is solved for cloud instances (via the x-found-handling-cluster
header). With our continuing focus on cloud it seems like we can be prudent with an opt-in for on-prem, while still getting a lot better data for cloud customers.
I'm open to being overridden, I just wanted to make my opinion known. :)
The kicker for me was that this is solved for cloud instances (via the
x-found-handling-cluster
header)
What I'm questioning is whether the value from this header is descriptive enough. It returns a unique alphanumeric identifier for the cluster (such as d66e0702fa9b48dfbcd629434c4beb83
). Only the cluster settings endpoint would give us a more descriptive display name of the cluster (such as release-oblt (4667fb)
). So if we want a descriptive identifier, I see no other way than to do this extra call.
We have precedent for potentially introducing extra calls to the DB in the Java agent: when we get the metadata for SQL database connection, some drivers implement connection.getMetaData()
in a way that retrieves the meta data from the DB on demand. Therefore, the Java agent has a cache for this.
I just had a good discussion with @Mpdreamz on this. The TL;DR outcome was to use x-found-handling-cluster
and to try to get Elasticsearch to include the cluster.name
in a response header by default. These would be the only two strategies to determine the cluster name.
Some more details:
display_name
GET /_cluster/settings
may be problematic as while the client might have permissions to do searches on specific indices, it may not have access to the cluster settings API.x-found-handling-cluster
is the cluster.name
which can be configured by users.
cluster.name
is usually more descriptive.In order to make some progress here, I suggest the following next steps:
elasticsearch_fetch_cluster_name
that allows to opt-in on explicit cluster name retrievalThe opt-in allows to cover existing clusters for the users that need extra granularity. Implementation in agents will be optional and is likely to be added when someone asks for it (with the exception of the Java agent where the PR is already almost ready). Also, making this optional allows to change anything in the agents that already implement the cloud header part.
add an optional configuration
elasticsearch_fetch_cluster_name
that allows to opt-in on explicit cluster name retrieval
I'm not sure if it's worth adding that option, at least for now. It's not really a great user experience and it would be hard for users to discover. At least intuitively, I wouldn't expect that setting to get a lot of usage. It also doesn't always work as the user might be missing ES cluster level permissions. Also, it adds complexity to agents and because it's a rare code path, there's a good chance that it contains undetected edge cases or bugs.
I think I would remove it from the Java agent PR and if it turns out that this is a frequently requested feature, we can think again on how to enable it uniformly across all agents.
Thanks for your input @felixbarny , I agree with you it's simpler if we don't include this extra option for now, as you said, if needed we can always add it later. I'll update this PR accordingly so we can move forward.
Also, the implementation in the Java agent was quite tricky (and thus potentially brittle) because agents have to reuse the request header credentials and use them after the initial response is available.
When fetching the cluster name (unlike the cloud "display name" which is the user-visible name of the cloud deployment but requires some extra permissions), we just make an extra query to /
, which does not require any extra permission as far as I know (this was also the request used for the "product check" in late 7.x clients).
Thus the updated plan is now even simpler now:
x-found-handling-cluster
in HTTP response if present.
/!\ Update: the current state of the proposal is summarized in this comment below.
Initial idea was to capture Elasticsearch cluster name using this strategy.
x-found-handling-cluster
HTTP response header which is provided in CloudHowever, after doing a PoC with Java agent (https://github.com/elastic/apm-agent-java/pull/2796) and talking with @swallez (ES clients team): The
cluster_name
defaults toelasticsearch
, but is usually set by the user for on-premises deployments, in cloud it is a random hex string like3a29c1a34e2fba92d245020741e38977
.GET /_cluster/settings
.For (1), if there was such storage it would be added only in newer clients
For (2), the implementation is trivial, the only downside is that it's not available outside cloud.
For (3), in order to be relevant, it would assume that the application does make a call to GET / or similar, which is very unlikely on most applications.
For (4), it seems to be an acceptable compromise in most cases, but there are caveats
GET /
is enough to get thecluster_name
GET /
was executed once on the first client request, with 8.x this is replaced by a fancy content-type header in response. One issue this approach was with lambda/faas when a single query to ES is made, it would simply double the number of queries. This is of course less an issue with regular backend applications.Proposal
Things to clarify/discuss
[x] can we reliably detect cloud ES deployment at runtime and disable extra query ?
[x] can we reliably detect FAAS execution context and disable extra query ?
faas
agent protocol property.[x] do we need to add an option to opt-out of this feature as fallback ?
Yes, following this feedback switching to opt-in rather than opt-out is preferable.
May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
[x] Create PR as draft
[ ] Approval by at least one other agent
[ ] Mark as Ready for Review (automatically requests reviews from all agents and PM via
CODEOWNERS
)[ ] Approved by at least 2 agents + PM (if relevant)
[ ] Merge after 7 days passed without objections \ To auto-merge the PR, add
/
schedule YYYY-MM-DD
to the PR description.[ ] Create implementation issues through the meta issue template (this will automate issue creation for individual agents)