elastic / kibana


X-Opaque-ID contains UUID causing ES deduplication to fail #120124

Closed pgomulka closed 2 years ago

pgomulka commented 2 years ago

ES originally uses X-Opaque-ID to trace the origin of a request, to help identify slow search requests or the use of deprecated functionality. The expectation was that this value would contain something in the form of a user ID, but no specification was given for what the ID should look like. We documented this in both the slow log and deprecation use cases.

The value of the X-Opaque-ID is used when de-duplicating deprecation log messages (together with a key identifying the deprecated feature). That means that if the X-Opaque-ID is always unique, those messages are never de-duplicated. This causes the deprecation log to grow, and since we enabled deprecation log indexing by default, it might use significant cluster resources.
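To illustrate the throttling mechanism, here is a simplified TypeScript sketch (not Elasticsearch's actual Java implementation; the function name and key format are hypothetical):

```ts
// Simplified sketch: the throttle key combines the deprecated-feature key with
// the X-Opaque-ID, so a fresh UUID on every request defeats de-duplication.
const alreadyLogged = new Set<string>();

function shouldLogDeprecation(featureKey: string, xOpaqueId: string | undefined): boolean {
  const throttleKey = `${featureKey}:${xOpaqueId ?? ''}`;
  if (alreadyLogged.has(throttleKey)) {
    return false; // already logged for this (feature, caller) pair
  }
  alreadyLogged.add(throttleKey);
  return true; // first occurrence: emit (and index) a deprecation log entry
}

// With a stable caller id, repeated calls are throttled:
shouldLogDeprecation('include_type_name', 'my-app'); // true
shouldLogDeprecation('include_type_name', 'my-app'); // false
// With a per-request UUID, every call looks like a new caller:
shouldLogDeprecation('include_type_name', '1a8775bc-abc2-4e1e-99e6-76547e15e96f'); // true, every time
```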

The problem: Kibana uses a UUID in the form 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' for the X-Opaque-ID to help diagnose the conceptual origin of the HTTP request. From what I understand, it can also pass through the value provided by a user on a request to Kibana. But because the value is almost always unique, it generates a lot of duplicated deprecation logs.

I wonder if we could use trace-id for the UUID and something like a user ID for the X-Opaque-Id.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-core (Team:Core)

pgomulka commented 2 years ago

we also discussed this on a core-infra sync today. We realised that two separate use cases may be mixed here: tracing a request across the stack, and identifying the origin of a request. For tracing we would prefer traceparent/trace-id to be used. We need to expand the documentation on this. Here is the PR that added support for this header: https://github.com/elastic/elasticsearch/pull/74210/files The traceparent can contain generated UUIDs (following the defined syntax).

For the origin of a slow request or a deprecated-feature request, we would prefer the X-Opaque-ID to contain an ID from a finite set so that we can deduplicate deprecation logs. As documented, the intended use case is to set a user ID (no format enforced) so that the originating app/user can be identified. For Kibana, maybe the X-Opaque-ID could be something like a call-site (a class name or method?), and a UUID could be used for the traceparent.

droberts195 commented 2 years ago

For Kibana, maybe the X-Opaque-ID could be something like a call-site (a class name or method?)

Call-site would be useful, but in some cases it would be nice to also know the object that made the call.

Recently we have had a number of support cases where something in Kibana has been submitting a search to Elasticsearch that uses enormous resources, for example SIEM detection rules. In this case it would be nice to not just find out that the call site is "detection rules", but to find out the name of the particular rule that was involved.

I think this change would be in keeping with the original intent of this issue. For both deprecation log de-duplication and identification of the source of heavy searches it's useless to have a UUID for each search done by a particular SIEM detection rule. And in both cases it's useful to know not just that the source of the search was a SIEM detection rule, but which one.

Maybe simply call-site could be used most of the time, with call-site + object name to be used in a select few cases where we anticipate needing to identify the object.

lukeelmers commented 2 years ago

For the origin of a slow request or a deprecated-feature request, we would prefer the X-Opaque-ID to contain an ID from a finite set so that we can deduplicate deprecation logs

I completely understand the need to deduplicate the deprecation logs and the challenges that unique opaque IDs cause, but TBH I'm a little confused on this -- don't we call the ID "opaque" exactly because Elasticsearch shouldn't be making any assumptions about its contents?

What makes this challenging for Kibana is that we are doing more with x-opaque-id than just adding a request UUID. We now have an execution context service, which will be enabled by default starting in 8.1. The service allows plugins to create a context object that is propagated to ES via the x-opaque-id header.

This means that x-opaque-id can now look something like this (quoting an example from #119932):

Then, in the Elasticsearch slow logs you might find a hint that a request was initiated by the my-plugin-ui-feature feature of the my-app application and proxied through the my-plugin-server-feature feature of the my-app-server app:

x-opaque-id: ${requestId};kibana:my-app:my-plugin-ui-feature:42;my-app-server:my-plugin-server-feature:42

All this is to say, if we wanted to change the value Kibana passes via x-opaque-id, it would also affect our implementation of the execution context service. But I'll defer to @mshustov for his thoughts on that.
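For illustration, a hypothetical sketch of how such a composite value could be assembled (the field names and nesting are assumptions, not the actual ExecutionContextService code):

```ts
// Hypothetical shape of a nested execution context propagated from the browser
// through the Kibana server to Elasticsearch.
interface ExecutionContext {
  name: string;     // e.g. 'my-app'
  feature: string;  // e.g. 'my-plugin-ui-feature'
  id: string;       // e.g. '42'
  child?: ExecutionContext;
}

// Serialize the request id plus the context chain into a single header value.
function toOpaqueIdHeader(requestId: string, context?: ExecutionContext): string {
  const parts: string[] = [requestId];
  let current = context;
  while (current) {
    parts.push(`${current.name}:${current.feature}:${current.id}`);
    current = current.child;
  }
  return parts.join(';');
}
```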

rudolf commented 2 years ago

I think it's worth discussing what we want to do long term, but we need to find a solution for 7.16/7.17.

I think we have the following options:

  1. Remove x-opaque-id from Kibana, either without any replacement tracing header or by switching to the traceparent header. The Kibana Core team can explore the effort and any potential problems that might come out of this solution.
  2. When receiving the x-elastic-product-origin: 'kibana' header, Elasticsearch ignores the x-opaque-id header when deduplicating logs. Would this be a viable alternative?
mshustov commented 2 years ago

The expectation was that this value would contain something in the form of a user ID, but no specification was given for what the ID should look like. We documented this in both the slow log and deprecation use cases.

Since this format's been documented for a while, I agree that it's a bug. In https://github.com/elastic/kibana/pull/71019 Kibana didn't require x-opaque-id to be a unique user identifier, and as noted, Kibana falls back to a random uuid value if x-opaque-id is not provided.

We now have an execution context service, which will be enabled by default starting in 8.1. The service allows plugins to create a context object that is propagated to ES via the x-opaque-id header. This means that x-opaque-id can now look something like this

In the long term, we are going to migrate from the (mis-)used x-opaque-id header to the semantically correct baggage header: https://github.com/elastic/kibana/issues/101708. The draft status of the baggage header specification is the main blocker for this effort. If this is a critical problem, we can consider using another header for execution context propagation.

Remove x-opaque-id from Kibana.

We can't just remove x-opaque-id support, as that would be a breaking change. Kibana already documents the server.requestId settings: https://www.elastic.co/guide/en/kibana/7.16/settings.html. However, there are no requirements on the x-opaque-id content.

What I'd propose is:

  1. document the requirements for x-opaque-id content
  2. stop generating a uuid in Kibana if x-opaque-id is not provided
  3. decide what information Kibana should provide via the x-opaque-id header. @droberts195 suggested Call-site would be useful, but in some cases it would be nice to also know the object that made the call. We can provide a static kibana value if execution_context is disabled, or a more specific kibana entity value if execution_context is enabled. execution_context can hint whether a request originated from a Kibana visualization or a Kibana background task/alert. Either way, we get rid of the uuid value in the x-opaque-id header, which should stabilize the deprecation logging. WDYT? @droberts195 @pgomulka
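A minimal sketch of what points 2 and 3 could look like together (the helper name and fallback values are assumptions, not an agreed design):

```ts
// Never fall back to a random UUID; prefer a value from a finite set instead.
function resolveOpaqueId(
  clientProvidedOpaqueId: string | undefined, // x-opaque-id sent by the caller, if any
  executionContextValue: string | undefined   // e.g. a Kibana entity value when execution_context is enabled
): string | undefined {
  if (clientProvidedOpaqueId) return clientProvidedOpaqueId;
  if (executionContextValue) return executionContextValue;
  return 'kibana'; // static value when execution_context is disabled
}
```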

As a side note: we need to document the x-opaque-id content format requirement for other Elastic products as well. For example, the elasticsearch-js client doesn't document this requirement either: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/observability.html#_x_opaque_id_support

pgomulka commented 2 years ago

document the requirements for x-opaque-id content; stop generating a uuid in Kibana if x-opaque-id is not provided; decide what information Kibana should provide via the x-opaque-id header. @droberts195 suggested Call-site would be useful, but in some cases it would be nice to also know the object that made the call. We can provide a static kibana value if execution_context is disabled, or a more specific kibana entity value if execution_context is enabled. execution_context can hint whether a request originated from a Kibana visualization or a Kibana background task/alert. Either way, we get rid of the uuid value in the x-opaque-id header, which should stabilize the deprecation logging. WDYT?

this sounds great to me, because this is exactly what ES would expect on that header at the moment. I will work on improving the documentation for x-opaque-id and traceparent/trace.id fields

droberts195 commented 2 years ago

We can provide a static kibana value if execution_context is disabled, or a more specific kibana entity value if execution_context is enabled.

I guess this relates to https://github.com/elastic/kibana/issues/101587 right? (Sorry, I don't know much about prior work in this area.)

In that case I think what I was asking for is covered, but an important point is that all the sub-teams whose Kibana apps might submit heavy searches to Elasticsearch should actually take advantage of the execution_context functionality. It probably needs a publicity drive when it's ready for use, to ensure that all the people who work on Kibana apps for security, APM, alerts, ML, etc., are aware that heavy searches should be submitted with an execution_context that clearly identifies where they came from.

Anyway, I've subscribed to that other issue and I don't think it needs covering any more on this issue.

pgayvallet commented 2 years ago

Some technical pointers for the implementation

The requestId is set to the executionContext in the HTTP server, here:

https://github.com/elastic/kibana/blob/755950a52944007f75b7886e548a988b6f12c3dd/src/core/server/http/http_server.ts#L338-L343

We'll probably have to modify getRequestId to return undefined instead of a random uuid when the opaqueId can't be retrieved from the request (either because it's not present or because of the configuration).
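A hedged sketch of that change (not the actual http_server.ts code; the signature is simplified):

```ts
// Return the client-supplied x-opaque-id, or undefined; previously the
// fallback here was a freshly generated uuid.v4().
function getRequestId(
  headers: Record<string, string | string[] | undefined>,
  allowFromHeader: boolean
): string | undefined {
  const fromHeader = allowFromHeader ? headers['x-opaque-id'] : undefined;
  return typeof fromHeader === 'string' && fromHeader.length > 0 ? fromHeader : undefined;
}
```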

This ID is also accessible from the KibanaRequest instance, via the id property

https://github.com/elastic/kibana/blob/338fe1a65d2031db4228c7d2ad2ce13fc75740f3/src/core/server/http/router/request.ts#L130-L138

This is then used to set the header of the request against ES in two places:

https://github.com/elastic/kibana/blob/3c8fa527a7765723d3851afe067315359a4763e7/src/core/server/elasticsearch/client/cluster_client.ts#L97-L99

This is used to set the headers parameter when instantiating a child Client instance, in ClusterClient#asScoped.

https://github.com/elastic/kibana/blob/1a6a9ae43865c888c5d4ec92a3d028cbbabc36d9/src/core/server/elasticsearch/client/configure_client.ts#L41-L45

which is retrieving the value from the ExecutionContextService:

https://github.com/elastic/kibana/blob/d2bd7f848793ffed6ff52717abcad8f35a64b140/src/core/server/elasticsearch/elasticsearch_service.ts#L161

https://github.com/elastic/kibana/blob/4681a80317176543535bfdb2188d94aa40d741fc/src/core/server/execution_context/execution_context_service.ts#L155-L163

As the usage from the ClusterClient directly accesses the request, we'll likely need a way to differentiate a request whose ID was retrieved from the x-opaque-id header sent by the client from a request that has a generated ID.

From a quick grep, I couldn't really find any usage of KibanaRequest.id apart from the ones used in the ES service's code.

We will probably need to either make the KibanaRequest.id value optional and only populate it when it comes from a proper x-opaque-id header, or add a boolean flag indicating whether the value was autogenerated.
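The two shapes being weighed, sketched as simplified types (the names are illustrative only):

```ts
// (a) make the value optional and only populate it from a real x-opaque-id header
interface KibanaRequestOptionA {
  opaqueId?: string; // undefined when the client sent no x-opaque-id
  uuid: string;      // internal technical identifier, always present
}

// (b) keep the field but flag whether it was autogenerated
interface KibanaRequestOptionB {
  id: string;
  isIdAutoGenerated: boolean;
}
```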

It also makes me wonder why exactly we decided to set a default value for id, given that we also have KibanaRequest.uuid, which is supposed to be the real technical identifier of the object.

I would personally make KibanaRequest.id optional AND rename it to opaqueId, to reflect more precisely what exactly this value is about.

Then, we'll also need to modify ExecutionContextService.getAsHeader to better handle the case where the requestId / opaqueId is undefined (it actually uses an unknownId constant in that case, and we can probably do slightly better now that undefined will be a 'normal' value)
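A rough sketch of what getAsHeader could do once undefined becomes a normal value (the constant and parameter names are assumptions):

```ts
function getAsHeader(
  requestId: string | undefined,
  contextAsString: string | undefined
): string | undefined {
  if (!requestId && !contextAsString) {
    return undefined; // nothing meaningful to send anymore
  }
  const idPart = requestId ?? 'unknownId'; // today's constant; could become something more compact or be dropped
  return contextAsString ? `${idPart};${contextAsString}` : idPart;
}
```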

@lukeelmers @mshustov you probably have a better understanding of the whole problem space than I do. WDYT?

mshustov commented 2 years ago

we'll likely need a way to differentiate a request whose ID was retrieved from the x-opaque-id header sent by the client from a request that has a generated ID.

Why is it necessary if we aren't going to generate request id anymore? This value from ClusterClient will be overwritten by getExecutionContext, I'm not even sure we need this logic in getScopedHeaders.

It also makes me wonder why exactly we decided to set a default value to id, given we also have KibanaRequest.uuid which is supposed to be the real technical identifier of the object.

I suppose it's because an end-user can set any x-opaque-id value, so we can't rely on the format of the value.

I would personally make KibanaRequest.id optional AND rename it to opaqueId, to reflect more precisely what exactly this value is about.

We can do it since elasticsearch-js uses this naming as well. And there are no other consumers in the Kibana codebase anyway.

Then, we'll also need to modify ExecutionContextService.getAsHeader to better handle the case where the requestId / opaqueId is undefined (it actually uses an unknownId constant in that case, and we can probably do slightly better now that undefined will be a 'normal' value)

Not sure we can drop it completely, but we can use something more compact, like -. Note that in the long term we will move the execution_context from x-opaque-id to the baggage header, with proper semantic value. We are blocked by https://github.com/elastic/apm-agent-nodejs/issues/2102

pgayvallet commented 2 years ago

This value from ClusterClient will be overwritten by getExecutionContext, I'm not even sure we need this logic in getScopedHeaders.

if execution_context.enabled is false, getAsHeader returns undefined,

https://github.com/elastic/kibana/blob/4681a80317176543535bfdb2188d94aa40d741fc/src/core/server/execution_context/execution_context_service.ts#L155-L156

which makes the ES client use the header value specified by getScopedHeaders.

If we modify the execution context service to return a static value when it's disabled, then we wouldn't need to update the logic in getScopedHeaders. But in that case, if we accept that the x-opaque-id header sent to ES is always managed by the EC service, shouldn't we remove the x-opaque-id handling from ClusterClient#getScopedHeaders altogether?
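To make the precedence concrete, a simplified sketch (a hypothetical helper, not the actual client wiring):

```ts
// The execution-context value, when present, wins over the per-request value
// assembled from KibanaRequest.id in ClusterClient#getScopedHeaders.
function resolveOpaqueIdHeader(
  executionContextHeader: string | undefined, // from ExecutionContextService.getAsHeader()
  scopedOpaqueId: string | undefined          // from getScopedHeaders()
): Record<string, string> {
  const value = executionContextHeader ?? scopedOpaqueId;
  return value ? { 'x-opaque-id': value } : {};
}
```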

jportner commented 2 years ago

Sorry for the late comment, I was just today alerted to this issue.

The Kibana Security team originally requested the addition of an X-Opaque-ID for each request that Kibana makes to Elasticsearch. Our reasoning was that analysts could correlate different audit events, both within the Kibana audit log and between the Kibana and ES audit logs.

For example, with audit logging enabled in Kibana and ES, if the user logs into Kibana it generates this record:

{"event":{"action":"user_login","category":["authentication"],"outcome":"success"},"user":{"name":"elastic","roles":["superuser"]},"kibana":{"authentication_provider":"basic","authentication_type":"basic","authentication_realm":"reserved","lookup_realm":"reserved"},"trace":{"id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"},"ecs":{"version":"8.0.0"},"@timestamp":"2022-01-18T16:44:18.254-05:00","message":"User [elastic] has logged in using basic provider [name=basic]","log":{"level":"INFO","logger":"plugins.security.audit.ecs"},"process":{"pid":46291},"span":{"id":"3d9b422762c56078"}}

The trace ID in this audit record is 1a8775bc-abc2-4e1e-99e6-76547e15e96f, which is sent to ES in the X-Opaque-ID header and results in these six audit records:

{"type":"audit", "timestamp":"2022-01-18T16:44:18,182-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"elastic", "user.realm":"reserved", "user.roles":["superuser"], "origin.type":"rest", "origin.address":"127.0.0.1:49768", "request.id":"y54HTKqPR8-5MnlZ43Lc9Q", "action":"cluster:admin/xpack/security/user/authenticate", "request.name":"AuthenticateRequest", "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}
{"type":"audit", "timestamp":"2022-01-18T16:44:18,192-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"kibana", "user.realm":"reserved", "user.roles":["kibana_system"], "origin.type":"rest", "origin.address":"127.0.0.1:49567", "request.id":"XdL_lOiQTF-qWVthi-n47g", "action":"indices:data/write/index", "request.name":"IndexRequest", "indices":[".kibana_security_session_1"], "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}
{"type":"audit", "timestamp":"2022-01-18T16:44:18,192-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"kibana", "user.realm":"reserved", "user.roles":["kibana_system"], "origin.type":"rest", "origin.address":"127.0.0.1:49567", "request.id":"XdL_lOiQTF-qWVthi-n47g", "action":"indices:data/write/bulk", "request.name":"BulkRequest", "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}
{"type":"audit", "timestamp":"2022-01-18T16:44:18,192-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"kibana", "user.realm":"reserved", "user.roles":["kibana_system"], "origin.type":"rest", "origin.address":"127.0.0.1:49567", "request.id":"XdL_lOiQTF-qWVthi-n47g", "action":"indices:data/write/bulk[s]", "request.name":"BulkShardRequest", "indices":[".kibana_security_session_1"], "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}
{"type":"audit", "timestamp":"2022-01-18T16:44:18,192-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"kibana", "user.realm":"reserved", "user.roles":["kibana_system"], "origin.type":"rest", "origin.address":"127.0.0.1:49567", "request.id":"XdL_lOiQTF-qWVthi-n47g", "action":"indices:data/write/index:op_type/create", "request.name":"BulkItemRequest", "indices":[".kibana_security_session_1"], "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}
{"type":"audit", "timestamp":"2022-01-18T16:44:18,193-0500", "cluster.uuid":"4VGQTrEMR663-qySanse2g", "node.id":"PBrUlwyUTi6LxTST5jgJCg", "event.type":"transport", "event.action":"access_granted", "authentication.type":"REALM", "user.name":"kibana", "user.realm":"reserved", "user.roles":["kibana_system"], "origin.type":"rest", "origin.address":"127.0.0.1:49567", "request.id":"XdL_lOiQTF-qWVthi-n47g", "action":"indices:data/write/bulk[s][p]", "request.name":"BulkShardRequest", "indices":[".kibana_security_session_1"], "opaque_id":"1a8775bc-abc2-4e1e-99e6-76547e15e96f"}

We used this approach based on discussions with the ES Security team and based on the ES documentation for audit events:

opaque_id The value of the X-Opaque-Id HTTP header (if present) of the request associated with this event. This header can be used freely by the client to mark API calls, as it has no semantics in Elasticsearch.

This sounds like it's causing serious performance issues, but I am wondering why we are just now hearing about it since Kibana has been doing this since the 7.10 release?

If we remove the X-Opaque-Id header from Kibana's requests to ES, then our users lose the ability to correlate audit records between Kibana and ES. I tested requests to ES using both traceparent and trace-id headers, and neither of those show up in the ES audit records. Perhaps we can make a change to ES to add them? This is really late notice to try to shoehorn this change into 7.17/8.0 though.

The value of the X-Opaque-ID is used when de-duplicating deprecation log messages (together with a key identifying the deprecated feature). That means that if the X-Opaque-ID is always unique, those messages are never de-duplicated. This causes the deprecation log to grow, and since we enabled deprecation log indexing by default, it might use significant cluster resources.

Instead of changing Kibana, have you considered stripping the X-Opaque-ID from the deprecation logs?

pgomulka commented 2 years ago

We used this approach based on discussions with the ES Security team and based on the ES documentation for audit events:

opaque_id The value of the X-Opaque-Id HTTP header (if present) of the request associated with this event. This header can be used freely by the client to mark API calls, as it has no semantics in Elasticsearch.

This would be ok, but only if the requests that you want to trace do not emit the deprecation log. The part saying it has no semantics in Elasticsearch is incorrect. We do use it for throttling the slow log / deprecation log.

This sounds like it's causing serious performance issues, but I am wondering why we are just now hearing about it since Kibana has been doing this since the 7.10 release?

I don't have an answer to this. We could investigate this, but my guess would be that deprecation logs are not frequently checked by users; they are more likely to check them before upgrading to the next major. I suspect we also do not have cross-stack tests that would have made this visible earlier.

If we remove the X-Opaque-Id header from Kibana's requests to ES, then our users lose the ability to correlate audit records between Kibana and ES. I tested requests to ES using both traceparent and trace-id headers, and neither of those show up in the ES audit records. Perhaps we can make a change to ES to add them? This is really late notice to try to shoehorn this change into 7.17/8.0 though.

that is correct. Audit logs do not emit trace.id (see the config of emitted fields here: https://github.com/elastic/elasticsearch/blob/7.17/x-pack/plugin/core/src/main/config/log4j2.properties). We can consider adding that field, but is there any other field in the linked config that would serve that purpose? @elastic/es-security would you be able to advise us on this?

Instead of changing Kibana, have you considered stripping the X-Opaque-ID from the deprecation logs?

We did consider this, but that would be a significant breaking change. The deprecation log throttling based on x-opaque-id is an old feature.

rudolf commented 2 years ago

This sounds like it's causing serious performance issues, but I am wondering why we are just now hearing about it since Kibana has been doing this since the 7.10 release?

There's been a blind spot across the whole stack when it comes to monitoring and addressing deprecation warnings. Most components either monitor breaking changes or wait until an ES promotion fails. We need much better tooling here. I also suspect that as we've been approaching 8.0 there have been more deprecated APIs.

As far as I'm aware we don't have evidence that Kibana deprecation logs are overwhelming clusters, but we don't have an easy way to predict just how much volume a cluster might experience. The latest 7.17 build creates 6k deprecation log entries and 60 unique deprecations. But it's hard to translate that into how much volume a Kibana installation could cause, so I believe it's prudent to err on the side of caution.

We've asked teams to fix code that calls deprecated APIs, but some deprecations can't be fixed (https://github.com/elastic/elasticsearch/issues/81345) and I don't think we can fix the remaining ones in time.

@pgomulka I wonder if we could use a different deduplication strategy for requests with the x-elastic-product-origin: kibana header, and for these requests just ignore the x-opaque-id for deduplication? Given that these requests are "system" requests which users have no control over anyway, changing the behaviour here feels like changing an internal implementation detail.

pgomulka commented 2 years ago

I wonder if we could use a different deduplication strategy for requests with the x-elastic-product-origin: kibana header, and for these requests just ignore the x-opaque-id for deduplication? Given that these requests are "system" requests which users have no control over anyway, changing the behaviour here feels like changing an internal implementation detail.

I don't think we should do it this way. In the general case, the problem can be caused by any request that uses x-opaque-id, not only requests from Kibana. I understand that this could be fixed the way you described, but it won't fix the main problem, which is the incorrectly used x-opaque-id.

pgayvallet commented 2 years ago

My main concern is about the timing and the version we're expecting this fix to be shipped in.

The 7.17 FF was yesterday. Sure, this issue can be considered a bug and we can potentially merge something after the FF, but given @jportner's reply, we can no longer just remove this UUID generation as initially planned (and started in https://github.com/elastic/kibana/pull/123197), as it would be a breaking change for audit logging.

It seems we now need a solution that would either (or both) require changes on the format/data of Kibana's audit logs, or ES's audit logs, which feels way more impactful than initially anticipated.

So, is it a necessity to find a solution for 7.17, or would it be acceptable to have this merged in a later version (ideally 8.1), allowing us to have more time to think about the best solution to address this problem?

pgomulka commented 2 years ago

then our users lose the ability to correlate audit records between Kibana and ES.

@jportner can you tell us more about this use case? Is this part of any solution, and is it advertised/documented? My point is that if this is internal to Elastic, and maybe not widely used, then we can fix this in the next release with trace_id?

@bytebilly do you have details about correlating ES audit logs and Kibana audit logs? Was basing this correlation on x_opaque_id agreed?

jportner commented 2 years ago

then our users lose the ability to correlate audit records between Kibana and ES.

@jportner can you tell us more about this use case? Is this part of any solution, and is it advertised/documented? My point is that if this is internal to Elastic, and maybe not widely used, then we can fix this in the next release with trace_id?

This is documented here: https://www.elastic.co/guide/en/kibana/current/xpack-security-audit-logging.html#_tracing_fields

I appreciate that it may not be obvious if you don't have experience working with audit logs, but they are designed to track who did what, when -- and the number one thing that you need when you have a set of audit logs is the ability to correlate events between those different log files.

Many users don't need audit logging, but the ones that do (governments, regulated industries) rely heavily on this feature. This is especially important for Kibana, which uses its own credentials to make requests to Elasticsearch to make changes to system indices. In other words, tons of audit records show up in ES authenticated as the kibana_system user.

Correlation is one of the main requirements listed in our meta-issue for our "new audit logger" (aka "ECS audit logger") feature that we introduced in 7.11 to replace our legacy audit logger: #52125

rudolf commented 2 years ago

Summary of sync discussion with @pgayvallet @jportner @pgomulka @albertzaharovits @jakelandis

We have two options to avoid Kibana spamming deprecation logs due to it sending a unique x-opaque-id:

  1. Kibana and Elasticsearch should switch to using a unique trace header instead of the x-opaque-id header for audit log correlation. This better fits the intended use case from Elasticsearch for x-opaque-id and feels like the best long term solution. The downside is that it requires changes in Elasticsearch and Kibana and there's a risk that intermediate proxies like on Cloud might not expect this new header.
  2. Change the Elasticsearch deduplication logic to ignore x-opaque-id when performing deduplication only when the x-elastic-product-origin: kibana header is present. We considered ignoring the x-opaque-id header for all requests with the x-elastic-product-origin header, e.g. requests coming from all Elastic stack components. However, this could make it harder to trace the origin of deprecation logs from stack components which are correctly using the x-opaque-id header.

Given the limited time we have to get this fixed for 7.17 we decided that option (2) has less risk especially if we only change this behaviour for deprecation logs from Kibana.

Having said that, we will prioritize working on (1), targeting the 8.1 release.

jakelandis commented 2 years ago

we decided that option (2)

It was a short impromptu meeting, and I wouldn't consider this decided.

Change the Elasticsearch deduplication logic to ignore x-opaque-id when performing deduplication only when the x-elastic-product-origin: kibana header is present.

I have concerns with this approach. As mentioned, it does not address the root of the problem that x-opaque-id is used in an unexpected way, it requires ES to behave specially for Kibana, and it allows Kibana to call into deprecated APIs with fewer consequences. IMO, these are 3 reasons not to choose this approach.

I would prefer to better understand which deprecated APIs Kibana is calling and why that usage cannot be upgraded. If Kibana does not call any deprecated APIs, this is a non-issue. If there are valid reasons why Kibana needs to call deprecated APIs, then I would prefer to define a strategy that allows ES to not log those: a generic approach that does not require ES to behave differently for only Kibana and is scoped only to that request (not an entire product).

Do you have an idea of which APIs Kibana is calling that are triggering this unwanted behavior?

rudolf commented 2 years ago

There are currently 60 unique deprecation warnings and we are tracking them here https://docs.google.com/spreadsheets/d/1UHoQx6krVS9zYAL3Mjux7Onb3cFgDlKQoatJtH8_aVE/edit#gid=1574407978

But there are also some deprecations which are hard to mitigate e.g. https://github.com/elastic/elasticsearch/issues/81345

jakelandis commented 2 years ago

Do we plan to ship 7.17 with 60 deprecations from Kibana? FWIW, the de-duplication is per-node, so assuming these are not duplicated, in a perfect round-robin on a 100-node cluster these 60 deprecations could result in 6000 deprecation log entries. If we intend to ship with that many deprecations, then I think having a strategy to hide these is considerably more important than the current discussion of ensuring these are de-duplicated.

Do you have an estimate of how many unique deprecations are expected by the time we ship 7.17?

rudolf commented 2 years ago

Do you have an estimate of how many unique deprecations are expected by the time we ship 7.17?

In https://github.com/elastic/dev/issues/1843 we decided to only fix deprecations that were low risk, given that 7.17 was meant to be a stability release. There are no plans to reduce the deprecations triggered in 7.17; we expect all 60 of these unique deprecations to remain.

I think having a strategy to hide these is considerably more important than the current discussion of ensuring these are de-duplicated.

In https://github.com/elastic/dev/issues/1843 the ES team decided to propagate the x-elastic-product-origin header into the deprecation logs so that the upgrade assistant can filter them out. So users won't see any deprecation logs, but they'll still consume disk space.

What do you have in mind with "hiding" them? I don't think having these logs that are anyway hidden by upgrade assistant adds much value, so I think we could consider not creating deprecation logs for elastic stack components at all. I'm not sure what implications made us decide against this approach initially.

rjernst commented 2 years ago

in 7.17, we expect all 60 of these unique deprecations to remain

Just to double check, are these fixed in 8.0? At least one of those is breaking in 8.0.

rudolf commented 2 years ago

in 7.17, we expect all 60 of these unique deprecations to remain

Just to double check, are these fixed in 8.0? At least one of those is breaking in 8.0.

Kibana 8.0 CI is green so we have high confidence that everything has been addressed.

droberts195 commented 2 years ago

At least one of those is breaking in 8.0.

I think it means that at the point during a rolling upgrade where Elasticsearch has been upgraded to version 8.x but Kibana is still on version 7.17, the part of Kibana that uses the removed feature will not work. So it might be worth confirming that the usage is in a non-critical part of Kibana and not something that will cripple the whole Kibana server during this part of the rolling upgrade. I guess it's the include_type_name argument. So is that ever used somewhere in 7.17 where, if the endpoint returned an error instead of a deprecation, Kibana would completely stop working?

rudolf commented 2 years ago

Yeah we've run Kibana 7.17 CI against 8.0 and a few tests were failing which we're tracking here https://github.com/elastic/kibana/issues/123127

At least one of the include_type_name deprecations comes from https://github.com/elastic/kibana/issues/122933 but we don't consider that critical functionality or a release blocker

lukeelmers commented 2 years ago

I don't think having these logs that are anyway hidden by upgrade assistant adds much value, so I think we could consider not creating deprecation logs for elastic stack components at all. I'm not sure what implications made us decide against this approach initially.

I tend to agree, it feels like they aren't all that useful, but there could be some context I am missing. @jakelandis & team, what do y'all think?

jakelandis commented 2 years ago

The deprecation logs are intended to be viewed by users so that they can be made aware of changes they should make prior to upgrading. If stack products are the cause of these deprecation logs (and the user cannot do anything to get rid of them), we are diluting the value of these logs and complicating the upgrade.

I took us down a fork of the initial issue. The initial issue of the high duplication is independent of (just compounded by) how many deprecations we expect to be emitted from Kibana. ~I will fork this issue to an email conversation with those who commented here for further discussion.~

jakelandis commented 2 years ago

What do you have in mind with "hiding" them? I don't think having these logs that are anyway hidden by upgrade assistant adds much value, so I think we could consider not creating deprecation logs for elastic stack components at all. I'm not sure what implications made us decide against this approach initially.

Apologies... I was misunderstanding how x-elastic-product-origin was implemented. I see now that Kibana does indeed "hide" these, but only after they have been written to the index, and hidden only via the UI. My intent is not to re-open those decisions, and hiding via the UI (mostly) satisfies my concerns around the 60 deprecations expected to be shipped (~and will voice those lesser concerns via email~ edit: submitted https://github.com/elastic/elasticsearch/pull/82833 to handle unbounded growth of the log).

...back on topic ...

Change the Elasticsearch deduplication logic to ignore x-opaque-id when performing deduplication only when the x-elastic-product-origin: kibana header is present.

If x-elastic-product-origin from Kibana is always hidden from the user's view, then why only ignore x-opaque-id for deduplication? Wouldn't a simpler option be to skip logging the deprecations entirely? Wouldn't that have the same effect from the user's perspective? (again, not trying to reopen old decisions, just trying to understand)

pgomulka commented 2 years ago

If x-elastic-product-origin from Kibana is always hidden from the user's view, then why only ignore x-opaque-id for deduplication? Wouldn't a simpler option be to skip logging the deprecations entirely? Wouldn't that have the same effect from the user's perspective? (again, not trying to reopen old decisions, just trying to understand)

I don't think we considered skipping the logging of deprecations originating from Elastic products entirely last time (when the issue was originally created). I personally would be worried about this idea. How else would we track which deprecated features are still used within the stack (Kibana, for instance)? We could obviously push that responsibility onto the clients, to store the deprecation warning headers themselves. It sounds to me like this would mean clients (like Kibana) would have to effectively implement and maintain their own deprecation log.

Back on the issue and the discussion summary: I understand that fixing 60 deprecations in a short timeframe is not possible, and option 1 is also less favourable because it requires more changes. I am fine with option 2 (Elasticsearch not using x-opaque-id for throttling deprecations from Kibana only) if the 7.17 timing is the most important factor.

However, I don't think we considered pushing back the release of 7.17 and 8.0. Would it even be possible to get more time and implement option 1? If I understand it right, the only downside of this option is the short deadline for 7.17. I don't think traceparent could affect the cloud proxy; we already support traceparent in Elasticsearch. The only missing bits for option 1 are:

@igore-kupczynski Just to confirm. Do you think there could be any problem with cloud proxy if kibana start sending traceparent to elasticsearch?

pgayvallet commented 2 years ago

However, I don't think we considered pushing back the release of 7.17 and 8.0. Would it even be possible to get more time and implement option 1? If I understand it right, the only downside of this option is the short deadline for 7.17.

Unfortunately it's not. There are some significant technical implications for option 1 that we discovered yesterday while discussing with @mshustov and @jportner

During the investigation of https://github.com/elastic/kibana/pull/123197, we discovered a few things that would be problematic if we were to switch to using the traceId value from the traceparent header for audit logging correlation:

At the moment, Kibana is already sending a traceparent header to Elasticsearch; then, as you said, this traceparent header is used by ES to retrieve the traceId and other tracing values from Kibana, since https://github.com/elastic/elasticsearch/pull/74210

However, this header is sent by APM, meaning it is only present if APM is enabled on Kibana. When APM is disabled, there's no traceparent header circulating.

Initially, traceparent support was supposed to be ONLY for log correlation. As audit logging was using the x-opaque-id header, which was (and currently still is) always populated with a unique value, having traceparent be optional depending on the configuration was not an issue. If we switch the audit logging implementation to use traceId instead of xOpaqueId, this header being optional becomes a problem.

We can't really just change our ClusterClient implementation to send a traceparent header, as this would (potentially?) conflict with the one APM is sending. We also can't just let APM handle it, because when APM is disabled this header would not be sent, therefore breaking audit logging correlation entirely. Our only options would be to either manually generate / send this header when APM is disabled, or (if that's possible) stop APM from handling this header and always generate it manually, to make sure it is sent regardless of Kibana's configuration. Both are non-trivial.
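For reference, manually generating the header would mean emitting a W3C trace context value; a minimal sketch of that fallback (an assumption about the approach, not existing Kibana code):

```ts
import { randomBytes } from 'crypto';

// Build a W3C traceparent value: version "00", a 16-byte trace id, an 8-byte
// parent/span id, and the "sampled" flag.
function generateTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // 32 hex characters
  const parentId = randomBytes(8).toString('hex'); // 16 hex characters
  return `00-${traceId}-${parentId}-01`;
}
```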

Note that we also have a divergence between 8.1 and older versions, as https://github.com/elastic/kibana/pull/118466, which does some work in this area, was not backported to either 8.0 or 7.17 (as it was not fixing a bug, but adding a feature), which would also increase the complexity of working in the 7.17 branch (even if that's not blocking, just pointing it out).

and kibana not using generated UUID in x-opaque-id (already in progress for 7.17)

This PR is on hold until we agree on a technical solution as it currently breaks audit logging.

rudolf commented 2 years ago

Before we dismiss dropping Kibana deprecation logs completely as an option: Kibana already relies only on deprecation headers for development/testing. When instrumenting CI to detect deprecations, we needed a way to trace deprecations back to the calling site, and this was impossible using the logs. Looking at the returned header meant we could find the exact query and stack trace the query comes from and log all the context. The only shortcoming is that the deprecation headers don't include a level such as WARNING/CRITICAL.

pgayvallet commented 2 years ago

@jportner given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefore correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

If this is acceptable, it would allow us to always delegate the traceparent handling to APM, therefore removing most of the complexity described in https://github.com/elastic/kibana/issues/120124#issuecomment-1017307560. In that case, the steps described by @pgomulka in https://github.com/elastic/kibana/issues/120124#issuecomment-1017273996 would be sufficient, in addition to:

Note that if we go that way, we would kinda be forced to keep this enabled: true, contextPropagationOnly: true configuration for APM as the default.

rudolf commented 2 years ago

What's the risk of not making APM agent configurable at all? Do users have a good reason to disable it?

pgayvallet commented 2 years ago

Do users have a good reason to disable it?

Performance. APM has some impact on performance, even if the contextPropagationOnly mode is supposed to be negligible, given the data collected in https://github.com/elastic/kibana/pull/112973.

So no, I don't see any good reason to disable APM entirely rather than just enabling it in contextPropagationOnly mode (which is the default anyway). But maybe @mshustov has a better idea.

What's the risk of not making APM agent configurable at all

Introducing a breaking change in a minor, and an X.last, version. I feel like we may want to avoid that.

mshustov commented 2 years ago

given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefore correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

Worth noting that contextPropagationOnly support was added in https://github.com/elastic/kibana/pull/112973, which is not backported to 8.0 or 7.17.

Introducing a breaking change in a minor, and an X.last, version. I feel like we may want to avoid that.

AFAIK we've never documented the elastic.apm settings of kibana.yml (see https://github.com/elastic/kibana/issues/99328), so changing the settings format is not a breaking change.

Do users have a good reason to disable it?

Users may want to disable the APM agent to reduce runtime overhead and the overhead of sending additional tracing headers. @trentm do you see any other potential problems?

jportner commented 2 years ago

@jportner given that the default configuration for the APM agent on Kibana is enabled: true, contextPropagationOnly: true, therefore correctly sending the traceparent header, do you think it would be acceptable to document that audit logging correlation (which would now be based on trace-id) would require Kibana to have APM context propagation enabled to function (meaning that APM's context propagation should not be explicitly disabled by the customer)?

If this is acceptable, it would allow us to always delegate the traceparent handling to APM, therefor removing most of the complexity described in #120124 (comment).

Agreed, I think this makes sense, we just need to make sure audit logging docs are updated accordingly.

Note that if we go that way, we would kinda be forced to keep this enabled: true, contextPropagationOnly: true configuration for APM as the default.

Yes.

pgayvallet commented 2 years ago

So, to summarize, if we want to go ahead with option 1:

ES side

Kibana side

@pgomulka @jakelandis if we go that way, we're going to need the changes on ES to be available for testing, so we can make sure that audit log correlation still works after our changes, before merging/backporting https://github.com/elastic/kibana/pull/123197

@mshustov @jportner @rudolf before we confirm this plan as an option, does it seem correct to you? Maybe I missed something?

mshustov commented 2 years ago

@pgayvallet Kibana's side looks correct. However, specifying that the APM agent must be enabled (in contextPropagationOnly mode or not) for correlation to work looks like a significant change for Kibana. Don't we need Product approval for this? Maybe we should consider the fact that Kibana uses the APM agent an implementation detail and not allow customers to disable it?

pgayvallet commented 2 years ago

Maybe we should consider the fact that Kibana uses the APM agent an implementation detail and not allow customers to disable it?

Given the minimal impact of the agent enabled in contextPropagationOnly mode, I think it would be reasonable. Now, that's additional changes to perform (and to backport) for 7.17. Do we think this is a necessity, or are we fine with just documenting that audit logs require the agent to be enabled to function?

As a sidenote, I see that @pgomulka started working on https://github.com/elastic/elasticsearch/pull/82855, so if we're going with option 2 for 7.17, we would have more time to properly address option 1 in 8.1. This is why I would really like to have @pgomulka's and @jakelandis's opinions on the direction we want to take.

jportner commented 2 years ago

before we confirm this plan as an option, does it seem correct to you? Maybe I missed something?

Seems correct to me, though I'm unclear on whether audit logs would be missing the trace.id if we did not backport #118466. If that's the case, I think we need to backport it; it's not optional.

Now, that's additional changes to perform (and to backport) for 7.17. Do we think this is a necessity, or are we fine with just documenting that audit logs require the agent to be enabled to function?

I think that it's a necessity. Otherwise someone could have audit logging enabled (with correlation) in 7.16, then upgrade to 7.17 and lose correlation without any warning. We can't realistically expect our users to read through all the docs / release notes to catch something like this, and we don't have the upgrade assistant to surface this info in 7.16.


Another question: are we certain that the APM agent supports contextPropagationOnly in 7.17/8.0? Or is that only supported in 8.1+? The PR where the spec was added is not clear.

pgayvallet commented 2 years ago

Otherwise someone could have audit logging enabled (with correlation) in 7.16, then upgrade to 7.17 and lose correlation without any warning

I get that. To be honest, my opinion is that given that APM instrumentation isn't even documented anywhere, the risk is close to zero, but I can't argue with the fact that it is technically possible.

Now, forcing the agent to be enabled could also be considered a breaking change, because in the same scenario, a user with APM disabled in 7.16 would upgrade to 7.17 and discover that the agent is now enabled. So in that case, I'm unsure we can safely do that in 7.17.

are we certain that the APM agent supports contextPropagationOnly in 7.17/8.0?

Not atm. This is why we need to backport https://github.com/elastic/kibana/pull/112973

jakelandis commented 2 years ago

Now that I understand that users will never see these deprecation warnings, my initial reservations are dampened.

The issue really comes down to writing more bytes to an index than are expected and/or necessary. Those bytes will likely never be seen by users and are of minimal size (small messages with much repetition). So the problem we are trying to solve is how to avoid trivial levels of overhead writing these messages and minimal bloat in clusters that remain on 7.last for long periods of time.

I think we should solve the problem, but not sure if it is really a blocker to the release.

My initial reservations with the proposed fix:

x-opaque-id is used in an unexpected way, it requires ES to behave specially for Kibana, and it allows Kibana to call into deprecated APIs with fewer consequences

"it allows Kibana to call into deprecated APIs with fewer consequences" - this ship has already sailed and this de-duplication issue should focus on just the de-duplication aspect (sorry if forked the conversation and didn't pick up what was being said)

"x-opaque-id is used in an unexpected way" - There is an argument to be made that ES should not have ntroduced caching semantics to what is intended to be an opaque-id. In this case (and others) we tend to use the x-opaque-id as more of a client-id but it really isn't.

"it requires ES to behave specially for Kibana" - reservation holds

We considered ignoring the x-opaque-id header for all requests with the x-elastic-product-origin header, e.g. requests coming from all Elastic stack components. However, this could make it harder to trace the origin of deprecation logs from stack components which are correctly using the x-opaque-id header.

For me, this option (or something like it) is back on the table. If we already hide all stack-component deprecations from users, then making it harder to trace deprecations from stack components is a moot point. I would vote for this option... if x-elastic-product-origin is present, use that for deduplication; otherwise use x-opaque-id.
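Sketched in TypeScript for clarity (Elasticsearch's logging code is Java; the function and key format here are illustrative only):

```ts
// Use x-elastic-product-origin for throttling when a stack component set it;
// otherwise fall back to x-opaque-id as before.
function deprecationThrottleKey(
  headers: Record<string, string | undefined>,
  featureKey: string
): string {
  const caller = headers['x-elastic-product-origin'] ?? headers['x-opaque-id'] ?? '';
  return `${featureKey}:${caller}`;
}
```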

However, @rjernst and @pgomulka are the code owners and have the final say for this part of ES.

trentm commented 2 years ago

[@mshustov]

Users may want to disable the APM agent to reduce runtime overhead and the overhead of sending additional tracing headers. @trentm do you see any other potential problems?

No other potential problems that I see. Just performance overhead of the APM agent.

pgomulka commented 2 years ago

We considered ignoring the x-opaque-id header for all requests with the x-elastic-product-origin header, e.g. requests coming from all Elastic stack components. However, this could make it harder to trace the origin of deprecation logs from stack components which are correctly using the x-opaque-id header.

For me, this option (or something like it) is back on the table. If we already hide all stack-component deprecations from users, then making it harder to trace deprecations from stack components is a moot point. I would vote for this option... if x-elastic-product-origin is present, use that for deduplication; otherwise use x-opaque-id.

I like that. This is a variation of option 1, where x-opaque-id would not be used for Kibana's requests. Given that all requests with x-elastic-product-origin are hidden in the upgrade assistant, I don't mind using x-elastic-product-origin for all values when throttling deprecation logs. I will update https://github.com/elastic/elasticsearch/pull/82855 accordingly.

At the same time, I will merge #82849 to emit trace.id into audit logs, and update the misleading information in the audit log docs about x-opaque-id having no semantic meaning (#82868).

cjcenizal commented 2 years ago

CC @elastic/platform-deployment-management

pgomulka commented 2 years ago

I have merged [7.17] Do no use x-opaque-id for deduplicating elastic originating requests backport (#82855) #83023 into 7.17. Should we consider this issue done?

Or maybe we should keep it open until Kibana stops sending a UUID in x-opaque-id?

I also merged Emit trace.id into audit logs #82849 into 7.17 and updated the audit log documentation.

pgayvallet commented 2 years ago

should we consider this issue done? Or maybe we should keep it open until Kibana stops sending a UUID in x-opaque-id?

We do still need to address this properly on our side in 8.1 or 8.2. But there is a lot of noise on this issue, so I will open a new one summarizing all this, then close this one.

In the meantime, @pgomulka, do you know if an issue was opened to track adding trace-id support on the log output/format on the elasticsearch side? We'll still need that to properly address the issue.

pgomulka commented 2 years ago

In the meantime, @pgomulka, do you know if an issue was opened to track adding trace-id support on the log output/format on the elasticsearch side? We'll still need that to properly address the issue.

There was no issue created. It was added a while back, in ES 7.15. We only missed adding it to the audit logs.