Pinging @elastic/kibana-core (Team:Core)
#77214 proposes custom request headers for the sake of usage analysis; however, there's potentially some overlap with a solution here.
To reverse the question: Which additional information would you like us to send to ES here? I don't think we can automate/detect more than the calling plugin from Core. Do you have something particular in mind? Should we let callers of the ES API send their own custom 'meta' headers in addition to the x-elastic-plugin-origin planned to be added in #77214? Should they be able to specify an additional x-elastic-feature-origin or something?
#60122 proposes including the "execution context" in all Kibana log entries, but it doesn't propose forwarding these to Elasticsearch for correlation.
Do you want Kibana to forward them to the Elasticsearch server? Or would execution context + Kibana slow logs (https://github.com/elastic/kibana/issues/58086) be enough to find a culprit plugin on the Kibana side?
Which additional information would you like us to send to ES here? I don't think we can automate/detect more than the calling plugin from Core. Do you have something particular in mind? Should we let callers of the ES API send their own custom 'meta' headers in addition to the x-elastic-plugin-origin planned to be added in #77214? Should they be able to specify an additional x-elastic-feature-origin or something?
To be honest, I was intentionally being vague here because I don't have a solution in mind. Users are primarily looking for a way to answer the question "What in the world is causing this expensive query to be executed?". I'm open to all solutions that fully or partially allow this question to be answered, depending on their feasibility and maintainability.
I believe we need to be more specific than the "feature" if I understand that correctly. It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens". @kobelb Can you clarify the level of detail you are proposing for "feature"?
The goal is for an admin looking at an ES slow/expensive query to rapidly find the culprit in Kibana so they can address the stability of their environment.
I believe we need to be more specific than the "feature" if I understand that correctly. It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens". @kobelb Can you clarify the level of detail you are proposing for "feature"?
For sure, happy to revise the original description to make this more clear.
It would be incredibly useful to pinpoint the "thing" that called Elasticsearch like "Vijay's Visualization" in dashboard "foo" vs. "Lens".
I agree it would be useful. However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action, we could unintentionally leak private information.
What do you think? @kobelb @VijayDoshi
However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action, we could unintentionally leak private information.
We could just send the id of the SO instead (when working with SO's)
Note that tracking the feature/app that initiated SO requests doesn't seem like the part with the most value. I would think that it's currently harder to correlate requests performed against data indices.
We could just send the id of the SO instead (when working with SO's)
@mikecote my understanding is that alerts are not SOs in the traditional sense, would we need special handling for alerts if we used SO ids?
We could just send the id of the SO instead (when working with SO's)
@mikecote my understanding is that alerts are not SOs in the traditional sense, would we need special handling for alerts if we used SO ids?
If we're using the new terminology: correct, alerts are not SOs, but they do have unique identifiers. We can, however, use the rule ID because rules are SOs.
@VijayDoshi ^^
The challenging part here is not only the async nature of js but also a necessity to trace an operation across different environments.
- browser --> (proxy) --> Kibana server --> (proxy) --> elasticsearch
- browser --> Kibana server --> elasticsearch
- Kibana server --> elasticsearch
- a stack solution (e.g. Fleet) --> Kibana server --> elasticsearch
- 3rd party app --> Kibana server --> elasticsearch
Ideally, there is a single mechanism covering all aforementioned use cases.
It looks like a classic Observability problem, where we want to answer questions like:
Given that we already ship the built-in APM agent in the product, how realistic does it look to use it for debugging?
It may be a separate debugging mode, not enabled by default.
It's not a simple task and includes several steps (I hope I didn't miss anything):
But it sounds like we could provide a debug tool that automates all these steps. Not only does this allow us to avoid re-inventing the wheel, it also lets us dogfood the company's own solution.
opening a Visualization, using the Index-Management UI,
Almost all the requests to the Kibana server go through the data plugin. That means there is a single point at which to instrument the code on the client-side. For example, it can require a unique identifier for every operation, which can be used to correlate async operations in the logs.
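As a rough sketch of that idea (hypothetical helper and header names; `http` stands in for Kibana's client-side HTTP service):

```js
// Hypothetical sketch: generate one id per user operation and attach it to
// every request the data plugin sends so server-side logs can be correlated.
import { v4 as uuidv4 } from 'uuid';

export function createOperationContext(description) {
  return { id: uuidv4(), description };
}

// `http` stands in for Kibana's client-side core.http service; the header
// name below is a placeholder, not an agreed-upon convention.
export function searchWithContext(http, operationContext, searchRequest) {
  return http.post('/internal/my_plugin/search', {
    body: JSON.stringify(searchRequest),
    headers: { 'x-kibana-operation-id': operationContext.id },
  });
}
```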
task-manager claiming tasks, queries that are run on Kibana startup
The number of sources is more diverse on the server-side: SO client, ES client, data plugin, task manager. With multiple abstraction layers in between, passing the context through several layers might be tedious. That's why I'm thinking about reusing the APM agent infrastructure, or at least borrowing implementation ideas from it.
However, how does this play with PII concerns? If we send the name of the visualization/dashboard and/or even the user who performed the action, we could unintentionally leak private information.
Not sure all that information is needed. We are mostly interested in the operation source (for example, type: dashboard) and its unique identifier. The rest of the data can be obtained from these.
Given that we already ship the built-in APM agent in the product, how realistic does it look to use it for debugging?
That's a great question and one that'll require additional effort to get an answer to. I'm on board with investigating the feasibility of using APM for these types of situations, as it's the ideal solution for distributed tracing. @joshdover and I just recently discussed this, and there's a cross-team effort that we're trying to get going to answer this question.
But it sounds like we could provide a debug tool that automates all these steps.
Long-term, absolutely! For the time being, I think we should figure out whether we're comfortable recommending that users enable APM integration in Kibana and Elasticsearch to a dedicated cluster to diagnose these difficult issues.
whether we're comfortable recommending that users enable APM integration in Kibana and Elasticsearch to a dedicated cluster to diagnose these difficult issues
We're certainly comfortable enabling our agents in any rum+nodejs application. That said, the results are probably not well suited for user consumption as currently instrumented - the results are most useful to the developers of the application, mostly us in the case of Kibana.
I'd be happy to help coordinate pulling the right folks in from APM depending on your goals. @dgieselaar in particular has put a lot of effort into improving both Kibana instrumentation and node agent capabilities and can likely provide expertise across topics.
That said, the results are probably not well suited for user consumption as currently instrumented - the results are most useful to the developers of the application, mostly us in the case of Kibana.
I think we need to be adding human-readable transactions and spans for key user interactions. For example, visualization renders, dashboard renders, async search, discover search, alert executions, action executions etc. With these in place, we should be able to give users much more detailed information about what user-created data may be causing problems in their cluster.
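For the server-side examples (e.g. alert executions), a minimal sketch of what that instrumentation could look like with the Node.js APM agent's public API (the transaction name, type, and label here are illustrative, not an existing convention):

```js
// Sketch: wrap an alert execution in a named transaction so that slow ES
// queries it triggers can be traced back to the rule that caused them.
const apm = require('elastic-apm-node');

async function runAlertExecution(rule, executor) {
  const transaction = apm.startTransaction(`alert:${rule.ruleTypeId}`, 'task');
  apm.setLabel('rule_id', rule.id); // rule ids are saved-object ids, not PII
  try {
    return await executor(rule);
  } finally {
    if (transaction) transaction.end();
  }
}
```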
Some folks from the Kibana Core team and APM team are meeting today to discuss what's possible and craft a plan of next steps for leveraging APM more widely. I'll report back here with our results.
While APM doesn't yet provide a curated UI for doing this type of analysis, the raw data is indeed present and with a few manual steps in Discover we can easily trace a slow Elasticsearch query to a specific HTTP route or Task Manager task in Kibana today, without any changes to Elasticsearch. For many use cases, this should solve the problem.
APM doesn't have a UI that allows you to filter for specific span types, however it's possible to drop into Discover and find Elasticsearch queries originating from Kibana that are slower than 60s with a filter like: span.subtype: "elasticsearch" and span.duration.us >= 60000000. You can then pull the trace.id field from a query you're interested in and filter by that trace ID in the APM UI to find the specific Kibana transaction (eg. HTTP route, Task Manager task) this query originated from, along with any other traces of the same transaction.
Because a large portion of our HTTP routes are plugin-specific, this should allow us to easily determine which plugin is triggering ES slow queries in many cases.
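The same lookup can also be scripted against the APM indices; a rough sketch with the Elasticsearch JS client (the `apm-*` index pattern and the 60s threshold are assumptions mirroring the filter above):

```js
// Sketch: find Elasticsearch spans slower than 60s in the APM data and print
// their trace ids, which can then be pasted into the APM UI trace filter.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function findSlowEsSpans() {
  const { body } = await client.search({
    index: 'apm-*', // assumed APM index pattern; adjust to your deployment
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'span.subtype': 'elasticsearch' } },
            { range: { 'span.duration.us': { gte: 60000000 } } },
          ],
        },
      },
      _source: ['trace.id', 'span.name', 'span.duration.us'],
      size: 10,
    },
  });
  for (const hit of body.hits.hits) {
    console.log(hit._source.trace.id, hit._source.span.name);
  }
}
```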
For Kibana HTTP routes that are used by many different UIs, we'll need to do some additional work in adding instrumentation to the client-side in order to be able to tease out which UI triggered the underlying slow query. I think we can solve this with a combination of Core-level changes (eg. adding the plugin ID information to the HTTP client exposed to plugins on core.http.fetch) and plugin-specific changes to add more detailed information about the calling UI (eg. the data plugin's search utilities).
All this said, this is dependent on the customer being able to turn on the APM instrumentation. There's plenty of things we could do to make this easier for customers (eg. add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control). Other options include writing the APM output data to disk and having the customer send us the logs.
Is there any reason to believe that we'd still need an alternative non-APM solution for this? If not, I think focusing our efforts on improving the existing APM instrumentation would give us the best ROI on our time.
Yesterday we had a meeting with the Elasticsearch team about improving request traceability across the Stack. In the short term, we decided to start leveraging the existing APM infrastructure for code instrumentation, request correlation, and traceability UI. It allows Kibana users to debug performance bottlenecks in the user data flow as early as the next release. However, it comes with some costs:
In the long term, we are considering building a custom solution to overcome the problems listed above. This solution would not be a substitute for all the existing APM functionality; rather, it would complement it with a focus on a specific use-case: correlating a slow Elasticsearch query to a source in Kibana (visualization, SIEM detection engine, alert, etc.) without the necessity to spin up additional servers and with a smaller runtime overhead.
During a standup today, @cyrille-leclerc mentioned the W3C distributed trace context: Baggage as a potential solution for communicating the plugin-specific information from the client-side in the distributed traces. I was told that @dgieselaar knows the details of this. Have we looked into taking advantage of this? If it'd work for the APM solution, could we use it for the long-term solution where this context is included in the appropriate Elasticsearch logs?
Thanks @kobelb . @jasonrhodes and I will be happy to help. I'm not clear on the comments regarding the overhead of instrumentation. Do we have this cost as well if tracing is disabled? I think we could reuse as many distributed tracing standards as possible (W3C Trace Context, Propagation format for distributed trace context: Baggage, a standard/durable tracing API) and enable traces just when needed if the performance impact is too high. This should propagate all the context even when tracing is disabled.
Have we looked into taking advantage of this? If it'd work for the APM solution, could we use it for the long-term solution where this context is included in the appropriate Elasticsearch logs?
The primary challenge we have right now is not the transmission of the context from the client to the server, but the actual passing of the context data itself through all the application layers to the client-side code that initiates requests to the server.
This is a challenge in JavaScript due to its async, event-driven nature, which prevents us from easily being able to determine which user action triggered which Elasticsearch queries without manual code changes. On the server, Node.js's async_hooks API allows us to associate data with asynchronous "stack frames" in order to connect the initiating context with any asynchronous service calls. The Node.js APM agent uses this API already to provide context correlation.
On the browser, no such API exists, so we'll need to manually instrument each call path to ensure context is traced through each layer in our JavaScript services on the client. There is some degree of out-of-the-box global context tracing we can achieve, but for heavily used common services like the data plugin (which backs many Kibana applications), doing this in an automatic way is not currently possible.
We can definitely leverage the Baggage proposal for transmission of this context, however I don't believe the Baggage proposal will help us with actually delivering this context from the UI layers to the JS service layers where we make the requests to Kibana's server side. The OpenTelemetry (successor of OpenTracing) protocol that APM already supports does use the Baggage proposal, but at this time APM uses its own format for sending key-value pairs.
I do think using Baggage for sending key-value pair context information for direct use by Elasticsearch's slow log would be a better option than overloading the x-opaque-id header. However, I think it'd be acceptable to use x-opaque-id until the Elasticsearch team can schedule implementing support for propagating the Baggage header to the slow log.
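For illustration, the two header shapes look roughly like this (a sketch only; the key names and value formats are made up, and the Baggage example assumes the W3C key=value serialization):

```js
// Sketch: the same execution context carried two ways.
// 1) Overloading x-opaque-id with an opaque, Kibana-defined string:
const headersWithOpaqueId = {
  'x-opaque-id': 'kibana:dashboard:edf84fe0-e977-11e9-8f18-1fe1ecff581e',
};

// 2) W3C Baggage: self-describing key=value pairs (values percent-encoded),
//    which the slow log could pick apart once Elasticsearch supports it.
const context = {
  'kibana.type': 'dashboard',
  'kibana.id': 'edf84fe0-e977-11e9-8f18-1fe1ecff581e',
};
const headersWithBaggage = {
  baggage: Object.entries(context)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(','),
};
```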
I'm not clear on the comments regarding the overhead of instrumentation. Do we have this cost as well if tracing is disabled?
This initial conversation about instrumentation overhead was in reference to the overhead of using APM as a storage output for this context data. Now that we know we can leverage Elasticsearch's slow log directly without necessarily needing to use APM, we need to investigate whether or not we can implement this tracing without the same overhead as the APM agent.
It's likely that we'll want to use the async_hooks API on the server-side for any non-APM tracing. Using this API allows us to avoid a lot of manual instrumentation for context passing, however it does incur some overhead. @mshustov did just find that this API saw a considerable performance increase in Node.js v16.2 (https://github.com/nodejs/node/pull/36394). We plan to upgrade to v16 as soon as it is in the LTS phase, which is targeted for release in October.
In summary:

- The overhead of an async_hooks-based implementation is unknown at this time. It's likely not zero, but it may be low enough to be acceptable to enable this tracing by default. The alternative is more manual tracing work within Kibana.
- Using the x-opaque-id header on requests to Elasticsearch seems like a fine approach until we can get support for the Baggage header from the Elasticsearch team, which would serve as a better long-term solution to make consumption of this data much more practical + searchable.

Thanks for the great explanation, @joshdover. I appreciate you and your team putting such careful thought into this.
@joshdover do you have everything you need from the observability side on this? I know @graphaelli has been involved, but feel free to bring myself and/or @Mpdreamz (APM tech lead) in at any point if we can help support this effort. Thanks for the update!
The Elastic Stack has multiple components, and Kibana has multiple layers of abstraction. In case of slowness, it might be hard to understand whether the problem originated in Kibana itself (an excessive number of requests to Elasticsearch or another service, complex computations in Kibana applications, etc.) or in the Elasticsearch server (a complex query, a large time range, misconfigured indices, etc.).
When a Kibana user wants to understand why it takes a minute to render a dashboard or why requests to Elasticsearch get stuck in the request queue, they need to answer the following questions:
What service initiated a request? It might be triggered by user interaction in UI, a task scheduled by Kibana itself, a user hitting a Kibana endpoint, a Stack solution hitting a Kibana endpoint, etc. A Kibana user should be able to track an operation to a source triggering it.
What services did the request go through? Often Kibana plugins do not query the Elasticsearch server directly but use other plugins that abstract access to data storage. Without tracing the operation through the services, it might be impossible to understand which plugin issues a request or defines the query parameters causing performance problems.
What time did we spend in every intermediate service? A single request retrieving data from Elasticsearch might cause several requests performed by Kibana for internal needs (checking the permission level for a given user, storing data in the access log, etc.). These requests can have an impact both on perceived performance for the end-user and the Elasticsearch server handling a large number of incoming requests.
Where did an error happen? An exception might be raised at any level deep in the Kibana code; it should be possible to unwind the error stack trace to the source of the operation and collect the parameters causing the problem.
In the current issue, we focus on the use case covering 80% of slowness problems in the Stack: detecting a slow Elasticsearch query and tracing it back to the Kibana plugin triggering the query.
Thereby, we are trying to answer the single question: What service initiated a request?
In the long term, Kibana should provide tooling that gives insights into Stack performance and Stack component interactions, with the ability to drill down into every component in the system. This functionality is out of the scope of the current issue.
A Kibana user should be able to:
Kibana is already shipped with a built-in Elastic APM agent. The agent instruments Kibana code to collect runtime stats. The runtime stats are sent to an APM server to be processed and stored in the APM cluster within Elasticsearch.
Use case | APM agent
---|---
What service initiated a request? | No, requires manual instrumentation to correlate a transaction with a user action in the UI or the plugin triggering a request
What services did the request go through? |
What time did we spend in every intermediate service? | Requires manual instrumentation to mark plugin-specific operations in an async stack trace.
Where did an error happen? | Requires manual instrumentation to add plugin-specific context to the async stack trace.
Cons:

- Requires the APM server running
- Requires deployment of APM clusters in Elasticsearch
- Management overhead for APM data stored in Elasticsearch
- Doesn't integrate with Elasticsearch slow query logging
- Doesn't support alternative output destinations (Kibana logs, Kibana slow logs, telemetry, etc.)
- Doesn't support async tracing in the browser
Right now, the APM agent doesn't support integration with Elasticsearch slow query logging, nor async tracing in the browser. These are critical requirements for the current issue and, most likely, it will not be possible to fix these drawbacks quickly. That's why it may make sense to start with a custom solution and consider switching to the APM agent later.
Below I will use the term execution context to mean a container for runtime execution metadata, to emphasize that it's not the same as an APM transaction.
Use case | Custom solution
---|---
What service initiated a request? | Supports creating an execution context manually in server- and client-side environments, and carrying execution context information over between Kibana parts and Stack components.
What services did the request go through? | not supported
What time did we spend in every intermediate service? | not supported
Where did an error happen? | not supported
Cons:
A plugin creates an execution context with an API provided by Core. The returned value is opaque to the plugin.
const executionContext = createExecutionContext({
  name: 'plugin name',
  type: 'plugin element name',
  url: '/path/to/element/in/ui', // optional
  description: 'human readable description', // optional
})
The execution context should be passed to the Kibana server. Right now we have two options:
- use an async_hook-like API in the browser environment (for example, zone.js) to store and retrieve the execution context before transmitting a request to the Kibana server over the wire. To investigate: how much overhead the package adds and what network libs should be instrumented (kibana.http, XMLHttpRequest, axios, anything else?)
- pass the execution context manually through all the layers of abstraction in Kibana and set it as a request header before issuing a request to the Kibana server. To investigate: how much refactoring it would require to pass the execution context meta information; maybe it's not that hard considering the expressions plugin passes request parameters via ExecutionContext for every request.

The execution context is serialized and added as the request's Baggage header every time Kibana issues a request to the Kibana server.
Core reads the execution context from the Baggage header. The value is stored in Asynchronous Local Storage until the server responds to the request. As a bonus, Core could create an APM span populated with data retrieved from the execution context.
Core retrieves the execution context from Asynchronous Local Storage whenever a plugin issues an Elasticsearch request and attaches the value to the x-opaque-id header.
The execution context can be emitted to the Kibana elasticsearch service logs for debugging purposes to avoid the necessity to enable slow logs in the Elasticsearch server.
Core provides createExecutionContext on the server-side as well; it can be used by a server-side plugin initiating an asynchronous operation (such as task_manager). In this case, Core stores the created execution context value in Asynchronous Local Storage and attaches it to the Elasticsearch request as described above.

@kobelb @joshdover and others. I'd appreciate your input on the implementation options and the Open questions section.
My thoughts:
Should Kibana enable tracing by default, or should it be configurable?
Depends on the final implementation overhead. Maybe it should be enabled by default to be accessible by Kibana or Elasticsearch. Elasticsearch slow logging is configured via logging level.
Where should Kibana write the tracking information?
Required: x-opaque-id or another header when requesting the Elasticsearch service.
Optional: Kibana logs
Should Kibana provide UI for the tracing functionality?
not at the current stage
What minimal set of data allows tracing the request back to a source?
- a Kibana component name. should be part of public API.
- A link to component (link to a dashboard, visualization, alert) to inspect it in UI.
- Human-readable description of component (a dashboard title, alert name)
- What plugins should be reported as sources
in order of priority:
- Should Kibana support creating nested execution contexts?
If we encounter such a use case.
- Who is responsible for instrumenting the Kibana code?
To speed up the process, it makes sense to have the Core team do it, consulting in advance with the owners of the code.
@jasonrhodes @graphaelli @Mpdreamz does it make sense to you if the Core team implements a custom solution instead of relying on APM for this specific use case of tracing an Elasticsearch query to a Kibana source? What do we lose by giving up the APM solution?
Short reply, will go into more depth tomorrow, but I feel strongly that we should be each other's worst/best customer.
If we built something custom now it'll be a missed opportunity to invest in a deep stack integration.
For instance @SylvainJuge explored first class support for APM within Elasticsearch at several levels. Even if Elasticsearch does not ship with APM anytime soon we could look at it supporting trace headers and persisting these to Elasticsearch logs in addition to x-opaque-id.
This way, even if APM is not installed, Kibana can still manually correlate traces rather than building a custom scheme for this.
Thanks @Mpdreamz. I hope that we will find a solution to adopt distributed tracing capabilities in our stack, as it's the de facto standard in our industry nowadays. Hopefully, our tracing solution will do the job. We can also consider tracking some Enhancement Requests if needed. Moreover, if the existing distributed tracing solutions don't fit our needs for the moment, it could be possible to use standard Distributed Tracing APIs and wire in our own implementation if we have specific characteristics (e.g. special context propagation constraints...).
As @Mpdreamz said, a custom solution, in addition to being orthogonal to the industry trend, would be a missed opportunity to productize our work.
I'm completely fine with implementing a custom solution in the short term. We'll need this information integrated with APM eventually, and we have every intention of using APM whenever relevant, but we have a specific need that we need to address. I'm fully on board with whatever implementation path has the least resistance, as long as it doesn't exclude us using APM in the future.
I'm in agreement with a majority of your answers to the open questions, @mshustov, with a few exceptions/clarifications:
Should Kibana enable tracing by default, or should it be configurable?
Depends on the final implementation overhead. Maybe it should be enabled by default to be accessible by Kibana or Elasticsearch. Elasticsearch slow logging is configured via logging level.
:+1: to enabled by default, whenever there isn't a performance penalty. I don't think we need to follow Elasticsearch's precedent and only enable this when a user opts into this functionality.
What minimal set of data allows tracing the request back to a source?
a Kibana component name. should be part of public API. A link to component (link to a dashboard, visualization, alert) to inspect it in UI. Human-readable description of component (a dashboard title, alert name)
I don't think we're always going to have a link. For example, it's possible for a background task to generate a query to Elasticsearch. Can we support both a link and/or a "component identifier"? This would let us put the background task UUID into the "component identifier".
I'm completely fine with implementing a custom solution in the short term. We'll need this information integrated with APM eventually, and we have every intention of using APM whenever relevant, but we have a specific need that we need to address.
+1 to going with the custom implementation in the short-term, unless there is a very easy way for the Node.js APM agent to give us these execution context tracing features on the server-side, even when we're not shipping data to an APM server + not taking span measurements. We should probably at least talk with @trentm and team about the possibility here before moving forward with the custom implementation.
That said, we'd likely want this context for other features like execution context for logs as well. I'm not sure using APM as a central building block of this infrastructure is necessarily a good idea since we'll need to coordinate changes across the two repositories and teams. It's worth noting that we should be able to leverage this custom context tracing with data sent from APM as well.
Like Brandon, I also agree with your answers to most of the open questions, just a few comments:
Should Kibana support creating nested execution contexts?
I suspect we're going to run into this case pretty quickly, but I'm fine on deferring until we do. A spec for how it might work may be helpful to be sure that the design is coherent.
What minimal set of data allows tracing the request back to a source?
- a Kibana component name. should be part of public API.
- A link to component (link to a dashboard, visualization, alert) to inspect it in UI.
- Human-readable description of component (a dashboard title, alert name)
I think for UI objects represented by a Saved Object, a type and ID would be quite helpful as well.
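Putting those suggestions together, the context payload could end up looking something like this (a hypothetical shape, not a committed API; all values are illustrative):

```js
// Hypothetical execution context combining the fields proposed above.
const executionContext = {
  type: 'visualization', // saved object type
  id: 'edf84fe0-e977-11e9-8f18-1fe1ecff581e', // saved object id
  description: "Vijay's Visualization", // human-readable name
  url: '/app/visualize#/edit/edf84fe0-e977-11e9-8f18-1fe1ecff581e', // link to inspect it in the UI
};
```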
Elasticsearch master switched the slow logs over to ECS format and no longer includes X-Opaque-Id.
> POST /logs/_search HTTP/1.1
> Host: localhost:9200
> Content-Type: application/json
> X-Opaque-Id: search
> Accept: */*
> Content-Length: 14
| {
| "size": 1
| }
{"@timestamp":"2021-06-03T17:51:52.971Z", "log.level": "WARN", "id":"search", "message":"[logs][0]", "search_type":"QUERY_THEN_FETCH", "source":"{\\\"size\\\":1}", "stats":"[]", "took":"722.9micros", "took_millis":"0", "total_hits":"0 hits", "total_shards":"1" , "service.name":"ES_ECS","process.thread.name":"elasticsearch[runTask-0][search][T#11]","log.logger":"index.search.slowlog.fetch","event.dataset":"elasticsearch.index_search_slowlog","elasticsearch.cluster.uuid":"_dwxIGdkSWOebht9LUaPSA","elasticsearch.node.id":"179RfCjATYCwP52D8Zu79g","elasticsearch.node.name":"runTask-0","elasticsearch.cluster.name":"runTask"}
This ties into our https://github.com/elastic/ecs-logging efforts and makes Elasticsearch ready for deeper integration with APM and Logs UI once Elasticsearch starts supporting w3c traceheaders & Elastic Agent makes it super easy to ship ECS logs.
X-Opaque-Id is/should be on the way out as a custom trace header.
I'd advocate heavily against putting X-Opaque-Id back in master in Elasticsearch, too. This would give it a special meaning again in the context of Elasticsearch's ECS logs.
A custom short-term solution is therefore also possibly short-lived. Or worse, we bless X-Opaque-Id as a trace header, which sends the wrong signals IMO and would make it much harder to sunset in logging post 8.0 since Kibana heavily relies on it.
I'm not sure using APM as a central building block of this infrastructure is necessarily a good idea since we'll need to coordinate changes across the two repositories and teams.
I'm less afraid of this, you've clearly identified some gaps in both the Agents and possibly ECS-logging and we should investigate support for Baggage headers.
That said I lack visibility into timelines here but again this feels like a win/win for both products if we design a solution together.
and no longer includes X-Opaque-Id. ... X-Opaque-Id: search ...
{"@timestamp":"2021-06-03T17:51:52.971Z", "log.level": "WARN", "id":"search", ...
@Mpdreamz Is that "id" field from the x-opaque-id header? (I'm totally ignorant on ES slowlogs.) If not, is there some mechanism to send a header to ES queries that could land in slowlogs?
It is 🙀! Can't believe I missed it.
If not, is there some mechanism to send a header to ES queries that could land in slowlogs?
X-Opaque-Id currently fills that role; my previous note about looking to switch to w3c trace headers for this purpose still stands though.
unless there is a very easy way for the Node.js APM agent to give us these execution context tracing features on the server-side, ...
I had a bit of a play with this, to see what I could easily get the Node.js APM agent to do. The idea is to use the W3C baggage header to propagate execution context info between the Kibana front-end and back-end, maintain that on the Node async context, and make the info available wherever is appropriate for debugging slow ES requests.
(My changes are here: https://github.com/elastic/apm-agent-nodejs/compare/trentm/baggage-play)
There is an "esapp.js" which is an Express app that represents the Kibana backend: it uses APM and talks to ES (via its single '/' endpoint to make a query). For starters I have cheated and I am not using the W3C's serialization of baggage -- I am using JSON.stringify|parse instead.
I've updated the APM agent to propagate a given baggage header, so it could receive that from a RUM APM agent updated to support baggage, and it is passed through to the ES query (if/when ES supports doing anything with that header). For example, say the Kibana front-end makes a query like this to the back-end:
curl -i localhost:3000/ -H tracestate:foo=bar -H baggage:'{"spam":"eggs"}' -H traceparent:00-110a729f520afc198f9992b6a9c2b738-1144edb970081354-01
Any resulting ES queries made by the backend for that transaction get these headers (i.e. the trace id and baggage are maintained):
XXX outgoing instrumented request: POST /_search {
authorization: 'Basic YWRtaW46Y2hhbmdlbWU=',
'user-agent': 'elasticsearch-js/7.12.0 (darwin 20.4.0-x64; Node.js v12.22.1)',
'x-elastic-client-meta': 'es=7.12.0,js=12.22.1,t=7.12.0,hc=12.22.1',
'content-type': 'application/json',
'content-length': '35',
traceparent: '00-110a729f520afc198f9992b6a9c2b738-f10bfdbbc10f6f8a-01',
tracestate: 'foo=bar',
baggage: '{"spam":"eggs"}'
}
I added an API to the APM agent to get (apm.getBaggageEntry()) and set baggage entries. Modulo Martijn's comments above, one can use the ES client "request" event to set the x-opaque-id header for ES requests:
// Add `x-opaque-id: <execution context>` header to ES requests.
esclient.on('request', function (_err, event) {
const ec = apm.getBaggageEntry('spam')
if (ec) {
event.meta.request.params.headers['x-opaque-id'] = ec
}
})
which results in:
XXX outgoing instrumented request: POST /_search {
...
'content-length': '35',
'x-opaque-id': 'eggs', // <---- HERE
traceparent: '00-110a729f520afc198f9992b6a9c2b738-b3d7322e336b0044-01',
tracestate: 'foo=bar',
baggage: '{"spam":"eggs"}'
}
Perhaps Kibana could have a utility to set execution context info on a well-known baggage key, e.g. kibanaExecutionContext:
function setExecutionContext ({name, type, url, desc}) {
const exectx = { name, type, url, desc }
apm.setBaggageEntry('kibanaExecutionContext', JSON.stringify(exectx))
}
Then Kibana code could call it (I don't know Kibana well enough to know if/where this would best fit in):
app.get('/', (req, res) => {
setExecutionContext({ name: 'doAQuery', type: 'someType', url: req.url })
esclient.search({
...
Then any outgoing requests will send that baggage along. This utility could be improved to support stacked execution contexts, subject to limits in the size of the 'baggage' header.
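A possible shape for that, building on the prototype baggage-entry API above (hypothetical helpers, not part of the agent; `apm` is the module-level agent as in the earlier snippets):

```js
// Sketch: keep a stack of execution contexts in one baggage entry so nested
// operations (dashboard -> embedded visualization -> search) stay visible.
// The serialized stack must stay within the baggage header size limits.
function pushExecutionContext (ctx) {
  const stack = JSON.parse(apm.getBaggageEntry('kibanaExecutionContext') || '[]')
  stack.push(ctx)
  apm.setBaggageEntry('kibanaExecutionContext', JSON.stringify(stack))
}

function popExecutionContext () {
  const stack = JSON.parse(apm.getBaggageEntry('kibanaExecutionContext') || '[]')
  const ctx = stack.pop()
  apm.setBaggageEntry('kibanaExecutionContext', JSON.stringify(stack))
  return ctx
}
```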
If using the ES slowlog isn't an option or appealing, one can use the ES client's "request" and "response" o11y events to measure slow ES queries itself:
// Warn on "slow" ES queries, include execution context.
const WHAT_IS_SLOW = 10 // ms
esclient.on('request', function (_err, event) {
const { meta } = event
if (!meta.context) {
meta.context = {}
}
meta.context.startTime = Date.now()
})
esclient.on('response', function (_err, event) {
const latency = Date.now() - event.meta.context.startTime
if (latency > WHAT_IS_SLOW) {
console.warn(`ZOMG this ES query was SLOW:\n latency: ${latency}ms\n kibanaExecutionContext: ${apm.getBaggageEntry('kibanaExecutionContext')}`)
}
})
which looks like the following in Kibana logs:
XXX outgoing instrumented request: POST /_search {
...
traceparent: '00-e885d42982ca6b7b2cc0ec6aa0a3fd2e-e8b1126400f28303-01',
tracestate: 'es=s:1',
baggage: '{"kibanaExecutionContext":"{\\"name\\":\\"doAQuery\\",\\"type\\":\\"someType\\",\\"url\\":\\"/\\"}"}'
}
ZOMG this ES query was SLOW:
latency: 16ms
kibanaExecutionContext: {"name":"doAQuery","type":"someType","url":"/"}
even when we're not shipping data to an APM server + not taking span measurements.
Currently the Node.js APM agent expects there to be an APM server to talk to.
However, since v3.14.0 of the agent, it will work fine: it will complain (log.error) every ~30s, buffer ~1000 events, and otherwise carry on tracing.
There is some small work we should do there.
Minimizing overhead: Using the APM agent for Kibana back-end async context tracking like this will have some more overhead than a custom tracker that uses Async Local Storage. There are some things that could be done to minimize that:
- turn off metrics, breakdown metrics, any stacktrace collection
- tweak disableInstrumentations to limit it to just http, elasticsearch, hapi
@trentm thank you for the detailed feedback.
Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.
I don't know the apm agent internals, but we might need to refactor buffer logic to FIFO as well. From the maxQueueSize description: If the queue is full, events are rejected which means you will lose transactions and spans.
turn off metrics, breakdown metrics, any stacktrace collection
tweak disableInstrumentations to limit it to just http, elasticsearch, hapi
Kibana can disable them via APM config; no additional work required from the APM team.
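For reference, a sketch of what that trimmed-down configuration might look like with the agent's documented options (the exact option set and values would need to be confirmed against the agent docs and benchmarked):

```js
// Sketch: start the agent with instrumentation trimmed down to what is
// needed for context propagation and ES request tracing.
const apm = require('elastic-apm-node').start({
  serviceName: 'kibana',
  captureSpanStackTraces: false, // skip span stack trace collection
  breakdownMetrics: false,       // turn off breakdown metrics
  metricsInterval: '0s',         // turn off periodic metrics collection
  // disableInstrumentations lists the modules NOT to instrument; everything
  // except http, elasticsearch and hapi would go here, for example:
  disableInstrumentations: ['graphql', 'mysql', 'pg', 'redis', 'mongodb'],
});
```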
To summarize all of the above, we can split the work into the following stages:

- Kibana can start with a custom solution that supports context propagation Kibana browser app --> Kibana server --> Elasticsearch service, as described in #97934 (comment)
- As soon as the APM nodejs client is able to work without having to run the APM server, we can switch from the custom solution for the Kibana server to the solution based on APM nodejs agent. We will have to keep the custom implementation for the Kibana browser app until the apm-rum agent adds support for async stack traces in the browser.
- When Elasticsearch adds a deeper integration with APM events, we will refactor the communication protocol between Kibana server <--> Elasticsearch server to get rid of the x-opaque-id header in favor of tracing headers.
- Once the APM rum agent adds support for asynchronous trace stacks in the browsers out of the box, we might consider switching to it from our custom solution.

Any suggestions or objections to this plan? @Mpdreamz @trentm @kobelb @joshdover
@trentm do you want me to create appropriate tasks for the APM nodejs agent?
Thanks all for the careful thought being put into this. I totally agree that a fully integrated solution on top of APM would be ideal and a win for APM customers as well. I'm open to collaborating towards that as an end-goal. It seems what we need to figure out at this point is roadmap coordination between the Kibana Core and APM Agent teams to determine whether or not we need a short-term solution on the Kibana side.
- Kibana can start with a custom solution that supports context propagation Kibana browser app --> Kibana server --> Elasticsearch service, as described in #97934 (comment)
A non-APM solution for client-side context tracing seems unavoidable at this time, unless the Kibana Core team contributes directly to the RUM agent to integrate a context tracing solution like Zone.js. This seems like a pretty high-difficulty feature to add to the agent and something that the agent team will want to get right the first time.
On the Kibana side, I'd rather we opt towards the manual context passing approach until we have something more sophisticated available from the RUM agent, given that the RUM agent can commit to providing this feature at some point in the medium-term (6 months?). I say this because I suspect that a manual approach will require less investment up front to get something working for 7.15 and avoids the need to do much, if any, performance testing of this tracing solution in the browser.
- As soon as the APM nodejs client is able to work without having to run the APM server, we can switch from the custom solution for the Kibana server to the solution based on APM nodejs agent. We will have to keep the custom implementation for the Kibana browser app until the apm-rum agent adds support for async stack traces in the browser.
Given @trentm's notes, this seems pretty close already. Trent, is it possible to get a rough idea of how implementing this could fit into your roadmap:
- Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.
- Look into an Async Local Storage-based context manager for the APM agent. It should eventually grow one.
- When Elasticsearch adds a deeper integration with APM events, we will refactor the communication protocol between Kibana server <--> Elasticsearch server to get rid of the x-opaque-id header in favor of tracing headers.
@Mpdreamz It seems that for the short-term it makes sense for Kibana to proceed with using x-opaque-id in order to get something out for 7.15. Agreed that using the proper tracing headers in the future would be ideal once they're available from Elasticsearch.
Let me back up a bit to make sure I understand some goals/decisions.
A large part of this effort is about getting some tracing info ("where did this slow ES query originate?") without relying on having an APM server set up. Is this based on the assumption/understanding/agreement that in many scenarios it isn't practical or likely that the owner of that stack deployment will set up the APM system? And is that still true for something like the --send-support-diagnostics flag that @joshdover mentioned?
add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control
I am not challenging, just trying to understand.
Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.
I don't know the apm agent internals, but we might need to refactor buffer logic to FIFO as well. From the maxQueueSize description:
If the queue is full, events are rejected which means you will lose transactions and spans.
No. At least the way I think we'd implement "hey APM agent, do your thing, but there is no APM server for now", there would be no buffering of events at all.
2\. As soon as the APM nodejs client is able to work without having to run the APM server, we can switch from the custom solution for the Kibana server to the solution based on `APM nodejs agent`. ...
To be clear, the current Node.js APM agent will currently work without an APM server, it will just log.error a bit and have some wasted overhead.
Are you thinking of using baggage (both the header, and generic Baggage-related APM agent APIs) similar to what I described above to transfer context info between client and server? If so, that would require that the client-side (Kibana browser app) be sending context info in the baggage header and Baggage format. I'm not sure if that would require some support added to the RUM APM agent. (There is otel baggage util TypeScript code here, if that would help.)
There are two separate Node.js APM agent pieces of work here:
Given @trentm's notes, this seems pretty close already. Trent, is it possible to get a rough idea of how implementing this could fit into your roadmap:
- Add support for explicitly disabling sending to APM server, which will save on time spent serializing those events.
- Look into an Async Local Storage-based context manager for the APM agent. It should eventually grow one.
I will bring it up with my team so we can discuss prioritization and get back here. Running without APM server and baggage support I expect will be fine. Adding Async Local Storage-based context management is probably unlikely for the short term.
given that the RUM agent can commit to providing this feature at some point
@joshdover, did you mean "the RUM agent cannot commit"?
add a --send-support-diagnostics CLI flag that enables the APM instrumentation and sends data to a central cluster that we control
Yes, Elasticsearch already provides slow query tracing. From the UX point of view, it makes sense to focus on end-to-end integration for the entire stack because the proposed alternative has some drawbacks:
- a user has to enable the debug mode explicitly - it's hard to spot and investigate sporadic problems

To be clear, the current Node.js APM agent will currently work without an APM server, it will just log.error a bit and have some wasted overhead.
As I understand it, the APM client hasn't been created with this use case in mind. I would wait until this mode is officially supported. Even if it just "log.error a bit and has some wasted overhead", it adds unnecessary noise to the logging output, which might increase log storage size and hurt the usefulness of the logs.
Are you thinking of using baggage (both the header, and generic Baggage-related APM agent APIs) similar to what I described above to transfer context info between client and server?
Probably; the mechanism of data transfer between the Kibana browser app and the server app is an internal implementation detail. I don't think that "Add baggage support" is a mandatory requirement for the first stage. We can switch to it later, once the APM nodejs agent supports it.
On the Kibana side, I'd rather we opt towards the manual context passing approach until we have something more sophisticated available from the RUM agent,
I started looking at zone.js. Even though it's leveraged by serious projects like OpenTelemetry, there are a few problems that might be considered critical for this option:
- zone.js doesn't work with native async/await syntax. Of course, we can work around it by downgrading the compilation target, but the problem doesn't seem to be fixable until zone.js-like functionality is provided natively.

@joshdover, did you mean "the RUM agent cannot commit"?
No, I meant that we should probably avoid building our own async context tracing solution in the browser if the RUM agent can commit to providing one for us in the future. That way we don't duplicate any efforts.
However, given the current state of the art, it doesn't appear that there is any viable solution to this yet. I'd love to find out more about why the TC39 proposal for Zones was withdrawn and whether there's a replacement proposal in the works.
Regarding changes for the Node.js agent:
- Baggage support: since Kibana will initially send the context via the x-opaque-id header itself, this will probably get a lower priority. -> https://github.com/elastic/apm-agent-nodejs/issues/2102
- Requires the APM server running
- Requires deployment of APM clusters in Elasticsearch
- Management overhead for APM data stored in Elasticsearch
- Doesn’t integrate with Elasticserch slow query logging
- Doesn’t support alternative output destinations (Kibana logs, Kibana slow logs, telemetry, etc.)
These should be addressed with https://github.com/elastic/apm-agent-nodejs/issues/2101 and by adding traceparent-based log correlation to both Kibana and Elasticsearch.
Using log correlation, it will be possible to correlate the logs from Kibana and Elasticsearch to a particular request.
The way I see it, this is more powerful than propagating proprietary/custom context in the baggage header.
As mentioned by others already, I'd also like to see ES move away from X-Opaque-Id in favor of adding the trace.id from the traceparent header to the logs.
If and when APM is enabled (I'm advocating for turning it on by default on cloud but that's a different discussion), we wouldn't even necessarily need the slow log but could query the APM index for slow ES queries. It would then be possible to view the full request from frontend to backend so that we exactly know where a slow query originates from, rather than just which component.
- Doesn’t support async tracing in the browser
Seems like this needs manual context propagation in the short to mid-term. But instead of doing something completely custom for Kibana, I'd suggest using the RUM agent's API to do the context propagation. I believe those APIs don't exist today but I think it's worth investing in adding them rather than doing something outside of APM.
These should be addressed with elastic/apm-agent-nodejs#2101 and by adding traceparent-based log correlation to both Kibana and Elasticsearch.
@felixbarny we are going to switch to the APM solution as soon as elastic/apm-agent-nodejs#2101 is done. As I can see, the APM agent team is planning to work on it during v7.15. In the meantime, Kibana can start with a custom solution not being blocked by APM agent improvements. I believe the APM-based solution will be able to reuse most of the logic introduced during https://github.com/elastic/kibana/issues/101587, namely: Core code instrumentation, Kibana plugin code instrumentation, integration tests.
I believe those APIs don't exist today but I think it's worth investing in adding them rather than doing something outside of APM.
Agree, until it's provided by the APM RUM agent, we will have to use a custom solution. Is there an issue to track the progress of the enhancement?
I'd also like to see ES to move away from X-Opaque-Id in favor of adding the trace.id from the traceparent header to the logs.
It's planned to be done at a later stage as well: https://github.com/elastic/kibana/issues/101708. Let me know if you are concerned about the order of improvements or if you can see a better way to organize this work.
As I can see, the APM agent team is planning to work on it during v7.15. In the meantime, Kibana can start with a custom solution not being blocked by APM agent improvements.
What's your timeline to add that to Kibana? If we'd add that in the early 7.14 or early 7.15 time frame, can we avoid the interim solution?
Let me know if you are concerned about the order of improvements or if you can see a better way to organize this work.
I fear that working on a custom solution now will make it much harder to migrate to an APM-based solution in the future.
Agree, until it's provided by the APM RUM agent, we will have to use a custom solution. Is there an issue to track the progress of the enhancement?
I don't think it's completely clear what functionality you'd need from the RUM agent. Could you open a ticket on https://github.com/elastic/apm-agent-rum-js or another Kibana issue to elaborate on the things you need?
@mshustov we reprioritized https://github.com/elastic/apm-agent-nodejs/issues/2101 and will address this next (after current in-progress stuff on Node.js Agent), so you can hopefully move forward with that very soon.
We've spent several days talking with @felixbarny about implementing this functionality and improving the observability for the whole Stack at the same time. I updated the https://github.com/elastic/kibana/issues/101587 description to reflect the current plan. Feel free to chime in.
The Elasticsearch performance team is interested in analyzing the timing of requests issued by Kibana to make sure any changes in the Elasticsearch server do not lead to regression in Kibana UI performance.
Their current approach of relying on the APM RUM agent capturing queries after a page load does not work well, since Kibana data management might perform a search request with a delay after the page load event, so the APM agent cannot associate the request with the page load transaction. Also, the data plugin sends search requests in a batch, which might break the APM RUM agent instrumentation of the outbound requests.
Yesterday we had a meeting with @gingerwizard, @danielmitterdorfer, @vigneshshanmugam, @wylieconlon, @ppisljar about improving the Kibana client-side instrumentation with the APM RUM agent. @gingerwizard is working on collecting requirements for performance metrics the Elasticsearch performance team wants to get from the Kibana side. Depending on the requirements we will start an investigation for the Kibana side instrumentation.
Some notes from the meeting:
If we are interested in tracing of search queries only, the AppServices team might instrument the search session mechanism to create a parent transaction for all the spans wrapping search requests issued while a search session is active. The result won't be very precise but might be good enough to reflect the influence of ES performance on Kibana UI.
A more detailed tracing mechanism might require Kibana plugins to instrument their data management code manually.
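A rough sketch of that idea with the RUM agent's custom transaction API (the transaction naming and the shape of the session wrapper are illustrative):

```js
// Sketch: group all search requests of one search session under a single
// parent transaction so they can be inspected together in APM.
import { apm } from '@elastic/apm-rum';

export function traceSearchSession(sessionId, runSearches) {
  const transaction = apm.startTransaction(`search-session:${sessionId}`, 'user-interaction');
  // runSearches is expected to return a Promise covering the whole session.
  return runSearches().finally(() => {
    if (transaction) transaction.end();
  });
}
```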
@lukeelmers @rudolf @kobelb
Would the traceparent standard be usable in the longer-term "right solution" (distributed tracing)? My high level understanding is that this is possible, in which case, awesome!

Context: Recent important escalation:
@arisonl I think it depends on what specific question we are trying to answer.
If the question is "Today can we trace slow Elasticsearch queries to their origin in Kibana?", the answer is yes: in 8.1+, you can trace queries back to Dashboards/Visualizations/Lens/Discover/Alerts/Maps using Kibana's execution_context feature (enabled by default), which passes this info along in the x-opaque-id header. (Docs)
If the question is "Today can we trace slow Elasticsearch queries back to a specific user?", the answer is yes, with some caveats (You need to be using audit logs, and can't be using the server.requestId.allowFromAnyIp or server.requestId.ipAllowlist settings in Kibana. Long story, but I have a writeup on this I can point you to if it helps.)
If the question is "Today can we trace slow Elasticsearch queries using traceparent?", the answer is no, we're still using x-opaque-id with the execution_context feature. To support this, APM would need to be enabled on a Kibana deployment, and we'd have to do some work on our end to make sure everything still works as expected and audit log correlation doesn't break.
(@Rudolf @kobelb keep me honest if I've missed anything on this) 😉
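For completeness, seeing those execution_context values in practice requires the search slow log to be enabled on the indices of interest; a sketch with the Elasticsearch JS client (the index name and thresholds are just examples):

```js
// Sketch: lower the search slow log thresholds on an index so that queries
// carrying the Kibana execution context (via x-opaque-id) show up in the log.
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function enableSearchSlowLog(index) {
  await client.indices.putSettings({
    index,
    body: {
      'index.search.slowlog.threshold.query.warn': '10s',
      'index.search.slowlog.threshold.fetch.warn': '1s',
    },
  });
}
```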
What are the main blockers that prevent you to use APM for these use cases? We've made several improvements in the Node.js agent to improve its performance and to make it possible to use it only for context propagation purposes, without it reaching out to the server. In the meantime, Elasticsearch has also adopted APM so that we can have full distributed tracing across Kibana and Elasticsearch, which should IMO be superior to manual correlation of a slowlog file. We also do have the vision of enabling APM on cloud by default. If we don't all agree on that we should have another discussion to align our vision.
The APM Team has a big interest in making this a success and we're here to support you.
@felixbarny thank you, this sounds great. Some problems related to the APM approach were summarized earlier in this thread but this is over a year old now. And APM distributed tracing does sound like the right end state. We have many requests asking to provide a way for ES queries to be traceable to user actions in Kibana or system initiated tasks (such as alerts) and so I think that we should pick this conversation up and review where we stand and what the most reasonable next steps would be. In addition, we also have use cases that involve being able to trace back in cross cluster search scenarios.
we should pick this conversation up and review where we stand and what the most reasonable next steps would be
I agree. It's hard to keep track of this long thread. Maybe we should close this and create a new one which describes the current state, the remaining issues, and the next steps. Who can own doing that?
we also have use cases that involve being able to trace back in cross cluster search scenarios.
Generally, CCS should just work™️ with APM.
At this point I don't think we are blocked on anything here. It mostly comes down to:
- Making sure that moving to traceparent doesn't break audit logs
- Today the execution_context service is enabled by default, so we would need to ask customers to set up APM if they want to do this type of query tracing moving forward (I'm not really too worried about that, but @arisonl maybe you have thoughts on it)
Kibana executes a lot of queries against Elasticsearch. Some of them originate from internal sources (task-manager claiming tasks, queries that are run on Kibana startup, etc.) and some of them are a result of a user's interaction with Kibana (opening a Visualization, using the Index-Management UI, etc).
Some of these queries can be slow and show up in the Elasticsearch slow log. However, when they do, it's not obvious what caused them to occur in the first place. This makes it very difficult for administrators to do anything to prevent them from occurring in the future.
It'd be incredibly helpful for us to be able to correlate a slow Elasticsearch query to the specific usage of a feature. For example, it should be possible to correlate an Elasticsearch query to a specific Visualization that's embedded in a Dashboard. This would allow the administrator to take corrective action.
Orthogonal or related issues
https://github.com/elastic/kibana/issues/77214 proposes custom request headers for the sake of usage analysis; however, there's potentially some overlap with a solution here.
https://github.com/elastic/kibana/issues/60122 proposes including the "execution context" in all Kibana log entries, but it doesn't propose forwarding these to Elasticsearch for correlation.