jtibshirani opened this issue 3 years ago
Pinging @elastic/es-core-infra (Team:Core/Infra)
Thanks @jtibshirani for the write up. @pgomulka is going to spend a bit of time investigating how we would implement this, and we'll discuss at the next core/infra team meeting.
I am not sure we can achieve this with the current infrastructure. Accessing X-Opaque-ID from other threads would require public access to `ThreadLocal.getMap(Thread t)` or `Thread.threadLocals`, which are package-private to the `java.lang` package. I am also not sure it would be a great design anyway to reach into another thread's private state.
@original-brownbear and @ywelsch Do you think this is achievable with ThreadContext? Or maybe you think there might be a totally different approach to this?
You would also first have to get ahold of the ThreadContext associated with the thread. And you would have to get ahold of the Thread object, as HotThreads is only operating on ThreadInfo, not Thread. From a quick look, the simplest way to implement something along the lines of what @jtibshirani asked would be to create a static concurrent map in ThreadContext, keyed by thread id (long Thread.getId()), update it whenever ThreadContext.threadLocal is updated, and then get the relevant information in HotThreads by doing a lookup on this map using ThreadInfo.getThreadId().
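To make the shape of that proposal concrete, here is a minimal sketch of a static registry keyed by thread id. The class and method names are hypothetical, not actual Elasticsearch code; in the real change this state would live inside ThreadContext and be updated wherever its thread-local is updated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a static concurrent map keyed by thread id.
// ThreadContext would call setOpaqueId on its update paths; HotThreads
// would call getOpaqueId with ThreadInfo.getThreadId().
public class OpaqueIdRegistry {
    private static final Map<Long, String> OPAQUE_IDS = new ConcurrentHashMap<>();

    // Record (or clear) the X-Opaque-ID active on the current thread.
    public static void setOpaqueId(String opaqueId) {
        long id = Thread.currentThread().getId();
        if (opaqueId == null) {
            OPAQUE_IDS.remove(id);
        } else {
            OPAQUE_IDS.put(id, opaqueId);
        }
    }

    // Look up the id for a foreign thread, e.g. from HotThreads sampling.
    public static String getOpaqueId(long threadId) {
        return OPAQUE_IDS.get(threadId);
    }
}
```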
Hi @ywelsch, I had a chat with @pgomulka on this today and I'm wondering if it would be possible to avoid creating a shared concurrent map by modifying the Thread names when we stash and restore the ThreadContext? Essentially, all we really need to do is show the X-Opaque-ID in the thread dump, and we can use the Thread name field for it. Right now a thread name would be something like elasticsearch[elastic][generic][T#2], which comes from the name of the thread pool I think, so the idea is to rename it to elasticsearch[elastic][generic][T#2]: X-Opaque-ID: 123.
I'm new here so I may not understand the full scope of this kind of change, but I can see some advantages if it were possible. We wouldn't have to run additional atomic operations on every request for an infrequent API call, HotThreads wouldn't need to be modified, and having the X-Opaque-ID in the thread name may come in handy during debugging or while using other external tools like JFR/Mission Control or VisualVM.
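As a rough illustration of the stash/restore idea, here is a hypothetical helper that appends the id to the current thread's name and later restores the saved original. The class name and the `: X-Opaque-ID: ` marker are assumptions for the sketch, not existing code.

```java
// Hypothetical sketch: tag the current thread's name with the X-Opaque-ID
// while a request's context is stashed, and restore it afterwards.
public class ThreadNameTagger {
    private static final String MARKER = ": X-Opaque-ID: ";

    // Append the id to the thread name; return the original name so the
    // caller can restore it when the context is restored.
    public static String tag(String opaqueId) {
        Thread t = Thread.currentThread();
        String original = t.getName();
        t.setName(original + MARKER + opaqueId);
        return original;
    }

    // Put the original name back on the restore path.
    public static void restore(String originalName) {
        Thread.currentThread().setName(originalName);
    }
}
```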
Note that hot_threads is more than just a thread dump. It takes multiple samples for each thread in order to identify common parts in the stack traces over a period of time for a given thread. HotThreads currently assumes that the thread name is stable, just picking the name from one sample. If X-Opaque-ID is encoded into the thread name, and it changes from sample to sample, couldn't this now mislead users into thinking that the thread had spent all this effort in the context of a given opaque id, whereas this was only applicable to a (small) subset of the samples?
Also, wouldn't the rename require you to parse the current thread name first, e.g. in order to find out what thread pool it belonged to (write/generic/...) so that you can emit the correct updated name?
I agree that having the X-Opaque-ID in the thread name for debugging / external tools could be handy as well (unless these tools operate on multiple samples as well and assume a stable thread name to identify identical threads). I'm not sure what a good practice here is. I don't hold a strong opinion on this either way.
> Note that hot_threads is more than just a thread dump. It takes multiple samples for each thread in order to identify common parts in the stack traces over a period of time for a given thread. HotThreads currently assumes that the thread name is stable, just picking the name from one sample. If X-Opaque-ID is encoded into the thread name, and it changes from sample to sample, couldn't this now mislead users into thinking that the thread had spent all this effort in the context of a given opaque id whereas this was only applicable to a (small) subset of the samples?
This is a very valid point, thanks for bringing it up. We would have to extract all X-Opaque-Ids from the thread names as we capture the thread stacks, and clean the opaque id out of the thread name when we report it. I think it's similar to what we would need to do if we used a map, but with less String manipulation and clean-up.
In a sense, if we implemented this feature it would have to be taken with a grain of salt anyway. If there were ten 50ms API calls on the same executor thread during the default interval that the HotThreads sampler thread sleeps (500ms), we'd miss at least nine of those Opaque-Ids with both solutions we have here. The only way to be 100% accurate would be via sliding-time-window id tracking in a list (per thread), with a 1 second window, assuming nobody uses a HotThreads interval longer than 500ms. I think that kind of tracking would be very expensive inside ThreadContext.
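For context on why that tracking would be costly, here is a minimal sketch of what per-thread sliding-window id tracking could look like. This is purely illustrative of the rejected approach; the class name, the synchronization, and the caller-supplied clock are assumptions, and the point is that every request would pay for recording and eviction.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of per-thread sliding-window id tracking: each id seen
// on a thread is recorded with a timestamp, and entries older than the window
// are evicted. One instance per thread would be needed.
public class SlidingWindowIds {
    private static final long WINDOW_MILLIS = 1000; // 1 second window

    private static final class Entry {
        final long timestamp;
        final String id;
        Entry(long timestamp, String id) {
            this.timestamp = timestamp;
            this.id = id;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    // Record an id observed at time 'now' (millis), evicting stale entries.
    public synchronized void record(String opaqueId, long now) {
        entries.addLast(new Entry(now, opaqueId));
        evict(now);
    }

    // All ids seen within the window ending at 'now', oldest first.
    public synchronized List<String> idsInWindow(long now) {
        evict(now);
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            result.add(e.id);
        }
        return result;
    }

    private void evict(long now) {
        while (!entries.isEmpty() && now - entries.peekFirst().timestamp > WINDOW_MILLIS) {
            entries.removeFirst();
        }
    }
}
```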
In short, I think with short running requests we'll always be misleading the users on the X-Opaque-Id. Perhaps the solution is to word the messaging around the Id in the report so that it speaks to this potential uncertainty?
> Also, wouldn't the rename require you to parse the current thread name first, e.g. in order to find out what thread pool it belonged to (write/generic/...) so that you can emit the correct updated name?
My thinking here was to simply append the X-Opaque-Id to whatever the thread name already is. By default it's set by the pool, so we'd add to it. However, now that you mention this, I see a way to get around having to store/restore the original thread name in a ThreadLocal, by doing String.split(threadName, 'X-Opaque-Id') on the restore path and simply reverting the name to the first String in the resulting array.
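The restore-path idea above can be sketched as a small helper that truncates the name at the marker, avoiding any per-thread storage of the original name. This uses `indexOf`/`substring` rather than `String.split`, which has the same effect here; the class name and marker string are illustrative assumptions.

```java
// Hypothetical sketch: instead of stashing the original name in a
// ThreadLocal, strip everything from the X-Opaque-ID marker onwards
// when restoring the ThreadContext.
public class ThreadNameRestorer {
    private static final String MARKER = ": X-Opaque-ID: ";

    // Return the thread name with any appended opaque id removed.
    public static String stripOpaqueId(String threadName) {
        int idx = threadName.indexOf(MARKER);
        return idx >= 0 ? threadName.substring(0, idx) : threadName;
    }
}
```

One caveat worth noting for a real implementation: this assumes the marker never appears in a pool-assigned thread name, and that clients cannot inject the marker via the header value itself.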
When searches are causing high CPU load, it's common to consult hot_threads to find the source of expensive searches. Working backwards to the source can be difficult, as it requires detailed knowledge of how searches are executed, often at the Lucene level.

We support a special header X-Opaque-Id that allows clients to tag a request with context information. Kibana is currently working on passing X-Opaque-Id on all search requests, to surface the application and component that initiated the request (https://github.com/elastic/kibana/issues/101587). The header is already included in search slow logs and tasks output. Could we also add it to hot_threads to help debug the source of expensive queries?

Note: I'm not familiar with the hot_threads implementation and am not sure how feasible this actually is.