elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Bedrock sender sleeping in inference_utility thread pool #115079

Open frensjan opened 4 days ago

frensjan commented 4 days ago

Elasticsearch Version

8.15

Installed Plugins

No response

Java Version

17

OS Version

Debian bookworm

Problem Description

Probably nothing is broken, but it is confusing: after upgrading to ES 8.15 we're seeing the AmazonBedrockRequestExecutorService continuously occupying a thread in the inference_utility thread pool. It appears to be sleeping in handleTasks().

An example stack trace:

   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(java.base@17.0.12/Native Method)
        at java.lang.Thread.sleep(java.base@17.0.12/Thread.java:344)
        at java.util.concurrent.TimeUnit.sleep(java.base@17.0.12/TimeUnit.java:446)
        at org.elasticsearch.xpack.inference.external.http.sender.RequestExecutorService.lambda$static$0(org.elasticsearch.inference@8.15.2/RequestExecutorService.java:66)
        at org.elasticsearch.xpack.inference.external.http.sender.RequestExecutorService$$Lambda$5012/0x00007fa300c68ff8.sleep(org.elasticsearch.inference@8.15.2/Unknown Source)
        at org.elasticsearch.xpack.inference.external.http.sender.RequestExecutorService.handleTasks(org.elasticsearch.inference@8.15.2/RequestExecutorService.java:240)
        at org.elasticsearch.xpack.inference.external.http.sender.RequestExecutorService.start(org.elasticsearch.inference@8.15.2/RequestExecutorService.java:192)
        at org.elasticsearch.xpack.inference.external.http.sender.AmazonBedrockRequestExecutorService.start(org.elasticsearch.inference@8.15.2/AmazonBedrockRequestExecutorService.java:19)
        at org.elasticsearch.xpack.inference.external.amazonbedrock.AmazonBedrockRequestSender.lambda$start$0(org.elasticsearch.inference@8.15.2/AmazonBedrockRequestSender.java:89)
        at org.elasticsearch.xpack.inference.external.amazonbedrock.AmazonBedrockRequestSender$$Lambda$5018/0x00007fa300c750b0.run(org.elasticsearch.inference@8.15.2/Unknown Source)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(org.elasticsearch.server@8.15.2/ThreadContext.java:917)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1136)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
        at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
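
In broad strokes, this looks like the common pattern of a long-lived poller submitted to a thread pool: it drains a task queue and sleeps when there is nothing to do, so one worker thread stays busy for the lifetime of the sender. A minimal, illustrative sketch of that pattern (not the actual RequestExecutorService code; all names below are made up):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch: a long-lived poller submitted to a fixed pool.
    // One worker thread stays "active" for the lifetime of the service,
    // even though it spends most of its time sleeping between polls.
    public class PollingSenderSketch {
        private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        public void start(ExecutorService pool) {
            pool.execute(() -> {
                while (running) {
                    Runnable task = tasks.poll();
                    if (task != null) {
                        task.run();
                    } else {
                        try {
                            // Sleeping here keeps the worker thread occupied, which is
                            // what shows up as one permanently active thread in the stats.
                            TimeUnit.MILLISECONDS.sleep(50);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                            return;
                        }
                    }
                }
            });
        }

        public void submit(Runnable task) {
            tasks.add(task);
        }

        public void stop() {
            running = false;
        }
    }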

It's just one thread out of the 10, so losing 10% of the capacity is probably not too big an issue. The annoying part is that we monitor the ES thread pools with a PromQL query over the metrics provided by the Elasticsearch Exporter, and that alert fires because the core size of this pool is 0:

max(elasticsearch_thread_pool_active_count) by (type) / avg(elasticsearch_thread_pool_threads_count > 0) by (type)

Steps to Reproduce

Just use the cat thread pool API to see that this pool always has at least one thread active. Taking a thread or heap dump shows AmazonBedrockRequestExecutorService and related classes.
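
For example, something along these lines (a sketch; the column names come from the cat thread pool API, but the exact selection is just what seemed useful here):

    GET _cat/thread_pool/inference_utility?v&h=node_name,name,type,active,queue,rejected,core,max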

Logs (if relevant)

No response

frensjan commented 4 days ago

What could perhaps also help here: if the _nodes/stats endpoint exposed the max size of the thread pool, it would be clear that there is still capacity in the pool.
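
For reference, the per-pool object returned by GET _nodes/stats/thread_pool looks roughly like this (values are illustrative), so the maximum size is not visible from the stats alone:

    "inference_utility": {
      "threads": 1,
      "queue": 0,
      "active": 1,
      "rejected": 0,
      "largest": 1,
      "completed": 12345
    }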

elasticsearchmachine commented 4 days ago

Pinging @elastic/ml-core (Team:ML)