cryostatio / cryostat-legacy

OUTDATED - See the new repository below! -- Secure JDK Flight Recorder management for containerized JVMs
https://github.com/cryostatio/cryostat

[Bug] Hang on opening JMX connection #1778

Open andrewazores opened 11 months ago

andrewazores commented 11 months ago

Current Behavior

Nov 17, 2023 2:17:07 PM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
WARNING: Thread Thread[vert.x-worker-thread-12,5,main] has been blocked for 4491255 ms, time limit is 60000 ms
io.vertx.core.VertxException: Thread blocked
at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@17.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base@17.0.9/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
at app//io.cryostat.net.TargetConnectionManager.executeConnectedTask(TargetConnectionManager.java:168)
at app//io.cryostat.net.web.http.api.v1.TargetRecordingsGetHandler.handleAuthenticated(TargetRecordingsGetHandler.java:124)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:102)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:72)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator$$Lambda$1152/0x00007fec00a5fd40.handle(Unknown Source)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$0(ContextBase.java:137)
at app//io.vertx.core.impl.ContextBase$$Lambda$1084/0x00007fec009130b0.handle(Unknown Source)
at app//io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)
at app//io.vertx.core.impl.ContextBase$$Lambda$1077/0x00007fec00917210.run(Unknown Source)
at app//io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
at app//io.vertx.core.impl.TaskQueue$$Lambda$179/0x00007fec001c6780.run(Unknown Source)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)

Expected Behavior

If the JMX connection cannot be opened within the defined timeout, the connection attempt should be aborted.

This stack trace indicates not only that the connection attempt is not aborted, but also that the connection is being opened directly on a Vert.x worker pool thread, rather than on an additional application-managed thread as it probably should be.
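One way to get the expected behavior is to bound the blocking connection attempt with a `Future.get(timeout)` and cancel it on expiry, so a hung handshake interrupts its own thread instead of parking a worker forever. The sketch below is a minimal stdlib illustration of that pattern, not Cryostat's actual `TargetConnectionManager` code; `connect()` is a hypothetical stand-in for the JMX handshake.

```java
import java.util.concurrent.*;

public class TimeoutConnect {
    // Hypothetical stand-in for a blocking JMX connection attempt
    static String connect() throws InterruptedException {
        Thread.sleep(5_000); // simulate a hung handshake
        return "connected";
    }

    static String connectWithTimeout(ExecutorService pool, long timeoutMs) {
        Future<String> attempt = pool.submit(TimeoutConnect::connect);
        try {
            // Wait at most timeoutMs instead of blocking indefinitely
            return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            attempt.cancel(true); // interrupt the hung connection thread
            return "aborted";
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // The 5 s "handshake" exceeds the 200 ms deadline, so it is aborted
        System.out.println(connectWithTimeout(pool, 200));
        pool.shutdownNow();
    }
}
```

The key point is that the timeout lives on a dedicated connection thread, so the Vert.x worker that requested the connection gets its answer (or an abort) within the deadline.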

Steps To Reproduce

No response

Environment

Anything else?

Probably related:

andrewazores commented 11 months ago

ebaron: I think the worst part of this is that it doesn't cause the liveness probe to fail. AppSRE doesn't seem to like us needing to restart pods manually. /health/liveness returns 204, but /health returns 504.

aazores: That's a good observation. /health does a little more: it also checks the status of the datasource/dashboard/reports sidecars, and therefore it runs on a worker thread from the same pool that is blocked by the original bug. /health/liveness just returns immediately on the Vert.x event loop thread. So the event loop thread is still alive and unblocked, but there are no unblocked worker threads available to dispatch to for more complex requests.

andrewazores commented 11 months ago

@ebaron it might therefore be useful to force /health/liveness to delegate to a worker thread, even if that isn't technically necessary, just so it also evaluates whether that pool is actually responsive, since most of the useful API calls have to go through that layer. That doesn't fix the problem, but it at least allows container management systems to detect this case and restart the container.
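The suggested mitigation amounts to making the liveness check dispatch a trivial task to the worker pool with a deadline, and reporting unhealthy if it cannot complete. A minimal stdlib sketch of that idea, assuming a plain `ExecutorService` in place of the Vert.x worker pool (method and class names here are illustrative, not Cryostat's):

```java
import java.util.concurrent.*;

public class WorkerAwareLiveness {
    // Returns true only if the worker pool can run a no-op within the
    // deadline, so a wedged pool fails the probe instead of looking healthy.
    static boolean live(ExecutorService workers, long deadlineMs) {
        Future<?> probe = workers.submit(() -> {});
        try {
            probe.get(deadlineMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (Exception e) {
            probe.cancel(true);
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(1);
        System.out.println(live(workers, 500)); // responsive pool
        // Wedge the only worker, as the hung JMX connection does
        workers.submit(() -> { Thread.sleep(10_000); return null; });
        System.out.println(live(workers, 200)); // blocked pool
        workers.shutdownNow();
    }
}
```

With this in place, the orchestrator's liveness probe fails once the pool is exhausted, and the container gets restarted automatically instead of requiring manual intervention.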

andrewazores commented 11 months ago

Mistakenly closed by the previous PR. That change is only a mitigation: it helps detect this case and allows container management systems to restart Cryostat, but it does not actually prevent the hang from happening.