Open andrewazores opened 11 months ago
ebaron I think the worst part of this is that it doesn't cause the liveness probe to fail. AppSRE doesn't seem to like us having to restart pods manually. /health/liveness returns 204, but /health returns 504.
aazores That's a good observation. /health does a little more: it also checks the status of the datasource/dashboard/reports sidecars, so it runs on a worker thread from the same pool that is blocked by the original bug. /health/liveness just returns immediately on the Vert.x event loop thread. So the event loop thread is still alive and unblocked, but there are no unblocked worker threads available to dispatch more complex requests to.
@ebaron it might therefore be useful to force /health/liveness to delegate off to a worker thread, even if it isn't technically necessary, just so that it can also evaluate whether that pool is actually responsive, since most of the actual useful API calls have to go through that layer. That doesn't fix the problem, but at least it allows container management systems to better detect this case and perform a container restart.
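A minimal sketch of that idea, using plain `java.util.concurrent` rather than the Vert.x API so that it stays self-contained. The class and method names (`WorkerPoolLiveness`, `isResponsive`) are hypothetical and not taken from the Cryostat codebase. The probe submits a no-op task to the worker pool and reports healthy only if the task actually runs within a timeout:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: checks whether a worker pool can still run new tasks.
final class WorkerPoolLiveness {
    private WorkerPoolLiveness() {}

    /** Returns true only if the pool executes a trivial task within timeoutMs. */
    static boolean isResponsive(ExecutorService workerPool, long timeoutMs) {
        // The no-op only completes if some worker thread is free to pick it up.
        Future<?> probe = workerPool.submit(() -> {});
        try {
            probe.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Every worker is busy or blocked: drop the queued probe and report failure.
            probe.cancel(true);
            return false;
        } catch (InterruptedException e) {
            probe.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }
}
```

A liveness handler built on this would return 204 when `isResponsive` is true and an error status otherwise, so the probe fails once the pool is exhausted. In Vert.x itself the equivalent would presumably dispatch the no-op through `executeBlocking` and fail the response on a timer instead of blocking on `Future.get`, since blocking the event loop thread is not acceptable.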
Mistakenly closed by the previous PR. That PR is only a mitigation: it helps detect this case and allows container management systems to restart Cryostat, but it does not actually prevent this from happening.
Current Behavior
Expected Behavior
If the JMX connection cannot be opened within the defined timeout, the connection attempt should be aborted.
This stack trace indicates not only that the connection attempt is not aborted, but also that the connection is being made directly on a Vert.x worker pool thread instead of on an additional application-managed thread, as it probably should be.
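As an illustration of the expected behavior, here is a hedged sketch of a connection attempt that runs on an application-managed thread and is abandoned after a timeout. All names (`BoundedConnect`, `connectWithTimeout`, the `jmx-connect` thread name) are hypothetical, and the `Callable` stands in for whatever actually opens the JMX connection. This is not how Cryostat implements it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a connection attempt off the framework's worker pool.
final class BoundedConnect {
    private BoundedConnect() {}

    // Application-managed daemon threads, separate from any framework worker pool.
    private static final ExecutorService CONNECT_POOL =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "jmx-connect");
                t.setDaemon(true);
                return t;
            });

    /** Runs connect on a dedicated thread; throws TimeoutException if it takes too long. */
    static <T> T connectWithTimeout(Callable<T> connect, long timeoutMs)
            throws TimeoutException {
        Future<T> attempt = CONNECT_POOL.submit(connect);
        try {
            return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            attempt.cancel(true); // interrupt the connecting thread and give up
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("connection attempt failed", e.getCause());
        } catch (InterruptedException e) {
            attempt.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            throw new IllegalStateException("interrupted while connecting", e);
        }
    }
}
```

One caveat: `cancel(true)` only interrupts the connecting thread. If the underlying connect call is blocked in socket I/O that ignores interruption, that thread may linger until the I/O itself times out. The caller is still released after `timeoutMs`, which keeps the worker pool from being starved.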
Steps To Reproduce
No response
Environment
Anything else?
Probably related:
1661
1498
1448
1106
929
312