Investigate slower liveness check request processing in Hono 2.0

calohmn commented 2 years ago

Compared with previous Hono versions, HTTP requests on the /liveness endpoint of Hono components seem to take longer on Hono 2.0 components. This means the Kubernetes liveness probes run into a timeout more often than before a Hono 2.0 update.

This was observed with a liveness probe timeout set to 1s on device-registry-mongodb and (to a lesser extent) hono-service-auth components.

In Hono 2.0, liveness requests get handled via the Quarkus SmallRye Health component. This involves some more overhead compared to the plain vert.x HTTP server implementation before. The question is, in how far this is relevant concerning the observed request durations. Example stacktrace of the invocation of the SmallRye Health AsyncHealthCheck handler:

at org.eclipse.hono.service.SmallRyeHealthCheckServer$SmallRyeHealthAdapter.lambda$register$4(SmallRyeHealthCheckServer.java:167)
at io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)
at io.smallrye.mutiny.operators.uni.builders.UniCreateFromDeferredSupplier.subscribe(UniCreateFromDeferredSupplier.java:25)
at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
at io.smallrye.mutiny.operators.uni.UniOnFailureFlatMap.subscribe(UniOnFailureFlatMap.java:31)
at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
at io.smallrye.mutiny.operators.uni.UniOnItemTransform.subscribe(UniOnItemTransform.java:22)
at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
at io.smallrye.mutiny.operators.uni.UniAndCombination$UniHandler.subscribe(UniAndCombination.java:234)
at io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.run(UniAndCombination.java:88)
at io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.access$000(UniAndCombination.java:57)
at io.smallrye.mutiny.operators.uni.UniAndCombination.subscribe(UniAndCombination.java:54)
at io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)
at io.smallrye.mutiny.operators.uni.UniBlockingAwait.await(UniBlockingAwait.java:60)
at io.smallrye.mutiny.groups.UniAwait.atMost(UniAwait.java:65)
at io.smallrye.health.SmallRyeHealthReporter.getLiveness(SmallRyeHealthReporter.java:221)
at io.smallrye.health.SmallRyeHealthReporter_ClientProxy.getLiveness(Unknown Source)
at io.quarkus.smallrye.health.runtime.SmallRyeLivenessHandler.getHealth(SmallRyeLivenessHandler.java:11)
at io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.doHandle(SmallRyeHealthHandlerBase.java:44)
at io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:31)
at io.quarkus.smallrye.health.runtime.SmallRyeLivenessHandler.handle(SmallRyeLivenessHandler.java:7)
at io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:19)
at io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159)
at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:548)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2449)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1478)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)

Handling switches from a vert.x event-loop-thread to an executor thread here (via Context.executeBlocking).

The device-registry and hono-service-auth components don't even have specific liveness check actions. The protocol adapters and the Command Router use a "EventLoopBlockedCheck". I'm not sure whether that check is really needed.

calohmn commented 2 years ago

In Hono 1.x, using the Hono Helm chart, the liveness/readiness probes were done using HTTPS by default, whereas with Hono 2.0, there was a switch to plain HTTP.

sophokles73 commented 2 years ago

Shouldn't this make processing the requests faster rather than slower?

calohmn commented 2 years ago

Yes, it makes the observed change in behaviour all the more strange.

eclipse-hono / hono

Investigate slower liveness check request processing in Hono 2.0 #3316