Command Router: Errors/Delays when internal command topic got deleted

calohmn commented 3 years ago

After a protocol adapter pod has been stopped, the Command Router might still try to forward command messages on the internal command address (if the command consumers using that adapter were not properly unregistered).

This should not result in any problems - the assignment of any devices to the obsolete adapter instance id will be overwritten by devices connecting to different adapter instances over time anyway.

But a look at the logs of the Command Router reveals that the Kafka producer for the internal topic keeps trying to fetch metadata for the deleted topic for a long time:

09:11:29.622 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 18 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

09:11:29.726 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 19 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

09:11:29.830 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 20 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

calohmn commented 3 years ago

The behaviour of the KafkaProducer when trying to publish on a non-existing topic (with topic auto-creation disabled in the server) is:

if the producer currently has no local metadata for the topic: block on send() for the max.block.ms period (default 1 minute), resulting in an org.apache.kafka.common.errors.TimeoutException with the message "Topic [topic] not present in metadata after [max.block.ms value] ms."
if the producer currently still has local metadata for the topic, i.e. if the topic got deleted less than metadata.max.age.ms (default 5 minutes) after the producer last got a metadata update for the topic: block on send() for the delivery.timeout.ms period (default 2 minutes), resulting in an org.apache.kafka.common.errors.TimeoutException with the message "Expiring 1 record(s) for [topic]:[x >= delivery.timeout.ms] ms has passed since batch creation"

After that, there will be repeated further attempts (every 100ms it seems) to update the metadata for the topic on the kafka-producer-network-thread for a period of metadata.max.idle.ms (default 5 minutes).
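
To make the two cases above easier to compare, here is a minimal Java sketch that only uses java.util.Properties. The property keys and default values are the ones named in this comment; the class and method names (ProducerBlockingEstimate, timeUntilSendFailsMs) are made up for illustration and are not part of Hono or the Kafka client.

```java
import java.util.Properties;

/** Illustrative helper, not part of Hono or the Kafka client library. */
final class ProducerBlockingEstimate {

    /** Kafka producer defaults (in milliseconds) for the settings discussed above. */
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("max.block.ms", "60000");          // 1 minute
        p.setProperty("delivery.timeout.ms", "120000");  // 2 minutes
        p.setProperty("metadata.max.age.ms", "300000");  // 5 minutes
        p.setProperty("metadata.max.idle.ms", "300000"); // 5 minutes
        return p;
    }

    /**
     * Returns the setting that bounds how long it takes until sending to a
     * deleted topic fails with a TimeoutException, per the two cases above.
     *
     * @param hasCachedMetadata true if the producer still holds metadata for the topic.
     */
    static long timeUntilSendFailsMs(Properties config, boolean hasCachedMetadata) {
        String key = hasCachedMetadata ? "delivery.timeout.ms" : "max.block.ms";
        return Long.parseLong(config.getProperty(key));
    }
}
```

With the defaults, a command sent to a deleted topic fails after roughly one minute without cached metadata and after roughly two minutes with cached metadata.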

calohmn commented 3 years ago

Workarounds/ways to prevent this:

don't delete the hono.command_internal.[adapterInstance] topic on adapter shutdown -> this would leave behind many unused topics over time, ~~where you can't see which is still being used~~ (EDIT: with #2804, the ID format has been changed to include the pod name)
reduce the above mentioned KafkaProducer timeouts -> this would still result in quite long response times for commands that have no target available
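
As a rough sketch of the second workaround, the timeouts could be lowered along these lines. The concrete values are illustrative assumptions, not recommendations, and the class name ReducedProducerTimeouts is hypothetical. The one thing to keep in mind is that the Kafka producer rejects a configuration in which delivery.timeout.ms is smaller than linger.ms + request.timeout.ms, so request.timeout.ms (default 30 seconds) usually has to be lowered together with delivery.timeout.ms.

```java
import java.util.Properties;

/** Illustrative helper; the values passed in are the caller's choice. */
final class ReducedProducerTimeouts {

    /**
     * Builds producer properties with reduced timeouts (all values in milliseconds).
     * Mirrors the producer's own constraint:
     * delivery.timeout.ms >= linger.ms + request.timeout.ms.
     */
    static Properties build(long maxBlockMs, long deliveryTimeoutMs,
                            long requestTimeoutMs, long lingerMs) {
        if (deliveryTimeoutMs < lingerMs + requestTimeoutMs) {
            throw new IllegalArgumentException(
                    "delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }
        Properties p = new Properties();
        p.setProperty("max.block.ms", Long.toString(maxBlockMs));
        p.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs));
        p.setProperty("request.timeout.ms", Long.toString(requestTimeoutMs));
        p.setProperty("linger.ms", Long.toString(lingerMs));
        return p;
    }
}
```

For example, ReducedProducerTimeouts.build(5000, 10000, 5000, 0) would bound the failure time to 5 or 10 seconds, which is still long for a command that has no target available.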

The proper way to solve this is to check the status of the corresponding adapter instance before publishing on the hono.command_internal.[adapterInstance] topic, using the AdapterInstancesLivenessService, as planned in #2028.
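
The idea of checking liveness before publishing could look roughly like the following. This is a sketch under assumptions: the real AdapterInstancesLivenessService interface is not shown in this thread, so it is stood in for by a plain Predicate<String>, and the class name InternalCommandTopicGuard is hypothetical. Only the topic name format hono.command_internal.[adapterInstance] is taken from the discussion above.

```java
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative guard; not Hono's actual implementation. */
final class InternalCommandTopicGuard {

    private static final String TOPIC_PREFIX = "hono.command_internal.";

    // Stand-in (assumption) for a liveness service lookup by adapter instance id.
    private final Predicate<String> isAdapterInstanceAlive;

    InternalCommandTopicGuard(Predicate<String> isAdapterInstanceAlive) {
        this.isAdapterInstanceAlive = isAdapterInstanceAlive;
    }

    /**
     * Returns the internal command topic to publish on, or an empty Optional
     * if the adapter instance is not alive, so that the caller can fail the
     * command right away instead of waiting for a producer timeout.
     */
    Optional<String> targetTopic(String adapterInstanceId) {
        if (!isAdapterInstanceAlive.test(adapterInstanceId)) {
            return Optional.empty();
        }
        return Optional.of(TOPIC_PREFIX + adapterInstanceId);
    }
}
```

The point of returning an empty result early is that the producer never gets asked to send to a topic that has probably been deleted, which avoids both the blocking and the repeated metadata requests.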

calohmn commented 3 years ago

An AdapterInstancesLivenessService has now been implemented and is being used before publishing on the internal command topic (#2028).

Edge cases might still occur where the liveness service hasn't yet noticed that the adapter is dead and commands still get forwarded to it. But in any case this would be an exceptional scenario. Normally, on adapter shutdown, the command-to-adapterInstance mappings first get removed by the adapter (via unregisterCommandConsumer invocations, see also #2760) and only then is the internal command topic deleted. To prevent such scenarios, which are probably caused by errors when invoking unregisterCommandConsumer, I think it would make sense to first look at #2760.
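
The shutdown order described above can be sketched as follows. The Hooks interface and its method signatures are hypothetical placeholders for the adapter's real calls, which are not shown in this thread; only the ordering (unregister first, delete the topic afterwards) is taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative shutdown ordering; not Hono's actual implementation. */
final class AdapterShutdownSequence {

    /** Hypothetical hooks standing in for the adapter's real operations. */
    interface Hooks {
        /** Returns true if the mapping for the device was removed successfully. */
        boolean unregisterCommandConsumer(String deviceId);

        void deleteInternalCommandTopic(String adapterInstanceId);
    }

    /**
     * Removes all command-to-adapterInstance mappings first and deletes the
     * internal command topic only afterwards.
     *
     * @return the device ids whose unregistration failed; these are the
     *         mappings that could later point at a deleted topic.
     */
    static List<String> shutDown(String adapterInstanceId, List<String> deviceIds, Hooks hooks) {
        List<String> failed = new ArrayList<>();
        for (String deviceId : deviceIds) {
            if (!hooks.unregisterCommandConsumer(deviceId)) {
                failed.add(deviceId);
            }
        }
        hooks.deleteInternalCommandTopic(adapterInstanceId);
        return failed;
    }
}
```

A non-empty result corresponds to the exceptional scenario discussed here: mappings that survive the shutdown and still reference the obsolete adapter instance.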

eclipse-hono / hono

Command Router: Errors/Delays when internal command topic got deleted #2773