eclipse-hono / hono

Eclipse Hono™ Project
https://eclipse.dev/hono
Eclipse Public License 2.0

Command Router: Errors/Delays when internal command topic got deleted #2773

Closed calohmn closed 3 years ago

calohmn commented 3 years ago

After a protocol adapter pod has been stopped, the Command Router might still try to forward command messages on the internal command address (if the command consumers using that adapter were not properly unregistered).

This should not result in any problems - the assignment of any devices to the obsolete adapter instance id will be overwritten by devices connecting to different adapter instances over time anyway.

But a look at the logs of the Command Router reveals that the Kafka producer for the internal topic keeps trying to fetch metadata for the deleted topic for a long time:

09:11:29.622 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 18 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

09:11:29.726 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 19 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

09:11:29.830 [kafka-producer-network-thread | hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] WARN  o.apache.kafka.clients.NetworkClient - 
[Producer clientId=hono-command-router-producer-internal-cmd-sender-497db86c-e182-4f78-aa43-707f3bdb076b] 
Error while fetching metadata with correlation id 20 : {hono.command_internal.HonoMQTTAdapter-1b478ab7-db57-4a27-9ba5-5ab0ae00e16f=UNKNOWN_TOPIC_OR_PARTITION}

calohmn commented 3 years ago

Related: KAFKA-3450: Producer blocks on send to topic that doesn't exist if auto create is disabled

calohmn commented 3 years ago

The behaviour of the KafkaProducer when trying to publish on a non-existing topic (with topic auto-creation disabled in the server) is:

After that, the kafka-producer-network-thread keeps retrying the metadata update for the topic (apparently every 100 ms) for a period of metadata.max.idle.ms (default: 5 minutes).
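To illustrate where these timings are configured, here is a minimal sketch of producer properties that shorten the retry window. It is not taken from the Hono code base: the class and method names are hypothetical, the values are purely illustrative, and `max.block.ms` (the Kafka producer setting that bounds how long `send()` waits for metadata) is an addition not mentioned above. Only plain `java.util.Properties` is used, so no Kafka client is needed to run it.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hono: collects producer settings that
// bound how long a producer keeps dealing with a topic that no longer exists.
final class InternalCommandProducerConfig {

    private InternalCommandProducerConfig() {
    }

    static Properties boundedMetadataRetries(final String bootstrapServers) {
        final Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Assumption: 5 s is an acceptable upper bound for send() to wait for
        // topic metadata (the Kafka default is 60 s).
        props.setProperty("max.block.ms", "5000");
        // Period after which the producer forgets an unused topic and stops
        // refreshing its metadata (default 5 minutes, as noted above).
        // Illustrative value only.
        props.setProperty("metadata.max.idle.ms", "10000");
        return props;
    }

    public static void main(final String[] args) {
        boundedMetadataRetries("localhost:9092")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Lower values reduce the time spent on a deleted topic, but they also make the producer drop cached metadata for rarely used topics sooner, so this is a trade-off rather than a fix.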

calohmn commented 3 years ago

Workarounds/ways to prevent this:

The proper way to solve this is to check the status of the corresponding adapter instance before publishing on the hono.command_internal.[adapterInstance] topic, using the AdapterInstancesLivenessService, as planned in #2028.
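The idea of checking liveness before publishing can be sketched roughly as follows. This is only an illustration under stated assumptions: the `InternalCommandForwarder` class, its methods, and the `Predicate<String>` standing in for the liveness check are all hypothetical and do not reflect the actual AdapterInstancesLivenessService API. Publishing is simulated with an in-memory list instead of a Kafka producer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch, not Hono code: forwards a command only if the target
// adapter instance is reported as alive.
final class InternalCommandForwarder {

    // Stand-in for a liveness service lookup (assumption).
    private final Predicate<String> adapterIsAlive;
    // Simulates the records that would be sent to Kafka.
    private final List<String> published = new ArrayList<>();

    InternalCommandForwarder(final Predicate<String> adapterIsAlive) {
        this.adapterIsAlive = adapterIsAlive;
    }

    // Topic naming follows the hono.command_internal.[adapterInstance] pattern.
    static String internalTopic(final String adapterInstanceId) {
        return "hono.command_internal." + adapterInstanceId;
    }

    // Returns true if the command was published, false if it was skipped
    // because the adapter instance is considered dead.
    boolean forward(final String adapterInstanceId, final String command) {
        if (!adapterIsAlive.test(adapterInstanceId)) {
            // Skip publishing so the producer never touches a deleted topic.
            return false;
        }
        published.add(internalTopic(adapterInstanceId) + ":" + command);
        return true;
    }

    List<String> published() {
        return List.copyOf(published);
    }

    public static void main(final String[] args) {
        final Set<String> alive = Set.of("adapter-a");
        final InternalCommandForwarder forwarder = new InternalCommandForwarder(alive::contains);
        System.out.println(forwarder.forward("adapter-a", "setBrightness"));
        System.out.println(forwarder.forward("adapter-b", "setBrightness"));
        System.out.println(forwarder.published());
    }
}
```

In a real Command Router, a skipped command would still need to be handled, for example by reporting a failure to the sender; that part is left out here.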

calohmn commented 3 years ago

An AdapterInstancesLivenessService has now been implemented and is being used before publishing on the internal command topic (#2028).

Edge cases may still occur in which the liveness service hasn't yet noticed that the adapter is dead and commands still get forwarded to it. But this would be an exceptional scenario in any case. Normally, on adapter shutdown, the adapter first removes the command-to-adapterInstance mappings (via unregisterCommandConsumer invocations, see also #2760), and only then is the internal command topic deleted. Since such scenarios are probably caused by errors when invoking unregisterCommandConsumer, I think it would make sense to look at #2760 first in order to prevent them.