eclipse-hono / hono

Eclipse Hono™ Project
https://eclipse.dev/hono
Eclipse Public License 2.0

Error in DelegatedCommandSender while sending command to another adapter instance #1457

Closed DanielMaier-BSI closed 5 years ago

DanielMaier-BSI commented 5 years ago

We get the following error in DelegatedCommandSender very frequently in our automated system tests (quoted from the tracing output):

{ "key": "error.object", "type": "string", "value": "org.eclipse.hono.client.ServerErrorException: peer did not settle message, failing delivery" }

The corresponding log statement:

peer did not settle message [message ID: DelegatedCommandSenderImpl-304, remote state: Rejected], failing delivery

Our test does the following:

This sequence runs multiple times for the same device, and we have multiple MQTT adapter instances running, i.e. it is very likely that the device is connected to different adapters during the tests.

If we add a 1-second sleep between receiving the event that signals that the device is ready to receive commands and sending the command, we do not get this error.

What do you think: could this be caused by a timing issue in Hono, or is it more likely related to our messaging infrastructure?

sophokles73 commented 5 years ago

@calohmn can you please take a look at this?

calohmn commented 5 years ago

Adding some further debug output has revealed that the Rejected disposition has the error "Deliveries cannot be sent to an unavailable address":

Rejected{error=Error{condition=amqp:not-found, description='Deliveries cannot be sent to an unavailable address', info=null}}

(I've improved log output since to always reveal this, see https://github.com/eclipse/hono/commit/7aa91cd29f3d16d0f28033c3de3bade16e09b753.)

What seemingly happened here: The DelegatedCommandSenderImpl, which forwards the received command on the device-specific command address, uses a sender link on the anonymous relay (so that one link can be used for multiple tenants/devices). That means the sender doesn't get credit for the particular address used here, so there is no assurance that a receiver for the device-specific command address actually exists. The Qpid error message above indicates just that: no receiver was available on that address.

At first sight, this doesn't seem possible, because we always create a device-specific command receiver link first, before the application gets the indication that a device can receive a command. What comes into play here, however, is that a network of multiple Qpid Dispatch Router instances is used.

So, the steps were: The receiver link gets created on a connection to Qpid instance A. Then the command message gets sent on the anonymous relay on a connection to Qpid instance B. The information about the receiver link hadn't yet been propagated through the router network to Qpid instance B, so an "unavailable address" error is returned.
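The sequence can be reproduced in a toy model. The following is only a sketch in plain Java, not Qpid Dispatch Router or Hono code: `ToyRouter`, its methods and the address string are hypothetical names chosen for illustration, and the asynchronous propagation is reduced to an explicit `learnFrom` call.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of one router in a router network (hypothetical, for illustration only). */
final class ToyRouter {
    // Addresses for which this router currently knows of an attached receiver.
    private final Set<String> knownAddresses = new HashSet<>();

    /** A receiver attaches to this router; only this router learns about it immediately. */
    void attachReceiver(String address) {
        knownAddresses.add(address);
    }

    /** Stands in for the asynchronous propagation of address information from a peer. */
    void learnFrom(ToyRouter peer) {
        knownAddresses.addAll(peer.knownAddresses);
    }

    /** Send via the anonymous relay: the outcome depends on this router's current view. */
    String sendAnonymous(String address) {
        return knownAddresses.contains(address) ? "accepted" : "rejected: amqp:not-found";
    }
}

final class PropagationRaceDemo {
    public static void main(String[] args) {
        ToyRouter a = new ToyRouter();
        ToyRouter b = new ToyRouter();
        String address = "command/TENANT/DEVICE"; // hypothetical address format

        a.attachReceiver(address);                    // receiver link created via instance A
        System.out.println(b.sendAnonymous(address)); // sent via B too early: rejected: amqp:not-found

        b.learnFrom(a);                               // propagation has caught up
        System.out.println(b.sendAnonymous(address)); // accepted
    }
}
```

The 1-second sleep in the test has the same effect as the `learnFrom` call here: it gives instance B time to learn about the receiver before the command is sent.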

See also the related discussion on the Qpid mailing list.

calohmn commented 5 years ago

To fix this, I see these options:

  1. Insert a delay after creating the device-specific command consumer link and before informing the application about the fact that the device can receive commands. During this delay, the information about the attached receiver is supposed to be propagated in the router network, so that when sending the command message on the anonymous relay link, the message gets accepted (hopefully).
  2. Use a sender link with the device-specific command address (instead of using the anonymous relay). In case the received command has to be forwarded via the DelegatedCommandSenderImpl, the sender link is opened, the command message is sent and then the link is closed.
  3. Use the current solution (with anonymous relay link) and do retries if sending the command message failed with the above error.

With solution 1, the question is how to choose a delay that is small enough not to slow down the whole command processing too much (which could be noticeable especially in the HTTP case) but large enough to ensure that Qpid has propagated the receiver info to all instances.

With solution 3, the question is which delay period and which number of retries to choose.
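To make the two tuning parameters of solution 3 concrete, here is a minimal sketch of a bounded retry loop in plain Java. It is not Hono code: `RetryingSend`, `Outcome` and `sendWithRetry` are hypothetical names, and the send attempt is abstracted as a `Supplier`.

```java
import java.util.function.Supplier;

final class RetryingSend {
    /** Outcome of one send attempt in this sketch (hypothetical, not a Hono type). */
    enum Outcome { ACCEPTED, NOT_FOUND }

    /**
     * Repeats the attempt while it yields NOT_FOUND, up to maxAttempts in total,
     * waiting delayMillis between attempts.
     *
     * @return the number of attempts used on success, or -1 if all attempts failed
     *         or the waiting thread was interrupted.
     */
    static int sendWithRetry(Supplier<Outcome> attempt, int maxAttempts, long delayMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.get() == Outcome.ACCEPTED) {
                return i;
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(delayMillis); // blocking wait, acceptable only in a sketch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

The worst-case added latency is roughly `(maxAttempts - 1) * delayMillis`, which is the trade-off in question. In an adapter running on an event loop, the blocking sleep would presumably have to be replaced by a timer.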

I would prefer solution 2 here. I would rather have the (I guess small) overhead of creating the link each time instead of having to deal with delays or retries.
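The open/send/close lifecycle of solution 2 could look roughly like the following sketch. It is plain Java with hypothetical stand-ins (`AddressedSender`, `PerCommandLink`, `forward`), not the Hono or vertx-proton API, and it is synchronous for brevity, whereas the real link handling is asynchronous.

```java
import java.util.function.Function;

/** Hypothetical stand-in for a sender link bound to one target address. */
interface AddressedSender extends AutoCloseable {
    /** Sends the payload; returns true if the peer accepted it. */
    boolean send(String payload);

    @Override
    void close(); // narrowed: no checked exception
}

final class PerCommandLink {
    /**
     * Opens a link for the given device-specific address, sends one command
     * and closes the link again, also if sending throws.
     */
    static boolean forward(Function<String, AddressedSender> linkFactory,
                           String address, String payload) {
        try (AddressedSender sender = linkFactory.apply(address)) {
            return sender.send(payload);
        }
    }
}
```

The try-with-resources block ensures that the link is closed after every command, so the cost per forwarded command is one link attach and one detach.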