Closed DanielMaier-BSI closed 5 years ago
@calohmn can you please take a look at this?
Adding some further debug output has revealed that the Rejected
disposition has the error "Deliveries cannot be sent to an unavailable address":
Rejected{error=Error{condition=amqp:not-found, description='Deliveries cannot be sent to an unavailable address', info=null}
(I've improved log output since to always reveal this, see https://github.com/eclipse/hono/commit/7aa91cd29f3d16d0f28033c3de3bade16e09b753.)
What seemingly happened here: The DelegatedCommandSenderImpl that forwards the received command on the device-specific command address uses a sender link on the anonymous relay (so that one link can be used for multiple tenants/devices). That means the sender doesn't get credits for the particular address to be used here and so it isn't assured that there actually is a receiver for the device-specific command address. The Qpid error message above indicates just that; no receiver on that address was available.
At first sight, this doesn't seem possible, because we always create a device-specific command receiver link first, before the application gets the indication that a device can receive a command. But here the fact comes into play that a network of multiple Qpid dispatch router instances is used in this case.
So, the steps were: The receiver link gets created in connection with Qpid instance A. Then the command message should get sent on the anonymous relay in connection with Qpid instance B. The information about the receiver link wasn't propagated in the router network to Qpid instance B yet, so an "unavailable address" is returned.
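The sequence above can be reproduced with a toy model. This is only a sketch and not Qpid Dispatch Router or Hono code: the `Router` class, its methods, and the address string are hypothetical stand-ins. It assumes each router only knows about receivers attached locally until address information has been propagated.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of a two-router network (hypothetical, for illustration only).
class RouterNetworkSketch {

    /** Stand-in for one router instance and the addresses it knows receivers for. */
    static final class Router {
        private final Set<String> knownAddresses = new HashSet<>();

        /** A receiver link for the given address is attached to this router. */
        void attachReceiver(String address) {
            knownAddresses.add(address);
        }

        /** Mimics a send on the anonymous relay: the address is only checked per message. */
        String send(String address) {
            return knownAddresses.contains(address) ? "Accepted" : "Rejected(amqp:not-found)";
        }

        /** Models the (normally asynchronous) propagation of address info to another router. */
        void propagateTo(Router other) {
            other.knownAddresses.addAll(knownAddresses);
        }
    }

    public static void main(String[] args) {
        Router a = new Router();
        Router b = new Router();
        String address = "command/tenant/device-1"; // hypothetical device-specific address

        a.attachReceiver(address);             // receiver link created via instance A
        System.out.println(b.send(address));   // sent via B before propagation: Rejected(amqp:not-found)

        a.propagateTo(b);                      // address info reaches instance B
        System.out.println(b.send(address));   // sent via B after propagation: Accepted
    }
}
```

In the model, the rejection depends only on whether the send reaches instance B before or after propagation, which matches the observation that the error is timing dependent.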
See also this discussion on the Qpid mailing list.
To fix this, I see these options:
With solution 1, the question is how to choose a delay that is small enough not to slow down the whole command processing too much (this could be noticeable especially in the HTTP case) but large enough to ensure Qpid has propagated the receiver info to all instances.
With solution 3, the question is which delay period and which number of retries to choose.
I would prefer solution 2 here. I would rather have the (I guess small) overhead of creating the link each time instead of having to deal with delays or retries.
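For comparison, the two parameters that the retry approach would need (delay period and number of retries) can be seen in a minimal sketch. Everything here is hypothetical: `sendWithRetry`, the `Outcome` enum, and the simulated peer are not part of Hono. The sketch also blocks with `Thread.sleep` for simplicity, which code running on an event loop would have to replace with a timer.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a bounded retry with a fixed delay; not Hono code.
class RetrySendSketch {

    /** Hypothetical outcome of a single send attempt. */
    enum Outcome { ACCEPTED, REJECTED }

    /**
     * Tries once, then retries up to maxRetries times while the attempt is rejected,
     * waiting delayMillis before each retry.
     */
    static Outcome sendWithRetry(Supplier<Outcome> sendAttempt, int maxRetries, long delayMillis) {
        Outcome outcome = sendAttempt.get();
        for (int retry = 0; outcome == Outcome.REJECTED && retry < maxRetries; retry++) {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status and give up
                return outcome;
            }
            outcome = sendAttempt.get();
        }
        return outcome;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Simulated peer: rejects the first two attempts, then accepts.
        Supplier<Outcome> flakySend =
                () -> ++attempts[0] < 3 ? Outcome.REJECTED : Outcome.ACCEPTED;

        Outcome result = sendWithRetry(flakySend, 3, 50);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

The worst-case added latency is `maxRetries * delayMillis`, which is the trade-off raised above: both values have to be guessed without knowing how long propagation in the router network takes.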
We get the following error in DelegatedCommandSender very frequently in our automated system tests (quote from tracing):
{ "key": "error.object", "type": "string", "value": "org.eclipse.hono.client.ServerErrorException: peer did not settle message, failing delivery" }
Corresponding log statement:
peer did not settle message [message ID: DelegatedCommandSenderImpl-304, remote state: Rejected], failing delivery
Our test does the following:
This sequence runs multiple times for the same device, and we have multiple MQTT adapter instances running, i.e. it is very likely that the device is connected to different adapters during the tests.
If we add a 1 second sleep between receiving the event that signals that the device is ready to receive commands and sending the command, we do not get this error.
What do you think: can this be related to a timing issue in Hono, or is it more likely related to our messaging infrastructure?