amenzhinsky / iothub

Azure IoT Hub SDK for Golang
MIT License

AMQP Link Detach #76

Open alexg-axis opened 1 year ago

alexg-axis commented 1 year ago

I have an issue where I'm unable to publish events. Unfortunately, I can't pin down the circumstances any further: it has happened a few times, but in most cases publishing works as expected.

In essence the code works as follows:

ctx := context.Background()
message := []byte("Hello, World!")
expiry := 10 * time.Minute
deviceID := "some-device"

// Send the event, requesting a full acknowledgement and a 10-minute expiry.
if err := client.SendEvent(
	ctx,
	deviceID,
	message,
	iotservice.WithSendAck(iotservice.AckType("full")),
	iotservice.WithSendExpiryTime(time.Now().Add(expiry)),
); err != nil {
	return err
}

The error is the following:

link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: Server Busy. Please retry operation, Info: map[]}

The Java SDK seems to have this comment regarding the error:

/**
 * An operator intervened to detach for some reason.
 */
LINK_DETACH_FORCED("amqp:link:detach-forced"),

Same with the JS one: https://github.com/Azure/amqp-common-js/blob/master/lib/errors.ts#L171.

So it seems this error can simply occur from time to time. A restart has always resolved it for me, so I assume one way to handle it is to reconnect the client.
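A minimal sketch of that reconnect idea, under stated assumptions: `sender`, `redial`, `fakeSender` and `runDemo` are hypothetical names invented for illustration, not part of the iothub package, and matching on the error text is only a stand-in for inspecting the real AMQP error type. The loop resends after building a fresh client whenever the send fails with `amqp:link:detach-forced`, and returns any other error untouched.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// sender is a hypothetical stand-in for the one client method used in this thread.
type sender interface {
	SendEvent(ctx context.Context, deviceID string, payload []byte) error
}

// isDetachForced reports whether err mentions the condition from this issue.
// Matching on the message text is an assumption made to keep the sketch
// self-contained; real code would inspect the AMQP library's error type.
func isDetachForced(err error) bool {
	return err != nil && strings.Contains(err.Error(), "amqp:link:detach-forced")
}

// sendWithReconnect sends once and, on a forced detach, waits for backoff,
// asks redial for a fresh client and tries again, up to maxAttempts sends.
// It returns the client that should be used from now on.
func sendWithReconnect(ctx context.Context, c sender, redial func() (sender, error),
	deviceID string, payload []byte, maxAttempts int, backoff time.Duration) (sender, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = c.SendEvent(ctx, deviceID, payload); err == nil {
			return c, nil
		}
		if !isDetachForced(err) {
			return c, err // a different failure: do not mask it with retries
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return c, ctx.Err()
		case <-time.After(backoff):
		}
		fresh, dialErr := redial()
		if dialErr != nil {
			return c, fmt.Errorf("reconnect failed: %w", dialErr)
		}
		c = fresh
	}
	return c, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// fakeSender fails with a forced detach until the shared counter reaches zero.
type fakeSender struct{ failures *int }

func (f fakeSender) SendEvent(context.Context, string, []byte) error {
	if *f.failures > 0 {
		*f.failures--
		return errors.New("link detached, reason: *Error{Condition: amqp:link:detach-forced}")
	}
	return nil
}

// runDemo simulates one forced detach and reports how many reconnects happened.
func runDemo() (dials int, err error) {
	failures := 1
	redial := func() (sender, error) {
		dials++
		return fakeSender{&failures}, nil
	}
	_, err = sendWithReconnect(context.Background(), fakeSender{&failures}, redial,
		"some-device", []byte("Hello, World!"), 3, time.Millisecond)
	return dials, err
}

func main() {
	dials, err := runDemo()
	fmt.Println("reconnects:", dials, "error:", err)
}
```

Returning the replacement client lets the caller keep using the fresh connection for later sends instead of redialing every time.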

alexg-axis commented 1 year ago

It seems to happen on a weekly basis. That could mean Azure enforces some sort of 7-day timeout, and that we should gracefully reconnect when it occurs.

alexg-axis commented 1 year ago

Some information from the Python library's troubleshooting guide:

https://github.com/Azure/azure-sdk-for-python/blob/a7ec3bca94251b6a73de347112d4a77e6e615ccc/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md?plain=1#L32

All Event Hubs exceptions are wrapped in an [EventHubError][EventHubError]. They often have an underlying AMQP error code which specifies whether an error should be retried. For retryable errors (ie. amqp:connection:forced or amqp:link:detach-forced), the client libraries will attempt to recover from these errors based on the [retry options][AmqpRetryOptions] specified when instantiating the client. To configure retry options, follow the sample [Client Creation][ClientCreation]. If the error is non-retryable, there is some configuration issue that needs to be resolved.
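The retryable/non-retryable split described in that quote could be sketched in Go roughly as follows. `conditionError` is a hypothetical stand-in for whatever error type the AMQP library actually returns, and the retryable set contains only the two conditions the quote names; both are assumptions for illustration rather than the library's API.

```go
package main

import (
	"errors"
	"fmt"
)

// conditionError is a hypothetical stand-in for the AMQP library's error type.
type conditionError struct {
	Condition   string
	Description string
}

func (e *conditionError) Error() string {
	return fmt.Sprintf("link detached, reason: %s: %s", e.Condition, e.Description)
}

// retryableConditions holds only the two conditions named in the quoted guide.
var retryableConditions = map[string]bool{
	"amqp:connection:forced":  true,
	"amqp:link:detach-forced": true,
}

// isRetryable unwraps err and reports whether it carries a retryable condition.
func isRetryable(err error) bool {
	var ce *conditionError
	if errors.As(err, &ce) {
		return retryableConditions[ce.Condition]
	}
	return false
}

// retryableCondition wraps a condition the way a caller might and classifies it.
func retryableCondition(condition string) bool {
	err := fmt.Errorf("send event: %w", &conditionError{Condition: condition})
	return isRetryable(err)
}

func main() {
	for _, c := range []string{
		"amqp:link:detach-forced",
		"amqp:connection:forced",
		"amqp:unauthorized-access",
	} {
		fmt.Printf("%-26s retryable=%v\n", c, retryableCondition(c))
	}
}
```

Using `errors.As` rather than comparing strings means the check still works when the error has been wrapped with extra context on its way up the call stack.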

alexg-axis commented 1 year ago

We believe the following code is the cause: once a link is detached, nothing retries to get a session and link going again.

https://github.com/amenzhinsky/iothub/blob/master/iotservice/client.go#L171-L189

Note how, when putting a token fails, we just return and never try again. Likely, we then become unauthorized, get kicked by the server, and the link is detached.
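A rough sketch of a refresh loop that keeps going after a failed token put, rather than returning. This is not the code at the link above: `keepTokenFresh`, `putToken`, `onError` and the intervals are hypothetical names and values chosen for illustration, and the real fix may also need to re-establish the session and link.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// keepTokenFresh calls putToken every interval until ctx is cancelled.
// When putToken fails, the error is reported through onError and the next
// attempt is scheduled after retryDelay instead of abandoning the loop.
func keepTokenFresh(ctx context.Context, putToken func(context.Context) error,
	interval, retryDelay time.Duration, onError func(error)) {
	timer := time.NewTimer(interval)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
		}
		next := interval
		if err := putToken(ctx); err != nil {
			onError(err)
			next = retryDelay // try again soon rather than giving up
		}
		timer.Reset(next) // safe: the timer channel was just drained
	}
}

// runDemo fails the first put, then succeeds, and stops the loop at the third call.
func runDemo() (calls, failures int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	putToken := func(context.Context) error {
		calls++
		if calls == 1 {
			return errors.New("put token: temporary failure")
		}
		if calls == 3 {
			cancel()
		}
		return nil
	}
	keepTokenFresh(ctx, putToken, time.Millisecond, time.Millisecond,
		func(error) { failures++ })
	return calls, failures
}

func main() {
	calls, failures := runDemo()
	fmt.Println("token puts:", calls, "failures:", failures)
}
```

In practice `retryDelay` would be much shorter than `interval`, so a failed put is retried well before the previous token expires.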