Azure / azure-iot-arduino

Azure IoT library for the Arduino
Other
168 stars 95 forks source link

Gracefully reconnect after sas-url expires #60

Closed Resultfactory closed 6 years ago

Resultfactory commented 7 years ago

We noticed the SAS_TOKEN_DEFAULT_LIFETIME set to an hour. This causes the system to be down for 4 minutes every hour times SAS_REFRESH_MULTIPLIER 0.8.

-> 10:55:08 DISCONNECT  69129107                // is 19.2 hours = 24 * 0.8
-> 10:55:08 CONNECT 69129377
timed out waiting for CONNACK   69361077
-> 10:59:09 DISCONNECT  69370097
-> 10:59:10 CONNECT 69371437
<- 10:59:10 CONNACK 69371457                // 4 minutes later

DEFAULT_MQTT_KEEPALIVE is set to 4 60 SAS_TOKEN_DEFAULT_LIFETIME is set to 3600 24 SAS_REFRESH_MULTIPLIER 0.8

We changed the lifetime to 24 hours, but we would like to gracefully refresh the sas token and reconnect. Preferably without the 4 minutes downtime, caused by the DEFAULT_MQTT_KEEPALIVE? Any suggestions?

mamokarz commented 7 years ago

Please Roy, can you investigate.

bgshankars commented 7 years ago

Is there any way to override the default sas token default lifetime, what would I need to do, incase of reducing the lifetime to 30s.

Resultfactory commented 7 years ago

@bgshankars You can change the defines: SAS_TOKEN_DEFAULT_LIFETIME or DEFAULT_SAS_TOKEN_EXPIRY_TIME_SECS We are not sure which is used, probably the define in the transport you are using.

Resultfactory commented 7 years ago

We did some additional digging, and it seems the refreshing of the Sas-url is not the problem. When a disconnect is done after the token lifetime expires, the socket or tls does not function correctly. The connection seems valid but the tls handshake completes too quickly. A connect message is sent over the seemingly valid connection and generates a timeout after 4 minutes. The connect after the timed out connection does succeed.

Log:

D (29626) tlsio_mbedtls AzureIoTHubTask: Handshake completed in 1050, on_underlying_io_open_complete
D (29626) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_send
-> 11:28:51 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: XXX.azure-devices.net/XXX/api-version=2016-11-14&DeviceClientType=iothubclient%2f1.1.21%20(esp32) | PWD: XXXX | CLEAN: 0
D (29656) tlsio_mbedtls AzureIoTHubTask: leaving tlsio_mbedtls_open with 0
<- 11:28:51 CONNACK | SESSION_PRESENT: true | RETURN_CODE: 0x0

<- 11:32:49 PUBACK | PACKET_ID: 8
D (318596) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_send
D (318596) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_close
D (318596) socketio_lwip AzureIoTHubTask: Socket Closed in socketio_close
-> 11:33:40 DISCONNECT
D (318706) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_open
D (318766) socketio_lwip AzureIoTHubTask: Opened socketio_open:484
D (318776) tlsio_mbedtls AzureIoTHubTask: on_underlying_io_open_complete
D (318776) tlsio_mbedtls AzureIoTHubTask: on_underlying_io_open_complete
D (318776) tlsio_mbedtls AzureIoTHubTask: Handshake completed in 0, on_underlying_io_open_complete
D (318786) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_send
-> 11:33:40 CONNECT | VER: 4 | KEEPALIVE: 240 | FLAGS: 192 | USERNAME: XXX.azure-devices.net/XXX/api-version=2016-11-14&DeviceClientType=iothubclient%2f1.1.21%20(esp32) | PWD: XXXX | CLEAN: 0
D (318826) tlsio_mbedtls AzureIoTHubTask: leaving tlsio_mbedtls_open with 0
I (328256) Statistics: [05-10-17 11:33:49Z] Uptime: 000.00:05:00    IPv4: 10.0.2.252    Free Heap: 66084    Temperature: 67
I (388256) Statistics: [05-10-17 11:34:49Z] Uptime: 000.00:06:00    IPv4: 10.0.2.252    Free Heap: 65828    Temperature: 66
I (448256) Statistics: [05-10-17 11:35:49Z] Uptime: 000.00:07:00    IPv4: 10.0.2.252    Free Heap: 65772    Temperature: 66
I (508256) Statistics: [05-10-17 11:36:49Z] Uptime: 000.00:08:00    IPv4: 10.0.2.252    Free Heap: 65520    Temperature: 66
Error: Time:Thu Oct  5 11:37:41 2017 File:C:/IoT/BEASTv3/components/azure_iot/iothubtransport_mqtt_common.c Func:InitializeConnection Line:1583 mqtt_client timed out waiting for CONNACK
D (559636) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_send
D (559646) tlsio_mbedtls AzureIoTHubTask: tlsio_mbedtls_close
D (559646) socketio_lwip AzureIoTHubTask: Socket Closed in socketio_close
-> 11:37:41 DISCONNECT
tameraw commented 7 years ago

@JetstreamRoySprowl - Any update on this one?

ewertons commented 7 years ago

Hi @Resultfactory ,

the SAS token refresh logic design did not intend to cause any long down time. For MQTT, the disconnection and reconnection using a new SAS token is necessary due to the protocol constraints. The four-minute timeout you are seeing should not be related to the SAS token refresh logic. It does seem that the connection itself had some problem, and the reconnection packet timedout being sent, that is probably what caused the issue.

To assist better, could you:

  1. Elaborate more why you believe the SAS token logic might be influencing the down time (so we make sure we understand how you got to that conclusion, and to ensure we don't miss your point);
  2. Re-run your experiment and repro the issue, so we can check the service-side logs why the timeout happened (please send us the iothub fqdn and device ID as well)

Thanks, Azure IoT Team

Resultfactory commented 7 years ago

@ewertons in our last post we explain why we do not think the sas logic is at fault anymore. We confirmed that the sas-url is generated again at that time, but we still get the 4 minute downtime.

Our issue of not being able to reconnect only occurs every 19 hours for 4 minutes.

Our scenario:

  1. Startup. We set up the ssl connection, send a CONNECT message and receive a CONNACK.
  2. After ~19 hours we receive a DISCONNECT and the ssl and underlying socket are closed. A new ssl and socket are created. The ssl handshake completes in 0ms, a CONNECT message is sent.
  3. 4 minutes later the connection is timed out.
  4. a new ssl and socket are set up, handshake completes(in more than 0ms), a CONNECT message is sent and CONNACK received.

Is there a email address we can send the iothub fqdn and deviceid to?

Resultfactory commented 6 years ago

Note We don't have the 4 minute connection timeout anymore since the update. Our xio still does get destroyed and recreated, is this the expected behavior?

JetstreamRoySprowl commented 6 years ago

The SDK may be getting upset by this issue. I hope to have a fix for that problem in the next Arduino release.

JetstreamRoySprowl commented 6 years ago

The 1.0.44 Azure IoT libraries now have the fix for the retrieveoptions problem.