aws-amplify / aws-sdk-ios

AWS SDK for iOS. For more information, see our web site:
https://aws-amplify.github.io/docs
Other
1.68k stars 885 forks source link

AWSIoT reconnecting issues #965

Closed mattczech closed 6 years ago

mattczech commented 6 years ago

To help us solve your problem better, please answer the following list of questions.

AWSIoT

2.6.21

Actual devices

No

No

CocoaPods

After reconnecting from an offline state I will occasionally see an error "WebSocket closed with code:1001 with reason:Stream end encountered" show up in the logs. When this happens, I receive no notification that the connection status has changed (AWSIoTMQTTStatus == .connected) and new messages that are sent to subscribed topics are not received by the app. I would expect that after this occurs a reconnect would be attempted to restore the connection.

2018-06-18 16:32:15:975 [13892:4724335] [SR] NSStreamEventHasBytesAvailable <__NSCFInputStream: 0x1d4300bd0>
2018-06-18 16:32:15:975 [13892:4724335] [SR] NSStreamEventEndEncountered <__NSCFInputStream: 0x1d4300bd0>
2018-06-18 16:32:15:975 [13892:4724335] [SR] NSStreamEventEndEncountered <__NSCFOutputStream: 0x1d411fe30>
2018-06-18 16:32:15:975 [13892:4724335] [SR] using _delegateDispatchQueue.
2018-06-18 16:32:15:976 [13892:4724338] WebSocket closed with code:1001 with reason:Stream end encountered
2018-06-18 16:32:15:976 [13892:4724840] -[MQTTDecoder stream:handleEvent:] [Line 55] EventCode:2, stream: <__NSCFInputStream: 0x1d011e720>, Thread: <NSThread: 0x1d0878040>{number = 36, name = (null)}
2018-06-18 16:32:15:976 [13892:4724840] -[MQTTDecoder stream:handleEvent:] [Line 55] EventCode:16, stream: <__NSCFInputStream: 0x1d011e720>, Thread: <NSThread: 0x1d0878040>{number = 36, name = (null)}
2018-06-18 16:32:15:976 [13892:4724840] -[MQTTSession decoder:handleEvent:] [Line 331] eventCode:1
2018-06-18 16:32:15:976 [13892:4724840] MQTT session error, code: 2
2018-06-18 16:32:15:976 [13892:4724840] closing encoder stream.
2018-06-18 16:32:15:976 [13892:4724840] closing decoder stream.
2018-06-18 16:32:16:059 [13892:4724674] ClockTick:24: nothing to republish
2018-06-18 16:32:16:981 [13892:4724840] MQTTSessionDelegate handleEvent: 2
2018-06-18 16:32:16:981 [13892:4724840] MQTT session closed.
2018-06-18 16:32:17:059 [13892:4724674] ClockTick:25: nothing to republish
2018-06-18 16:32:18:055 [13892:4724674] ClockTick:26: nothing to republish
2018-06-18 16:32:19:059 [13892:4724674] ClockTick:27: nothing to republish
2018-06-18 16:32:20:059 [13892:4724674] ClockTick:28: nothing to republish
2018-06-18 16:32:21:058 [13892:4724674] ClockTick:29: nothing to republish
2018-06-18 16:32:22:055 [13892:4724674] ClockTick:30: nothing to republish
2018-06-18 16:32:23:058 [13892:4724674] ClockTick:31: nothing to republish
2018-06-18 16:32:24:056 [13892:4724674] ClockTick:32: nothing to republish
2018-06-18 16:32:25:056 [13892:4724674] ClockTick:33: nothing to republish
2018-06-18 16:32:26:057 [13892:4724674] ClockTick:34: nothing to republish
2018-06-18 16:32:27:057 [13892:4724674] ClockTick:35: nothing to republish
2018-06-18 16:32:28:058 [13892:4724674] ClockTick:36: nothing to republish
2018-06-18 16:32:29:058 [13892:4724674] ClockTick:37: nothing to republish
2018-06-18 16:32:30:059 [13892:4724674] ClockTick:38: nothing to republish
2018-06-18 16:32:31:059 [13892:4724674] ClockTick:39: nothing to republish
2018-06-18 16:32:32:058 [13892:4724674] ClockTick:40: nothing to republish
2018-06-18 16:32:33:059 [13892:4724674] ClockTick:41: nothing to republish

Looking further in the logs, it seems like this might be caused by attempting to reconnect after the connection has already been established. Here you can see that a message was just received and then it immediately attempts to reconnect, even though there is no error indicating that there was anything wrong with the connection.

2018-06-18 16:32:14:220 [13892:4724335] Websocket didReceiveMessage: Received 129 bytes
2018-06-18 16:32:14:220 [13892:4724674] -[MQTTDecoder stream:handleEvent:] [Line 55] EventCode:2, stream: <__NSCFInputStream: 0x1d411f920>, Thread: <NSThread: 0x1d4863240>{number = 32, name = (null)}
2018-06-18 16:32:14:220 [13892:4724674] -[MQTTSession decoder:newMessage:] [Line 351] messageType=3, status=2
2018-06-18 16:32:14:220 [13892:4724674] MQTTSession- newMessage msg type is 3
2018-06-18 16:32:14:220 [13892:4724674] -[MQTTSession handlePublish:] [Line 462] 
2018-06-18 16:32:14:220 [13892:4724674] <<<NSThread: 0x1d4863240>{number = 32, name = (null)}>>Topic: f988a889-a5a9-4736-97df-4d48a029ea9f is matched.
2018-06-18 16:32:14:220 [13892:4724674] <<<NSThread: 0x1d4863240>{number = 32, name = (null)}>>topicModel.callback.
2018-06-18 16:32:14:535 [13892:4724446] Attempting to reconnect.
2018-06-18 16:32:14:535 [13892:4724446] AWSIoTMQTTClient: connecting via websocket. 
2018-06-18 16:32:14:556 [13892:4724446] username is : ?SDK=iOS&Version=2.6.19

2018-06-18 16:32:14:556 [13892:4724446] Metrics collection is: Enabled
2018-06-18 16:32:14:557 [13892:4724446] Websocket is created and opened.

I'm able to regularly reproduce this issue by putting my device in airplane mode, turning off wifi, waiting a few seconds or minutes and then turning wifi back on.

scb01 commented 6 years ago

@mattczech Thank you for reporting this issue. I will look into this and get back to you.

aat2703 commented 6 years ago

Any news regarding this issue, seems like there still is issues with reconnecting logic?

scb01 commented 6 years ago

@mattczech, @aat2703

Just a quick update. I have been able to reproduce the issue on my side and am continuing to work on it. Will post an update once I finish up my analysis.

aat2703 commented 6 years ago

Cool @cbommas

Could this issue have any relation to https://github.com/aws/aws-sdk-ios/issues/972 ?

scb01 commented 6 years ago

@aat2703

Yep, these are related. I have a fix that I am testing on my side and am hoping to include in the next rev of the SDK.

aat2703 commented 6 years ago

@cbommas thats what i thought hehe.. Will you loop back when you have an scheduled release date with this bug fix? 👍

scb01 commented 6 years ago

Absolutely. I will circle back on this thread once the fix is out.

aat2703 commented 6 years ago

@cbommas do you have an eta for the next rev of the SDK?

scb01 commented 6 years ago

@aat2703 @mattczech

Version 2.6.23 was released yesterday ( july 5) and contains the fixes for this issue. Please upgrade and let me know if you are still running into issues.

mattczech commented 6 years ago

@cbommas Thanks for looking into this and for the quick turn-around! v2.6.23 seems much more stable when disconnecting/reconnecting.

I'm still seeing an issue where the AWSIoTMQTTClient will attempt to reconnect when the connection status is connected.

Steps to reproduce:

I added a debug statement to [AWSIoTMQTTSession reconnectToSession] that prints out the current AWSIoTMQTTStatus. You can see in the log below Attempting to reconnect. status=2, which corresponds to AWSIoTMQTTStatusConnected. Why would it reconnect if the session is already connected?

2018-07-12 12:25:41:084 [19005:4156973] Websocket didReceiveMessage: Received 5 bytes
2018-07-12 12:25:41:084 [19005:4157015] -[MQTTDecoder stream:handleEvent:] [Line 55] EventCode:2, stream: <__NSCFInputStream: 0x1cc1033c0>, Thread: <NSThread: 0x1cc07b380>{number = 19, name = (null)}
2018-07-12 12:25:41:084 [19005:4157015] -[MQTTSession decoder:newMessage:] [Line 361] messageType=9, status=2
2018-07-12 12:25:41:084 [19005:4157015] MQTTSession- newMessage msg type is 9
2018-07-12 12:25:41:084 [19005:4157015] -[MQTTSession handleSuback:] [Line 444] 
2018-07-12 12:25:41:084 [19005:4157015] Sub Ack messageId 12
2018-07-12 12:25:41:084 [19005:4157015] MQTTSessionDelegate new ack for msgId: 12
2018-07-12 12:25:41:939 [19005:4157015] ClockTick:21: nothing to republish
2018-07-12 12:25:42:487 [19005:4157013] Attempting to reconnect. status=2
2018-07-12 12:25:42:487 [19005:4157013] AWSIoTMQTTClient: connecting via websocket. 
2018-07-12 12:25:42:487 [19005:4156973] [SR] Closing with code 1000 reason (null)
2018-07-12 12:25:42:488 [19005:4156967] [SR] NSStreamEventHasSpaceAvailable <__NSCFOutputStream: 0x1c01045c0>
2018-07-12 12:25:42:536 [19005:4157013] username is : ?SDK=iOS&Version=2.6.19
2018-07-12 12:25:42.536720-0500 [19005:4156973] [MQTT] MQTT connection status changed Connecting

This seems to be caused by a race condition of detecting multiple error conditions around the same time that each initiating a reconnect:

It seems like this could be resolved by adding a check on mqttStatus in [AWSIoTMQTTClient reconnectToSession] but I'm open to any insight you have.

aat2703 commented 6 years ago

Our app is currently on its way to the App Store. I'll request feedback from our customers once it has made its way out, and will circle back on this thread 👍

scb01 commented 6 years ago

@mattczech Thank you for your analysis on why the extra connection attempt is happening. I agree with your hypothesis and applied the fix you suggested. It has made it more stable and so far in my testing I haven't seen the extra connect. I will continue to test - Also, I will get this included in the next rev of the SDK. Thank you for your help in making this library better! 👍

scb01 commented 6 years ago

@mattczech, @aat2703

Rev 2.6.25 released on 7/18 has the fixes for the connectionAge timer and the fix for the extraneous connect. Can you upgrade to the latest and let us know how it goes?

aat2703 commented 6 years ago

What implications did the "connectionAge timer" bug have?

scb01 commented 6 years ago

The IoT client implements an exponential backoff strategy for connection retries. The initial value of the reconnectTime is set to baseReconnectTimeInterval set in the MQTTConfiguration object.

When the IoT client detects that it has lost the connection to the server, it will wait until the reconnectTime to try to reconnect. If the reconnect is unsuccessful, it will double the reconnectTime and wait before reconnecting. The doubling will happen until the maximumReconnectTimeInterval is reached.

From that point on the client will keep trying to reconnect once every reconnectTime seconds ( the reconnectTime will be pinned tomaximumReconnectTimeInterval seconds)

When the IoT client is able to connect successfully, it will wait for minimumConnectionTimeInterval before it resets the reconnectTime back to the baseReconnectTimeInterval.

The bug essentially was that the reconnectTime wasn't being reset. The implication is that if a client gets disconnected, the reconnectTime will max out and never be reset; which will mean that the client will only retry at the maximumReconnectTimeInterval. To the end user, it can appear that the app is taking a "long" time to reconnect ("long" is based on what the max is set to).

karasahinemre commented 6 years ago

In v2.6.25 we are seeing this log messages too.

ClockTick:64: nothing to republish

scb01 commented 6 years ago

@karasahinemre

Looks like you've set the logs at the debug level. You can eliminate these log messages by setting the log level of info or higher. I will set these messages to verbose in a future release.

scb01 commented 6 years ago

@karasahinemre

The change to move the messages to verbose will be included in the next rev of the SDK.

scb01 commented 6 years ago

Closing this issue as we are tracking updates on #1002