adafruit / Adafruit_IO_Arduino

Arduino library to access Adafruit IO from WiFi, cellular, and ethernet modules.
Other
207 stars 108 forks source link

io.run() hangs if WiFi disconnects #99

Closed sellensr closed 4 years ago

sellensr commented 4 years ago

My sketch has been saving data to Adafruit IO without problems for hours, then stops for no apparent reason. Restart and runs fine again. Hypothesis: WiFi drops connection for some reason. I reproduced that failure mode by explicitly disconnecting the WiFi after the sketch runs for 30 seconds with this code at the top of my loop():

#define P Serial.print

void loop() {
  if(millis()%60000 > 30000) WiFi.disconnect();
  delay(3000);
  P("\nTop of Loop!\nio.status: "); P(io.status()); P(": "); P(io.statusText()); P("\n"); 
  P("io.networkStatus: "); P(io.networkStatus()); P("\n"); 
  P("io.mqttStatus: "); P(io.mqttStatus()); P("\n"); 
  P("WiFi.status: "); P(WiFi.status()); P("\n"); 
  P("Calling io.run()...");
  // process messages and keep connection alive
  io.run();     
  P("Done\n");
  ..........

Resulting output shows io.status(), etc. detect the problem, but io.run() just goes away and never comes back:

08:45:55.738 -> 
08:45:55.738 -> Top of Loop!
08:45:55.738 -> io.status: 21: Adafruit IO connected.
08:45:55.738 -> io.networkStatus: 20
08:45:55.738 -> io.mqttStatus: 21
08:45:55.738 -> WiFi.status: 3
08:45:55.738 -> Calling io.run()...Done
08:45:59.085 -> 
08:45:59.085 -> Top of Loop!
08:45:59.085 -> io.status: 1: Network disconnected.
08:45:59.085 -> io.networkStatus: 1
08:45:59.085 -> io.mqttStatus: 2
08:46:19.077 -> WiFi.status: 6
08:46:19.077 -> Calling io.run()...

Following the library code down into Adafruit_MQTT finds some code to reconnect the mqtt, but I don't see anything to repair or respond to a bad WiFi connection, so I think it just hangs on line 171 in AdafruitIO.cpp: while(mqttStatus() != AIO_CONNECTED){} The obvious immediate fix is to test for a connection before calling io.run(), or fix the library to put a timeout in io.run() that will allow it to return instead of hanging, which still leaves the problem of how to recover from the WiFi drop. Calling io.connect() to reinitialize seems to rebuild the WiFi connection and allow the process to resume OK, but I don't know if that will generate memory leaks and lead to trouble if the sketch runs for months. So, before proposing changes I have questions for somebody with a better in-depth knowledge of the whole library:

  1. Should io.run() have a timeout and leave it to the calling sketch to monitor status?
  2. Should io.run() detect the network failure and rerun io.connect() automatically in the same way the mqtt library repairs that connection?
  3. Should there be an io.disconnect() (analogous to WiFi.disconnect) to explicitly shut down the connection and release any resources that have been acquired?

Thoughts?

Rick

sellensr commented 4 years ago

So far guarding the io.run() with if(io.status() >= AIO_NET_CONNECTED) and testing for connection has succeeded through more than 5200 forced post / disconnect / io.connect events without failing, so if there's a memory leak, it's not enormous. Checking the distance between the heap and the stack during execution backs this up. mqttStatus() takes about 20 seconds to timeout when the network has dropped, so a similar timeout duration on the io.run() call shouldn't break anybody's expectations, although failing quickly would be better.

sellensr commented 4 years ago

Allowing io.run() to return, even when throttled, before there is an MQTT connection may cause some code to fail if it doesn't check status during the loop. The likely consequence is making multiple attempts to save data while the MQTT connection is down, while not being aware that it's down. The code would continue to loop, dealing with other business, and then resume saving data once the MQTT connection came back up. It's worth noting that the save() method returns true for successfully sending a packet, which is not necessarily a successful save.

brentru commented 4 years ago

@sellensr Thanks for taking this issue on, I have a few comments before we move to the PR. Lots of devices run AIOA and I'd like to make sure this change is 100% backwards-compatible with pre-existing code and examples before we merge/release.

The likely consequence is making multiple attempts to save data while the MQTT connection is down, while not being aware that it's down.... It's worth noting that the save() method returns true for successfully sending a packet, which is not necessarily a successful save.

Ideally, io.run should handle the network re-connection, it would not be handled from user-code. The save method might need to be modified to indicate if a save was successful or not (this would return in user code)

Should io.run() detect the network failure and rerun io.connect() automatically in the same way the mqtt library repairs that connection?

Yes, especially because io.run() executes at the top of the loop.

Should there be an io.disconnect() (analogous to WiFi.disconnect) to explicitly shut down the connection and release any resources that have been acquired?

Yes, that'd be great as a separate PR. We do something similar in Adafruit IO CircuitPython, which also manages the WiFi connection.

sellensr commented 4 years ago

Thanks for the feedback! If the target is 100% back compatibility, then we would have to leave mqtt timing out if a network connection has failed, but we could detect the network failure in run() and repair it before calling on mqtt. That would make the timeout issue less likely.

I'm less convinced now about io.disconnect(), as WiFi.disconnect() would appear to be all it would need to do.

I'm not sure save needs a confirmation of successful receipt, nor that it would be easy to implement. Applications that really want to verify receipt can subscribe to complete the loop.

I'll try a version with io.run doing a network recovery and resubmit the pull request.

brentru commented 4 years ago

, but we could detect the network failure in run() and repair it before calling on mqtt. That would make the timeout issue less likely.

This is exactly how I'd like it to work if the network connection is dropped.

Keep in mind, this library works with WiFi/Ethernet/FONA hardware too, not just WiFi

sellensr commented 4 years ago

All the example code starts out with io.connect() then explicitly watching io.status() until the connection succeeds, which could take quite a while. It prints dots to let you know it is running. Then it goes on to io.run() in every loop() and just hangs if there is ever a network disconnect. Just as io.connect() returns before the connection succeeds, io.run() should probably return to allow the calling program to take appropriate action, which could vary with the application, and might include e.g. warnings to the user that they need to check the network hardware.

Letting io.run() return on a network failure won't break any existing code because that code would have just hung on a network failure.

Changing io.run() to return a status code instead of void won't break any existing code, and may encourage people to check that status code and do something about it.

I'm still concerned about the potential hang in io.run() if mqttStatus fails to reconnect. There should be a timeout on that while loop.

sellensr commented 4 years ago

The PR now has what I think is a minimum change to prevent the hanging behaviour. If that's accepted, then we can think about how best to have it repair the connection. I'm wondering about a failFast optional argument set to false by default. It would make things return faster and mean the caller would have to take responsibility for status checking and remediation. It could be passed on to mqttStatus() as well.

brentru commented 4 years ago

. If that's accepted, then we can think about how best to have it repair the connection. I'm wondering about a failFast optional argument set to false by default. It would make things return faster and mean the caller would have to take responsibility for status checking and remediation.

Good idea - lets work on the current PR first and then move on to adding a disconnect function next.

brentru commented 4 years ago

Patched in master and released in v3.3.0. @sellensr Thank you for the issue and PR!

jazzlw commented 4 years ago

I've been having a problem like this on my build but I'm on AIOA 3.4.0, so it seems like this problem should be fixed. Otherwise it's very similar though. Runs for hours before suddenly dropping off and not coming back. With reset it reconnects and continues happily.

Any Ideas on what might be still causing it or what to try fixing?

Also, I tried the phone hotspot disconnect and reconnect test mentioned above and it reconnected to that just fine...

Hardware is Metro M4 with airlift onboard.

Thanks!

sellensr commented 4 years ago

Your sketch may still need to detect the failure return code and make the reconnection.

jazzlw commented 4 years ago

Thanks for the tip @sellensr, could you be more specific though? I was under the impression from reading these that it should automatically reconnect if it's not connected during the io.run() call?

sellensr commented 4 years ago
//  if(io.status() >= AIO_NET_CONNECTED){            // don't fall into the abyss!!
    io.run();                                 // process messages and keep connection alive
//  } else {
//    P("Not Net Connected to IO!!! Restart!!!\n\n");
//    ioRestarts++;
//    setupIO();        //Start IO all over again    
//  }

This is code I was using in my loop() to detect a failure and go back to re-run io.connect. You could try something similar to locate what is causing your problem and find out if io.run() is returning. If io.run() is returning, check the value of the status code it is returning. It will try to reconnect, but may timeout on the attempt around line 300 in AdafruitIO.cpp if it is not immediately successful. If your wifi drops out for a little while you may be out of luck if you don't handle further retries.