eclipse-paho / paho.mqtt.golang


Autoreconnect not working after ping receive failure #552

Closed: moneyease closed this issue 3 years ago

moneyease commented 3 years ago

Release: 1.3.5. We are using AWS IoT Core as the broker. During our test, our client receives a few IoT notifications and publishes some stats to DynamoDB, and then this library gives out the following messages. I was hoping it would reconnect to the broker, but it looks like c.workers.Wait() is blocked. Is there any known issue when pingresp fails and autoreconnect is enabled?

Our MsgHandler code looks like the following; in particular, we see this issue when the REST API response is large.

func (c *Client) MessageHandler(client MQTT.Client, message MQTT.Message) {
    someChan <- message
}

go func() {
    for message := range someChan {
        // call other REST API (might take a few seconds)
        c.Publish(message) // ack data received; this is where pingresp is failing
    }
}()
DEBUG    17:41:37 [client]   enter Publish
DEBUG    17:41:37 [client]   sending publish message, topic: 1458884425/plugin/fw_ipsyncd_stats
DEBUG    17:41:37 [net]      obound msg to write 5
DEBUG    17:41:37 [net]      obound wrote msg, id: 5
DEBUG    17:41:37 [net]      outgoing waiting for an outbound message
DEBUG    17:41:37 [pinger]   ping check 0.379317321
DEBUG    17:41:42 [pinger]   ping check 5.379279927
DEBUG    17:41:47 [pinger]   ping check 10.379349293
DEBUG    17:41:52 [pinger]   ping check 15.379308358
DEBUG    17:41:52 [pinger]   keepalive sending ping
DEBUG    17:41:57 [pinger]   ping check 4.999789194
DEBUG    17:42:02 [pinger]   ping check 9.999907576
DEBUG    17:42:07 [pinger]   ping check 14.999851521
CRITICAL 17:42:07 [pinger]   pingresp not received, disconnecting
DEBUG    17:42:07 [client]   internalConnLost called
DEBUG    17:42:07 [client]   stopCommsWorkers called
DEBUG    17:42:07 [client]   internalConnLost waiting on workers
DEBUG    17:42:07 [client]   stopCommsWorkers waiting for workers

Config

    opts := MQTT.NewClientOptions()
    opts.AddBroker(c.BrokerEndpoint)
    opts.SetClientID(c.ClientID)
    opts.SetCleanSession(true)
    opts.SetTLSConfig(tlsConf)
    opts.SetOrderMatters(true)
    opts.SetResumeSubs(true)
    opts.SetDefaultPublishHandler(c.DefaultMessageHandler)
    opts.SetConnectionLostHandler(c.connLostHandler)
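
For reference, a minimal sketch of how the reconnect and keepalive behaviour can be made explicit on this options set (the values below are illustrative, not taken from the report above):

    opts.SetAutoReconnect(true)                   // re-dial automatically after a lost connection
    opts.SetConnectRetry(true)                    // also retry the initial connection
    opts.SetKeepAlive(30 * time.Second)           // illustrative: ping if the connection is idle this long
    opts.SetPingTimeout(10 * time.Second)         // illustrative: how long to wait for a pingresp
    opts.SetMaxReconnectInterval(1 * time.Minute) // illustrative cap on the reconnect back-off
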
MattBrittan commented 3 years ago

Please see the readme for the information needed when raising an issue. Providing a log by itself does not give enough context to track down an issue (the package has a lot of options, so without knowledge of your setup it's difficult to comment based purely on log entries), especially given that the issue may be due to something that occurred prior to the start of the included log.

My initial guess (based on no evidence, as it's not covered in the logs) is that you have a MessageHandler that is blocking (see common problems).

moneyease commented 3 years ago

Thanks for your quick input. Please review this; I believe we are not blocking the MessageHandler. Just as a side note, we see some clients never receive IoT notifications, causing them to go out of sync. Do you have any experience with such a scenario?

mqttClient.Publish(topic, 1, false, payload)
MattBrittan commented 3 years ago

Sorry, I'm going to need more than that; unless I can see the issue in the logs, or better yet have a minimal reproducible example, it's not really possible to help.

Regarding "//call other RestAPI (might take a few seconds)", from "common problems": "If you wish to perform a long-running task, or publish a message, then please use a go routine (blocking in the handler is a common cause of unexpected pingresp not received, disconnecting errors)."
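
A minimal sketch of that advice, assuming the handler hands the slow work off to a goroutine (the topic name and variable names below are illustrative, not taken from this thread):

    func (c *Client) MessageHandler(client MQTT.Client, message MQTT.Message) {
        payload := message.Payload() // copy out what is needed before the handler returns
        go func() {
            // call the slow REST API here; this no longer blocks the paho goroutine
            // that invoked the handler, so the pingresp can still be processed
            token := client.Publish("stats/ack", 1, false, payload) // illustrative topic
            token.Wait() // safe to wait here, outside the handler goroutine
        }()
    }
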

Regarding "some clients never receive IoT notifications, causing them to go out of sync": there are all kinds of things that could cause this (subscribing at QOS 0, restarts with SetCleanSession(true), broker configuration, etc.), so this is not really something I can answer. I don't personally use AWS IoT Core, but I do have a range of fairly high volume/frequency applications that run for months (in some cases years) without losing a message (there is always a possibility of bugs in the library, but I'm not currently aware of anything that would result in message loss).

moneyease commented 3 years ago

Thanks Matt. My channel on the producer side was getting full (while the consumer was busy calling the REST API), which blocked the MsgHandler and caused pings to fail. I don't need a fix now.

That said, being able to redial once the consumer yields would be a good way to recover; otherwise the client loses connectivity forever.
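
For future readers, a minimal sketch of one way to avoid the blocked-handler situation described above, using a buffered channel and a non-blocking send (the buffer size and the drop-on-full behaviour are illustrative assumptions, not something agreed in this thread):

    var someChan = make(chan MQTT.Message, 1000) // sized so a slow consumer does not fill it immediately

    func (c *Client) MessageHandler(client MQTT.Client, message MQTT.Message) {
        select {
        case someChan <- message:
            // handed off to the consumer goroutine
        default:
            // channel full: log/drop instead of blocking the handler, which would
            // otherwise stall the client and trigger "pingresp not received, disconnecting"
        }
    }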