MQTTYield with low runtime timesout on slow connections

mmarciniewski commented 6 years ago

I am experiencing problems with the MQTTYield function and the keepalive in the Embedded-C client when running it on a slow GPRS connection. The MQTTYield runtime must be long, (resulting in lower responsiveness of the system if not running in a dedicated task), else if the PING_RESP package is delayed more than one yield period, the connection is considered broken. A different approach, which solves the problem is to add a timer for the PING_RESP delay instead of/complimentary to the binary ping_outstanding flag currently in place.

theOzzieRat commented 6 years ago

I had a similar problem (but because I am using a fast cpu) where the pingresp doesn't get returned before the processor has gotten back to the keepalive code. I think the problem is that the last_received timer expires at the keepalive period (without any grace period). If you add a grace period to this timer everything appears to work correctly. My fix was to change the 2 places where the timer is set to add an extra 50% to the timer. So instead of the lines reading as: TimerCountdown(&c->last_received, c->keepAliveInterval); I changed them to TimerCountdown(&c->last_received, c->keepAliveInterval + (c->keepAliveInterval >> 1));

mmarciniewski commented 6 years ago

I also made a fix by adding a timer in the client structure for the keep-alive (called ping_resp),

typedef struct MQTTClient
{
    ...
    Timer last_sent, last_received, ping_resp;
    ...
} MQTTClient;

initializing it in the init:

void MQTTClientInit(MQTTClient* c, Network* network, unsigned int command_timeout_ms,
        unsigned char* sendbuf, size_t sendbuf_size, unsigned char* readbuf, size_t readbuf_size)
{
    ...
    TimerInit(&c->ping_resp);
    ...
}

Changing the timer in the sendPacket and readPacket function to be sent 4 seconds earlier:

static int sendPacket(MQTTClient* c, int length, Timer* timer)
{
    ...
    if (sent == length)
    {
        TimerCountdown(&c->last_sent, c->keepAliveInterval - 4); // record the fact that we have successfully sent the packet
        rc = SUCCESS;
    }
    ...
}

static int readPacket(MQTTClient* c, Timer* timer)
{
    ...
    if (c->keepAliveInterval > 0)
        TimerCountdown(&c->last_received, c->keepAliveInterval - 4); // record the fact that we have successfully received a packet
    ...
}

Then changing the keepalive function to:

int keepalive(MQTTClient* c)
{
    int rc = SUCCESS;

    if (c->keepAliveInterval == 0)
        goto exit;

    if (TimerIsExpired(&c->last_sent) || TimerIsExpired(&c->last_received))
    {
        if (c->ping_outstanding && TimerIsExpired(&c->ping_resp))
            rc = FAILURE; /* PINGRESP not received in keepalive interval */
        else if (!c->ping_outstanding)
        {
            Timer timer;
            TimerInit(&timer);
            TimerCountdownMS(&timer, 1000);
            int len = MQTTSerialize_pingreq(c->buf, c->buf_size);
            if (len > 0 && (rc = sendPacket(c, len, &timer)) == SUCCESS) { // send the ping packet
                c->ping_outstanding = 1;
        TimerCountdown(&c->ping_resp, 4);
            }
        }
    }

exit:
    return rc;
}

This solution does not violate the keep-alive interval although the previous solution will also work assuming the broker has a grace period (50% for Mosquitto).

Perhaps there should be a default set for the client (keeping the init API intact) and an option to change the expected RTT directly in the client object?

vonnieda commented 6 years ago

I've created a pretty simple fix for this issue in https://github.com/fireboardlabs/paho.mqtt.embedded-c/commit/b50fbf70932ff899ee85aa8b9d86389b5c0529bf. If anyone would like to give it a try, or offer feedback, before I send a PR that'd be helpful!

theOzzieRat commented 6 years ago

I am not sure that adding extra configuration parameters or timers for this is at all necessary. Having a look around the net I can't seem to find anythings that actually lets you adjust this. The only reference I found before when researching this and I can't find the link now, said that common practice was to simply add 10%-50% of the keepalive period as a grace period for packets being responded to. Hence my simplistic add 50%. This is probably overkill and could be reduced.

The other thing the keepalive code does is drop the connection on the first failed probe when it could have just been a dropped packet somewhere. Most keepalive setups seem to allow you to configure the number of failed probes before the connection is dropped. Many also allow you to set an interval between the failed probes and subsequent sending of the next probe (you don't want to wait the full keepalive period again). Many just hardcode this to a second or so. Using an extra timer to do this and counting failed probes seems like a better use for another timer.

There isn't a standard way of implementing keepalives except for a horribly outdated mention in RFC1122 which says the keepalive period should not be less than 2 hours.

Anyway, those are just some thoughts to prompt a little more thinking about what we are try to do here.

Cheers Simon

vonnieda commented 6 years ago

Hi @theOzzieRat, I've worked through the logic of your fix and I can't find any problems with it. Certainly cleaner and shorter than mine! :) I'll give it a shot and see how it performs.

vonnieda commented 6 years ago

@theOzzieRat I've tested your fix and it works as well (or better) then mine, with much less code, so I'm using it now. Thanks! :)

heezes commented 6 years ago

@theOzzieRat @vonnieda adding grace period to last_received is only going to increase the keepalive period on client end i dont see how it is a fix to the problem. Correct me if i'm wrong :)

theOzzieRat commented 6 years ago

@heezes The problem is that the client will send a keepalive, but not give any time for a response to be received. Adding a grace period gives the response time to come back.

vonnieda commented 6 years ago

Just following up on this: I used @theOzzieRat's fix for a couple months and it seemed to help when the connection was generally good, but still failed quite often when there was a slow or lossy connection. I've reverted back to the fix I wrote in https://github.com/fireboardlabs/paho.mqtt.embedded-c/commit/b50fbf70932ff899ee85aa8b9d86389b5c0529bf and that seems to be resolving the issue for me.

icraggs commented 5 years ago

I think that a correct implementation of the keep alive function is in the C++ MQTTClient library. The develop branch has an implementation which is almost identical, except for the ping response wait time.

template<class Network, class Timer, int MAX_MQTT_PACKET_SIZE, int b> int MQTT::Client<Network, Timer, MAX_MQTT_PACKET_SIZE, b>::keepalive() { int rc = SUCCESS; static Timer ping_sent;

if (keepAliveInterval == 0)
    goto exit;

if (ping_outstanding)
{
    if (ping_sent.expired())
    {
        rc = FAILURE; // session failure
        #if defined(MQTT_DEBUG)
            DEBUG("PINGRESP not received in keepalive interval\r\n");
        #endif
    }
}
else if (last_sent.expired() || last_received.expired())
{
    Timer timer(1000);
    int len = MQTTSerialize_pingreq(sendbuf, MAX_MQTT_PACKET_SIZE);
    if (len > 0 && (rc = sendPacket(len, timer)) == SUCCESS) // send the ping packet
    {
        ping_outstanding = true;
        ping_sent.countdown(this->keepAliveInterval);
    }
}

exit: return rc; }

abondarev84 commented 5 years ago

any guesses as to when this fix is going to hit release?

sebirdman commented 4 years ago

@icraggs it appears the current implementation of the develop branch for this fix has a bug that does not follow the c++ implementation nor the MQTT spec. It looks like:

                // Expect the PINGRESP within 2 seconds of the PINGREQ
                // being sent
                TimerCountdownMS(&c->pingresp_timer, 2000 );

should be

                // Expect the PINGRESP within one keepalive period of the PINGREQ
                // being sent
                TimerCountdown(&c->pingresp_timer, c->keepAliveInterval );

From what i can tell, this causes us to reconnect pretty often on bad wifi connections.

haneul49 commented 1 month ago

https://github.com/eclipse/paho.mqtt.embedded-c/issues/122#issuecomment-440647379 Here -> else if (last_sent.expired() || last_received.expired())

Had one doubt. What if I'm are continuously receiving data in my client, the last_received timer gets updated everytime but the last_send timer will be expired and hence everytime my keepAliveTime period expires it sends a ping, right ? But theoretically it is not required as there was a communication happening between the client and broker. Am I right. Can someone explain this situation.

eclipse / paho.mqtt.embedded-c

MQTTYield with low runtime timesout on slow connections #122