eclipse / paho.mqtt.python


Are PUBACKs from a previous session, sent due to clean_session=False, considered valid control packets for keep alive? #766

Closed harrandt closed 6 months ago

harrandt commented 6 months ago

I see a strange behaviour using the Paho Python client (paho-mqtt 1.6.1) in combination with the Java Moquette broker.

  1. Connect to the broker with clean_session=False and publish lots of messages with QoS=1, so that the broker will be busy processing its queue for quite a while
  2. Disconnect from the broker
  3. Connect to the broker with the same client ID and clean_session=False
  4. Start publishing
  5. Now there are lots of PUBACKs for messages from the previous session that the broker has to send before the ones for the current PUBLISHes.
  6. But I can see that after a while the client sends a PINGREQ, so it does not seem to count the "old" PUBACK packets as keep-alive traffic.

The broker seems to queue the PINGREQ behind the "old" messages, so it will process the messages from the previous session before answering it. Depending on the keepalive interval, the client will then drop the connection, even though it received lots of PUBACKs from the broker.
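For context, a minimal sketch of the scenario described above, assuming a broker on localhost:1883 and the paho-mqtt 1.x API; the topic, client ID and message counts are just placeholders:

```python
import time
import paho.mqtt.client as mqtt

# Session 1: queue up lots of QoS 1 publishes, then disconnect while the
# broker still owes PUBACKs for many of them.
client = mqtt.Client(client_id="keepalive-test", clean_session=False)
client.connect("localhost", 1883, keepalive=20)
client.loop_start()
for i in range(10000):
    info = client.publish("test/keepalive", payload=str(i), qos=1)
info.wait_for_publish()   # make sure at least the last message left the client
client.disconnect()
client.loop_stop()

# Session 2: same client ID, clean_session=False again, resume publishing.
# The broker now delivers the outstanding "old" PUBACKs while we publish.
client2 = mqtt.Client(client_id="keepalive-test", clean_session=False)
client2.connect("localhost", 1883, keepalive=20)
client2.loop_start()
for i in range(1000):
    client2.publish("test/keepalive", payload=str(i), qos=1)
    time.sleep(0.01)
client2.disconnect()
client2.loop_stop()
```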

The spec mentions

In the absence of sending any other Control Packets, the Client MUST send a PINGREQ Packet. [MQTT-3.1.2-23]

So my suspicion is that the Paho client only considers PUBACKs from the current session as Control Packets that reset the keep alive period.

harrandt commented 6 months ago

I just had a look at the source code and think there might be a different explanation, but I have not understood it in full.

if self._sock is not None and (now - last_msg_out >= self._keepalive or now - last_msg_in >= self._keepalive):

https://github.com/eclipse/paho.mqtt.python/blob/master/src/paho/mqtt/client.py#L2582

I have a publisher-only client, so according to this either last_msg_out or last_msg_in exceeding the keepalive will trigger a PINGREQ when _loop() is called.

So what's written here is actually true:

You may think that sending messages on a regular basis would stop the PING messages but it turns out that the keep alive interval applies to both publish and receive messages.
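To illustrate, the quoted condition boils down to something like the following standalone sketch (simplified, not the actual client.py code; variable names follow the linked source):

```python
import time

KEEPALIVE = 60  # the value passed to connect()

def pingreq_due(last_msg_out, last_msg_in, now=None):
    """A PINGREQ is due as soon as EITHER the outgoing OR the incoming side
    has been idle for a full keepalive interval. Publishing regularly
    therefore does not by itself suppress pings; the client must also keep
    receiving packets in time."""
    now = time.time() if now is None else now
    return (now - last_msg_out >= KEEPALIVE) or (now - last_msg_in >= KEEPALIVE)
```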

harrandt commented 6 months ago

Ok, I think I am getting to the bottom of this: the PUBACKs I receive arrive at a new client instance that has no memory of the previous session. So the PUBACKs are probably for message IDs used in the previous session, which may or may not also be in use in the current one.

The client cannot tell which session a PUBACK belongs to, and it starts the message IDs for the current session at 1, counting up.

So, if a PUBACK arrives, it could be one from the previous session whose ID is also in use in this session, and it might not be the PUBACK for the current-session message we are actually waiting for.

So we might never get it for the current message, right? So will the current message be resent and receive "its" PUBACK then?
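A toy illustration of why an "old" PUBACK can be ambiguous, assuming the usual 16-bit packet identifiers handed out sequentially per client instance (this is not paho code, just the numbering scheme):

```python
def mid_sequence():
    """Packet identifiers run 1..65535 and then wrap; a fresh client
    instance starts the sequence again at 1."""
    mid = 0
    while True:
        mid = mid % 65535 + 1
        yield mid

old_session = mid_sequence()
new_session = mid_sequence()

old_ids = {next(old_session) for _ in range(100)}  # IDs the broker still owes PUBACKs for
first_new_id = next(new_session)                   # 1 -- already present in old_ids

print(first_new_id in old_ids)  # True: a stale PUBACK for mid=1 is ambiguous
```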

MattBrittan commented 6 months ago

So we might never get it for the current message, right? So will the current message be resent and receive "its" PUBACK then?

Unfortunately this client currently only stores session information in memory. If you create a new Client this info is lost.

This means that, in your example, the new Client has no knowledge of the messages sent previously (the client should handle this correctly when using its own reconnection functionality). When the client receives a PUBACK it compares the ID to a map of those it has sent and, if it is found, the user code is notified and the ID is removed from the map. If the ID is unknown then the PUBACK is effectively ignored.
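In other words (a rough sketch; host, port and client ID are placeholders): reusing the same Client object and calling reconnect() keeps the in-memory session state, whereas constructing a new Client discards it:

```python
import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="keepalive-test", clean_session=False)
client.connect("localhost", 1883, keepalive=20)
# ... publish QoS 1 messages, then the connection drops or is closed ...

# Option A: reuse the same instance. The in-memory map of in-flight message
# IDs survives, so PUBACKs arriving after reconnection can still be matched.
client.reconnect()

# Option B: build a new instance with the same client ID. The broker resumes
# the old session, but this object has no record of the old message IDs, so
# their PUBACKs cannot be matched to anything.
fresh = mqtt.Client(client_id="keepalive-test", clean_session=False)
fresh.connect("localhost", 1883, keepalive=20)
```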

As an aside, the fact that the server is sending all of the PUBACKs upon reconnection is interesting. My reading of the spec is that V3 allows (but does not require) this, whereas V5 does not allow it. The V5 spec states that the server "MUST resend any unacknowledged PUBLISH packets (where QoS > 0) and PUBREL" and "Clients and Servers MUST NOT resend messages at any other time".

In terms of how this impacts keepalives: I don't believe it should. The timer is reset when any packet is received (regardless of any errors whilst processing it). Looking at the code that handles the PUBACK, it looks like the ACK is just ignored if the ID is not in the dict.
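Roughly what that behaviour amounts to, as a simplified self-contained sketch (not the actual paho source):

```python
import time

# Simplified model of the state described above.
last_msg_in = 0.0   # timestamp of the last packet received
in_flight = {}      # message ID -> payload of QoS 1 PUBLISHes we have sent

def handle_puback(mid):
    global last_msg_in
    last_msg_in = time.time()   # any received packet refreshes the keep-alive side
    msg = in_flight.pop(mid, None)
    if msg is None:
        return None             # unknown ID (e.g. from the old session): ignored
    return msg                  # known ID: acknowledged; caller notifies user code

in_flight[1] = b"hello"
print(handle_puback(1))         # b'hello' -> matched and removed
print(handle_puback(42))        # None -> silently ignored, keep-alive still refreshed
```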

I can think of one potential cause for the loss of connection (but this requires that the client is receiving only, not publishing). As per the spec:

It is the responsibility of the Client to ensure that the interval between Control Packets being sent does not exceed the Keep Alive value. In the absence of sending any other Control Packets, the Client MUST send a PINGREQ Packet [MQTT-3.1.2-23].

So the following could happen:

This should not be the case in your situation (as you mention that the client resumes publishing). Unfortunately without access to logs it's going to be difficult to diagnose further.

harrandt commented 6 months ago

Your assumption was correct. We were publishing to a customized MQTT broker and asked the provider about some internals. It is a customized Java Moquette with a painfully slow synchronous queue, so basically our PINGREQ got queued and delayed behind all the other work that had to go to the (slow) database. That, in combination with the "unknown id" PUBACKs from the previous session, led to the issue. But it is not an issue in Paho, so I will close this.

Basically our problem was that the producer was too fast and the consumer too slow because of its blocking DB processing queue.