Is re-publishing messages after clean_state=True a deliberate design decision? How to handle Session Present = False?

eclipse / paho.mqtt.python

paho.mqtt.python

Other

2.19k stars 723 forks source link

Is re-publishing messages after clean_state=True a deliberate design decision? How to handle Session Present = False? #733

Closed couling closed 2 months ago

couling commented 1 year ago

I'm hunting for an issue explaining this in the documentation (found here), but I've not found one:

Also when clean_session is True, this library will republish QoS > 0 message across network reconnection. This means that QoS > 0 message won’t be lost. But the standard say that if we should discard any message for which the publish packet was sent. Our choice means that we are not compliant with the standard and it’s possible for QoS 2 to be received twice. You should you clean_session = False if you need the QoS 2 guarantee of only one delivery.

Was this a deliberate design decision or is it still up for someone to offer a "fix"?

I'm concerned because even though it might be possible to work around clean_session=False simply by creating a new client, I don't see any work around for Session Present = False 3.2.2.2 Session Present.

That is if the MQTT broker loses the session for any reason, including administrative action, the client will re-send duplicate QOS 2 messages and the broker will have no mechanism to de-duplicate them. I don't see a way for code using Paho to handle Session Present = False itself to prevent duplicate QOS 2 messages.

MattBrittan commented 9 months ago

But the standard say that if we should discard any message for which the publish packet was sent.

This does not look quite correct to me. QOS0 messages are not acknowledged (so we have no way to confirm if the broker has received them) meaning that once a QOS0 message has been transmitted it's discarded.

However if the connection is lost before the message is transmitted (or is down when publish is called) then the message will be queued and transmitted when the connection comes up (I think this is what the quote is referring to). I don't believe there is anything in the spec that specifically prohibits this (the session state section states that it may include "Optionally, QoS 0 messages pending transmission to the Client.") (and nothing in the spec really states what the client side library should do).

Creating a new client would prevent the messages being resent (because this library holds the state in memory; a new client will have a fresh state).

That is if the MQTT broker loses the session for any reason, including administrative action, the client will re-send duplicate QOS 2 messages and the broker will have no mechanism to de-duplicate them.

I believe the library may deliver a message twice; this could occur when:

Message is received; handler called
Connection drops before server receives PUBREC
Connection resumed; server resends PUBLISH

This could be addressed by checking _in_messages[message.mid] to see if we are are already waiting for the PUBREL before calling the handler (so the handler is not called the second time the message is received).

Hopefully this helps a bit; note that I don't usually work on this library (just trying to help tidy up issues) so there may be errors in the above.

couling commented 7 months ago

@MattBrittan not really sure what more info is needed. It's a shame this didn't make the cut for V2.0 because this may end up being a breaking change. Not sure.

Reading between the lines Paho is not correctly handling the "session present" flag in CONNACK packet 3.2.2.2, and it's certainly not honouring the definition of QOS 2 as "Deliver Exactly once" when it duplicates a message without permission from the calling application.

The correct behaviour here is hinted at in the standard (3.2.2.2):

In the event that the value of Session Present received by the Client from the Server is not as expected, the Client can choose whether to proceed with the Session or to disconnect.

All Paho needs to do here is to disconnect and raise an exception when clean_state=False and the broker responds with "session present 0". Then it's up to the application developer what they should do next.

Suggested Approach

To avoid backward compatibility issues, make the current retry logic on "session present 0" an optional "feature". Let's call it unsafe_retry. For backwards compatibility unsafe_retry=True can be the default.

Then if unsafe_retry == False and clean_state=False and the server responds with "session found 0": Paho can immediately close the connection before any further retry of any sort and raise an exception.

Ideally, Paho would offer some way for the application to know what is stuck in the outbound queue at that point.

After that it's on the application developer. They can discard the Paho session and begin a new one, and it's on them to chose what to send and what not to send that was previously stuck in the outbound queue.

MattBrittan commented 7 months ago

Apologies; it looks like I may have partly misinterpreted your issue (was working through a lot of old issues at the time and must have misread the quote). I ended up focusing mostly on QOS0 which is not really relevant here. I then talked about issues handling QOSS2 messages that were being received (which is an issue, but you are focusing on sent messages).

So with a focus on QOS1+ messages being transmitted from the client, when clean_session = True...

Firstly I think it's worth noting that, if I'm understanding you correctly, this will be a pretty rare situation. The following would need to occur:

Client connects with clean_session = False (meaning the session will outlast the connection)
Client publishes QOS1+ message (has not received ACK)
Client closes connection
Client re-connects with clean_session = True

(there is an alternative, which you allude to in your response ref "session present, where the broker clears the session, generally this would only happen after a relatively lengthy loss of connection or if a broker without a persistent session store is restarted).

Should the above happen then any QOS1/2 guarantees are really moot. The client has no way of knowing (or finding out) whether the message it published was received (the server will have effectively thrown away all info regarding the client). As such the choice to resend does not seem unreasonable (generally a duplicate message is better than no message; QOS2 will be broken either way due to the removal of the session state).

Reading between the lines Paho is not correctly handling the "session present" flag in CONNACK packet 3.2.2.2,

Interesting the V5 spec has something to say on this (it's more strict than the section of the V3 spec you quoted):

If the Client does have Session State and receives Session Present set to 0 it MUST discard its Session State if it continues with the Network Connection.

It does not look like this client currently adheres to this rule (the call to _messages_reconnect_reset_out should probably happen after the CONNACK is received rather than before opening the connection).

Then it's up to the application developer what they should do next.

You can actually handle this yourself if you want in the on_connect callback (this is called before any messages are sent and could clear the queue).

Anyway sorry about misunderstanding initially; I hope I've provided info relevant to your question this time! This is certainly something that could be addressed but I'm not sure that its a priority (and think there is a way of achieving what you want without changes to the library).

couling commented 7 months ago

You can actually handle this yourself if you want in the on_connect callback (this is called before any messages are sent and could clear the queue).

Sounds promising. Let's stress test the idea. Are you able to tell me:

Does the on_connect callback happen before any retry? (allowing the callback to closeand prevent Paho from retrying QOS2 messages)
What would happen to blocking publish() calls? Would they fail on an exception only after the on_connect callback deliberately closes the connection?
On exception, is there any way for blocking publish() to determine if the message made it to the wire or not. There's a difference here between failing while waiting in a queue, vs failing while working through the 4-way QOS2 transmission.

(there is an alternative, which you allude to in your response ref "session present, where the broker clears the session, generally this would only happen after a relatively lengthy loss of connection or if a broker without a persistent session store is restarted).

No. A simple re-deploy of the the broker can commonly trigger it. If the out-going broker instance doesn't wait to complete all QOS2 messages, then anything "in-flight" is going to get lost. Remember that QOS2 requires 4 messages to complete, so this is kinda likely if the broker shuts down too quickly....

... depending on the broker, the new incoming broker won't be handed any session state information. So all clients just get told "session found 0".

Should the above happen then any QOS1/2 guarantees are really moot. The client has no way of knowing (or finding out) whether the message it published was received (the server will have effectively thrown away all info regarding the client). As such the choice to resend does not seem unreasonable (generally a duplicate message is better than no message; QOS2 will be broken either way due to the removal of the session state).

No. That's the really dangerous assumption I'm getting at. You are assuming that duplication is safe for a message flagged as "exactly once". The client has explictly marked the message as "do not duplicate", accepting a performance hit for flagging it such. So assuming that duplication is okay and pragmatic seems very wrong indeed.

Please see the post office scandle in the UK for explanation of why silently duplicating records can be bad. While I'm not suggesting anyone use MQTT for financial records, my point is that Paho has absolutely no way to know what damage it's doing by duplicating a message that was supposed to be "Exactly once".

You are right that we are discussing an unrecoverable situation. So the better thing to do, much better thing to do, is tell the application and let it figure out what to do next.

MattBrittan commented 7 months ago

Sounds promising. Let's stress test the idea. Are you able to tell me:

on_connect is called when CONNACK is received during the connection process (after checking it's not an outright rejection) so applies to all connection attempts. Tracking the status of an outgoing message is done with MQTTMessageInfo; this is not, as far as I can see, updated when a message is removed from the queue without being published. There is certainly room for improvement here, wait_for_publish provides a timeout option to cater for undelivered messages but that's not really a great solution).

No. A simple re-deploy of the the broker can commonly trigger it.

If your broker does not retain session information when re-deployed then you run the risk of message loss ("Exactly once delivery" is not achievable in this situation).

You are right that we are discussing an unrecoverable situation. So the better thing to do, much better thing to do, is tell the application and let it figure out what to do next.

Unfortunately ease of use dictates that some decisions be made for the user (a lot of library users have minimal knowledge of the protocol and just expect the library to work). MQTT appears, at first glance, to be very simple however there are a surprising number of edge cases and libraries get used in ways the authors never considered.

Anyway I believe it's possible to achieve your goal using on_connect (realise it's not ideal and would involve a bit of work on your side). Adding functionality to the library is an option, and a PR would be welcome (but probably best to raise an issue with a proposed design prior to embarking on this). I believe this would probably need to take the form of returning an error in wait_for_publish and moving the call to _messages_reconnect_reset into _handle_connack (would probably suggest a more descriptive flag than unsafe_retry :-) ).

I'll remove the "More info needed" flag but please do consider raising this as a new issue that includes elements of this discussion (due to the number of old issues I'm not sure how many people will review this one).

couling commented 2 months ago

I've finally found time to evaluate the on_connect. It can be used to prevent duplication but still leaves a problem over knowing which messages failed to send.

@MattBrittan

I still think current behaviour is, very bluntly, wrong!

Unfortunately ease of use dictates that some decisions be made for the user (a lot of library users have minimal knowledge of the protocol and just expect the library to work). This is no edge case! It's a core feature of MQTT described in the abstract!:

This is no edge case, it's a core feature described in the abstract of the standard.

"Exactly once", where messages are assured to arrive exactly once. This level could be used, for example, with billing systems where duplicate or lost messages could lead to incorrect charges being applied.

Silently corrypting a user's data is not "ease of use". Sometimes things are hard no matter your design, and the right thing to do is put guard rails with well sign posted error messages so that users don't shoot themselves in the foot.

MQTT appears, at first glance, to be very simple however there are a surprising number of edge cases and libraries get used in ways the authors never considered.

Oof. I can't agree with that. Maybe in earlier 3.x but not in 5. The standard is centered on some very simple core concepts which don't have much overlap to cause curious edge cases. However it is very easy to misread the protocol standard and orient a client API around some subtle misunderstanding. This then makes the API have surprising behaviour to the user.

Case in point, the standard has a lot devoted to reconnection and maintaining a session state accross reconnects ensuring it's consistent. Yet Paho's design seems to ignore a lot of the founational understanding of this. Eg: it makes very little sense for paho to have built in auto-reconnect while expecting the user to directly set clean_start. If paho doesn't have a session state held in memory it should set clean_start=1 to ensure the client and server state match, and then set clean_start=0 on any subsiquent reconnect.

MattBrittan commented 2 months ago

I still think current behaviour is, very bluntly, wrong!

There will be numerious things in this library that are "wrong" because it's evolved over years (with numerous authors all having different needs) and, as we saw with the V2 release, change is hard!. So we need to be pragmatic here in terms of finding a solution that meets your needs (unless you have the time to write a new version that addresses the failings; this is the approach we attempted with the Go Client and I still have not got it to v1.0 :-) ). Pull requests are welcome but do need to be mindful of breaking users existing code.

If paho doesn't have a session state held in memory it should set clean_start=1 to ensure the client and server state match, and then set clean_start=0 on any subsiquent reconnect.

This is actually a good example; many users don't really care about "exactly once" delivery, but do want to receive messages that have come in since their app was restarted (so clean_start=0 is needed for the initial connection, or those messages are lost). This is not in compliance with the spec (as the client does not, in fact, have a session) but is what a lot of users will expect (and a pragmatic solution given the client does not store the session to disk). There is not really a right answer here....

Anyway I think it's best to focus on your specific issue. My thought was that in on_connect you could clear the queue. However I really would suggest opening a new issue (ideally with a proposed solution) because it's more likely to get noticed (I'm not actively developing this library, just trying to help manage the issues as they had got a bit out of control).

couling commented 2 months ago

There will be numerious things in this library that are "wrong" because it's evolved over years (with numerous authors all having different needs) and, as we saw with the V2 release, change is hard!

Yup, I know. You'll note my earlier comments back in Feb. You'll note my earlier comments

It's a shame this didn't make the cut for V2.0 because this may end up being a breaking change.

What I meant by this was that when I first filed the issue, it was because I was willing to file some PRs myself, but needed to understand the chance of them getting approved before I sunk 2 weeks of business hours into fixing them. Getting PRs dismissed because a maintainer isn't willing to have a particular problem fixed is a really frustraiting way to burn time.

The problem isn't the time to write them, it's understanding that there is any willingness to accept such changes at the end of the process. It's the oppertunity for some kind of steer on what form the changes should take.

Sadly nobody responded at all prior to 2.x being released and so I'm guessing that getting potentially breaking changes through are far less likely to be approved.

This is actually a good example; many users don't really care about "exactly once" delivery, but do want to receive messages that have come in since their app was restarted (so clean_start=0 is needed for the initial connection, or those messages are lost). This is not in compliance with the spec

Indeed it is a good example. What you say there reinforces my reason for not filing any PRs yet: What's needed are a bunch of smaller changes some breaking a use case with others fixing it again from a different angle.

I'll close this now in favour of #855 and seek to get some time to offer PRs towards that.

MattBrittan commented 2 months ago

Excellent - thanks for the new issue. One suggestion - perhaps note at the top that you are prepared to work on this and are looking to maximise the chance of a PR being accepted (I'm not sure if anyone else is currently in a position to take on a project like this, it will be a fair amount of work).

eclipse / paho.mqtt.python

Is re-publishing messages after clean_state=True a deliberate design decision? How to handle Session Present = False? #733

Suggested Approach

On exception, is there any way for blocking publish() to determine if the message made it to the wire or not. There's a difference here between failing while waiting in a queue, vs failing while working through the 4-way QOS2 transmission.

On exception, is there any way for blocking `publish()` to determine if the message made it to the wire or not. There's a difference here between failing while waiting in a queue, vs failing while working through the 4-way QOS2 transmission.