Current LWT mechanism doesn't quite give us a definitive state

vtgav commented 2 days ago

I've recorded the exchange around this, and dropped further emails as following comments.

Original question : (Gav)

Device records an LWT

Device disconnects (unplanned)

Broker sends LWT message to subscribers

. . Time passes . .

Device reconnects

Q. How does the subscriber know? MQTT doesn’t mandate any activity (as far as I know), so would the device send a status message, which the subscriber can then use to ‘do some stuff’ like request missing data?

(Steve)

The LWT signals a broken connection. On connection/reconnection a Status message should always be sent so the device can be marked online by the SA. (This should be in the doc – isn’t it?).

I discussed a similar issue to this with Stu at the PSAC meeting. If a device disconnects gracefully (no LWT) then the SA is unaware.

(Mark)

My answer to Gavin’s original question would be that the SA knows the device has reconnected because of the Status message it sends. However it does not know when the device disconnects unless the device disconnects poorly and the LWT is sent.

(Gav)

Thx for the responses @Steve @Mark – and thinking about it more I’m a bit troubled by something Will Homer (@ Ovarro ) just said to me – so please correct me if I’m wrong here . . .

We’ve defined a separate LWT topic which is retained, so if a device disconnects ( breaks ) the broker will publish the LWT message.

The device reconnects and sends a status message – also retained

So, let’s say we’ve gone through that cycle, so we now have both a retained LWT and a retained status. Then let’s say the SA disconnects for some reason and reconnects: the SA will get both the retained LWT message and the retained status message, and the order it gets them in is not defined (I don’t think), so the SA may or may not decide that the device is disconnected and do some follow-up activity.

I suppose my question here, assuming I’ve correctly interpreted what will happen, is: wouldn’t it be better for the LWT to write a status message (which says the device is disconnected), so that you’d see the latest, correct status, rather than parallel retained LWT and status messages which could mislead?
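To make the two cases concrete, here is a minimal sketch in Python. It is a toy model of a broker's retained-message table, not real MQTT client code, and the topic names `dev1/status` and `dev1/lwt` are hypothetical rather than taken from the specification. It only shows what is left retained after a connect / break / reconnect cycle.

```python
class RetainedStore:
    """Toy stand-in for a broker's retained messages: one payload per topic."""

    def __init__(self):
        self._retained = {}

    def publish(self, topic, payload):
        # A retained publish replaces whatever was retained on that topic.
        self._retained[topic] = payload

    def snapshot(self):
        # What a newly (re)subscribed SA would be sent.
        # Delivery order across topics is deliberately not modelled.
        return dict(self._retained)


def run_cycle(status_topic, lwt_topic):
    """Connect, break, reconnect; return what the broker still retains."""
    broker = RetainedStore()
    broker.publish(status_topic, "connected")   # device connects, sends status
    broker.publish(lwt_topic, "broken")         # connection breaks, will is published
    broker.publish(status_topic, "connected")   # device reconnects, sends status
    return broker.snapshot()


# Separate topics: both messages survive, so a reconnecting SA sees two
# contradictory retained messages.
two_topics = run_cycle("dev1/status", "dev1/lwt")

# Will published on the status topic: only the most recent message survives.
one_topic = run_cycle("dev1/status", "dev1/status")

print(two_topics)
print(one_topic)
```

With separate topics the SA is left holding both `connected` and `broken`; with a shared topic the later publish overwrites the earlier one, so only the latest state remains.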

vtgav commented 2 days ago

(Mark)

Given that MQTT is by its nature asynchronous, I would have thought it is very difficult to make an assessment that the Field Device is attached to the broker, other than maybe by sending a request and getting an answer quickly? I do not think we have built into the Lucid specification a mechanism of knowing this for definite. If you all think we have then I think it should be explained as such, either in the specification or in an accompanying article?

My thoughts are that Lucid is asynchronous in the normal state of affairs and so SAs in general, cannot rely on a guarantee of the device being connected. Some SAs may publish to a topic and see a quick answer on a different topic to which they have subscribed then assume some level of connectivity, but other than that we do not explicitly support any such mechanism.

The specification says in Section 7.1 that the status message is sent on connection to the broker. So that covers off Steve’s point I think. However the diagram in Section 7 seems to suggest an available/unavailable state which I am not sure we can really support? Maybe we need to describe what we mean by connection state; is it something to do with the SAs?

My answer to Gavin’s original question would be that the SA knows the device has reconnected because of the Status message it sends. However it does not know when the device disconnects unless the device disconnects poorly and the LWT is sent.

Cheers

Mark

vtgav commented 2 days ago

(Mark)

Hi Gav,

I think you are right. At the moment we would have a retained LWT and a retained status and without timings you cannot tell which came last.

I agree with your suggestion that if we were to publish the LWT to the status topic with some indication of a break then you would know that the last status was a problem.

However, our status is not just a simple status message, it includes things like config number, which might change after opening the connection and hence an LWT message might not reflect the latest version or could not include this information. This would mean that we break the mechanism we have for determining if config has worked and what the current configuration version number is. So, unfortunately, I do not think that would work.

If you really want a definitive connection status, then we could have a separate topic called something like connectionstatus. We could publish “online” when connecting, publish “offline” when disconnecting normally and have an LWT which publishes “broken” or the like when a disconnect happens. I think this might work, but is not included in the current specification.
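As a rough illustration of that idea, the sketch below models it in plain Python. It is a toy broker, not a real MQTT library: the class, the method names, the client id `dev1` and the topic `dev1/connectionstatus` are all assumptions made for the example. The one piece of MQTT behaviour it relies on is that the broker publishes a client's will when the connection is lost and discards it on a normal disconnect.

```python
class ToyBroker:
    """Minimal model of retained messages plus last-will handling."""

    def __init__(self):
        self.retained = {}   # topic -> last retained payload
        self.wills = {}      # client id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        # The will is registered at connection time.
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload):
        self.retained[topic] = payload

    def disconnect(self, client_id):
        # Normal disconnect: the broker discards the will without publishing it.
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        # Unplanned disconnect: the broker publishes the will on the client's behalf.
        topic, payload = self.wills.pop(client_id)
        self.publish(topic, payload)


TOPIC = "dev1/connectionstatus"   # hypothetical topic name


def device_connect(broker):
    broker.connect("dev1", will_topic=TOPIC, will_payload="broken")
    broker.publish(TOPIC, "online")


def device_disconnect(broker):
    broker.publish(TOPIC, "offline")
    broker.disconnect("dev1")


broker = ToyBroker()
device_connect(broker)
print(broker.retained[TOPIC])      # online
broker.connection_lost("dev1")
print(broker.retained[TOPIC])      # broken
device_connect(broker)
device_disconnect(broker)
print(broker.retained[TOPIC])      # offline
```

Because every transition writes to the same retained topic, a reconnecting SA only ever sees one value, and it is the most recent one.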

I think the current protocol specification does not permit you to definitively determine connected state. Do we need to raise a ticket about adding this?

My two-pennies 😊

Cheers

Mark

vtgav commented 2 days ago

(Gav)

Sort of ties in with some notes in here https://www.hivemq.com/blog/mqtt-essentials-part-9-last-will-and-testament/

vtgav commented 2 days ago

Thanks Gav and Mark,

Sparkplug adds a sequence number to the LWT (NDEATH) message which matches a sequence number of the NBIRTH message. I suppose we could use the Birth message timestamp – add that to the LWT message on connection?
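A minimal sketch of how an SA might apply that, assuming both the retained Status and the retained LWT carry the timestamp of the connection's first Status message. The field name `birth_ts` and both message classes are hypothetical and not part of the current specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Status:
    birth_ts: int   # hypothetical: timestamp of the Status sent on connection


@dataclass
class Lwt:
    birth_ts: int   # hypothetical: same value, copied into the will at connect


def device_state(status: Status, lwt: Optional[Lwt]) -> str:
    """Decide which retained message describes the current connection.

    The result does not depend on the order in which the two retained
    messages arrive, only on the values they carry.
    """
    if lwt is None:
        return "online"          # no will has ever been published
    if status.birth_ts > lwt.birth_ts:
        return "online"          # the will belongs to an earlier connection
    return "broken"              # the will was published for this connection


# Will from an earlier connection, newer Status: device has reconnected.
print(device_state(Status(birth_ts=2000), Lwt(birth_ts=1000)))   # online

# Will carries the same timestamp as the Status: that connection broke.
print(device_state(Status(birth_ts=2000), Lwt(birth_ts=2000)))   # broken
```

This only distinguishes a broken connection from a live one; a clean disconnect publishes no will, so it is not covered by this check.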

Steve

vtgav commented 2 days ago

Hi Steve,

Interesting. Our birth (by which I assume you mean Status) message carries a timestamp, and if the LWT is used then the LWT also has a timestamp, which is set to when the connection is made (roughly). These could be used already to do what we are talking about, as the two timestamps should be very close, if not the same.

The only thing we would be missing would be a disconnect status indication when the FD cleanly disconnects.

Cheers

Mark

vtgav commented 2 days ago

OK, I can see that being a workable solution . . .

however, I do think it’s a sort of sticking plaster for something that basically doesn’t work. We end up with a device ‘state’ that is effectively distributed between two topics and we have to compare message timestamps to figure which one to use – it works but it’s really ‘unclean’ 😉

So, I suggest we ought to mull it over, possibly rethink it, and consider how to make it simple and robust – we ought to have a single device status somehow ??

Gav

LucidProtocol / Lucid-Specification

Current LWT mechanism doesn't quite give us a definitive state #88