ADT 2.X:3.0: Add ack on messages measured under KPI by Ruter

NiclasLindgren commented 1 year ago

Ruter uses, for instance, APC door data to calculate KPIs for passenger quality, but the MQTT layer can drop messages if a second message is published while one is still in transit (i.e. it cannot be forwarded because the network dropped momentarily). It is basically not possible to rely on the MQTT layer for this, especially not as it is set up. And even if it worked at the MQTT level, the receiving application can be offline even when the brokers are up.

To avoid this, an application-layer ACK should be added to all required messages, maybe based on traceId, but preferably also with a sequence number added to these messages, similar to the AVL message.

This way, the PTO can guarantee data delivery to Ruter and have better trust in KPI measurements.
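A minimal sketch of what such an application-layer ACK could look like on the sending side, assuming a JSON envelope. The `sequenceNumber` field name, the ACK payload shape and the `AckOutbox` class are hypothetical illustrations, not part of the ADT specification; only `traceId` is named in this issue.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class AckOutbox:
    """Keeps every published message until the receiver acknowledges it."""

    next_seq: int = 1
    pending: dict = field(default_factory=dict)  # sequence number -> payload

    def wrap(self, topic: str, body: dict) -> str:
        """Attach a traceId and a sequence number, and remember the message."""
        seq = self.next_seq
        self.next_seq += 1
        envelope = {
            "traceId": str(uuid.uuid4()),
            "sequenceNumber": seq,  # hypothetical field name
            "topic": topic,
            "body": body,
        }
        payload = json.dumps(envelope)
        self.pending[seq] = payload
        return payload

    def on_ack(self, ack_payload: str) -> None:
        """Forget a message once the receiving application confirms it."""
        seq = json.loads(ack_payload)["sequenceNumber"]
        self.pending.pop(seq, None)

    def unacked(self) -> list:
        """Payloads that still need a re-send, oldest first."""
        return [self.pending[seq] for seq in sorted(self.pending)]
```

The sender would publish the result of `wrap(...)` over MQTT as usual and periodically re-send whatever `unacked()` returns. Because the ACK comes from the receiving application rather than the broker, it also covers the case where the brokers are up but the application is offline.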

Suggested minimum messages for ack

DoorsIndividually
AssignmentAttempt/SignOn

AVL should also be considered. If real-time acking isn't preferred for AVL, an offline batch upload function could be used instead, where a history log of all positions sent is uploaded daily.
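As a rough illustration of how a daily history log could be reconciled with what was actually received, assuming each AVL entry carries a sequence number. The `sequenceNumber` key and the `missing_positions` helper are hypothetical.

```python
def missing_positions(history_log, received_seqs):
    """Return the logged entries whose sequence number never arrived."""
    received = set(received_seqs)
    return [
        entry for entry in history_log
        if entry["sequenceNumber"] not in received
    ]
```

Only the entries returned here would need to be filled in from the uploaded log; everything else has already arrived in real time.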

eivindga commented 1 year ago

Ruter intends to create an operator portal where KPI measurements will be available in real time, and the underlying data used to calculate the KPIs will be available for browsing and download by the PTO.

Most of the topics mentioned should be sent using QoS 1, which guarantees that the data will arrive at the recipients that are connected. If the recipient (Ruter) is offline while the MQTT broker is up, re-sending the data later does not help much, as the KPI has strict time requirements for when the data should arrive. These cases will most likely have to be handled manually.

"AVL should also be considered. If real-time acking isn't preferred for AVL, an offline batch upload function could be used instead, where a history log of all positions sent is uploaded daily."

The operator portal will provide an endpoint for downloading all data related to KPIs. Due to the size of the data, there will be a limit to how much historical data we can provide. The exact limits are still to be determined.

NiclasLindgren commented 1 year ago

"Most of these topics mentioned should be sent using QOS 1, which guarantee that the data will arrive to the recipients which are connected."

This statement is not true: it only guarantees that if you have not published another message on the same topic. We are talking about the scenario where you intermittently lose the mobile connection, which happens every now and then.

"there is not much help in re-sending the data later as the KPI has strict time requirements for when the data should arrive" You have different KPIs; failing the whole APC KPI because one message was lost is probably not a good idea. And I see a lot of value in having full fidelity in the follow-up; don't mix that up with real-time.

So if you want a proper follow-up, a re-send or an after-the-fact upload should be considered, or you need to change the approach to how KPIs are viewed. Why produce a bad KPI because of intermittent connection issues, just because 2 messages happened to be sent within 4 seconds of a connection loss?

Most other systems we have integrated with have this ack scenario for data that is used not only in real time but also for follow-up. And on top of that you have the problem with the MQTT layer used here.

eivindga commented 1 year ago

I think there is a misunderstanding here. MQTT supports intermittently losing mobile connections and re-sending messages. This is a very common use case.

There are multiple implementations of the MQTT protocol, and depending on the software you are running there are different settings that can be adjusted to increase the buffering / queuing of unsent messages.

Please see: http://www.steves-internet-guide.com/mqtt-client-message-queue-delivery/ https://www.hivemq.com/blog/mqtt-essentials-part-7-persistent-session-queuing-messages/

Message buffering is done in memory and most clients will buffer or queue messages by default. The client will usually have a setting that governs buffering, and the Python client and Node-RED client default to unlimited. That doesn’t mean that all messages will be queued, as queueing will eventually fail once the memory is consumed. When this buffer is full the client has no option but to overwrite existing messages or discard new messages.
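The two overflow behaviours described above (overwrite existing messages, or discard new ones) can be illustrated with a small stand-alone model. This is only a sketch of the idea, not the buffering code of any particular MQTT client; `BoundedBuffer` and its parameters are hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """In-memory send buffer with a fixed capacity and a chosen drop policy."""

    def __init__(self, max_size, drop_oldest=True):
        self.max_size = max_size
        self.drop_oldest = drop_oldest
        self.queue = deque()
        self.dropped = []  # messages lost to overflow, kept for inspection

    def enqueue(self, message):
        """Queue a message; return False if the new message was discarded."""
        if len(self.queue) < self.max_size:
            self.queue.append(message)
            return True
        if self.drop_oldest:
            # Overwrite policy: the oldest queued message is lost.
            self.dropped.append(self.queue.popleft())
            self.queue.append(message)
            return True
        # Discard policy: the new message is lost.
        self.dropped.append(message)
        return False
```

Either way a message is lost once the buffer is full, which is why the buffer size and the length of the expected outage need to be considered together.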

If you experience messages not being re-sent when losing the connection intermittently, some areas to look into are:

message queue size settings
not enough memory in mqtt bridge (hardware limits)
moving queue from memory to persistent disk.
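For the last point, one possible way to keep the queue on persistent disk is a small SQLite-backed outbox in front of the MQTT client. This is a sketch under that assumption; the `DiskQueue` class and the `outbox` table are hypothetical and not a feature of any specific MQTT bridge.

```python
import sqlite3


class DiskQueue:
    """Outbox that survives restarts by storing unsent messages in SQLite."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "topic TEXT NOT NULL, "
            "payload TEXT NOT NULL)"
        )
        self.db.commit()

    def put(self, topic, payload):
        """Store a message before publishing it; return its row id."""
        cur = self.db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            (topic, payload),
        )
        self.db.commit()
        return cur.lastrowid

    def peek_all(self):
        """Return all unsent messages, oldest first."""
        return self.db.execute(
            "SELECT id, topic, payload FROM outbox ORDER BY id"
        ).fetchall()

    def delete(self, row_id):
        """Remove a message once delivery has been confirmed."""
        self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        self.db.commit()
```

A message would be written with `put(...)` before it is published and removed with `delete(...)` only after delivery is confirmed, so anything still in the table after a reconnect or restart can be published again.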

RuterNo / adt-doc

ADT 2.X:3.0: Add ack on messages measured under KPI by Ruter #146