Sometimes messages sent over the Project Link nodes do not arrive

robmarcer commented 2 months ago

Current Behavior

It has been reported that sometimes messages sent via the FF Project Link nodes do not arrive at the other end.

Expected Behavior

Assuming there are no networking issues and the flows have been configured correctly, all messages should arrive at their intended destination.

Steps To Reproduce

I can't reproduce this bug at the moment, but I'm raising an issue so we have a place to store customer reports and any theories about what might be causing it.

Environment

FlowFuse version:
Node.js version:
npm version:
Platform/OS:
Browser:

robmarcer commented 2 months ago

This customer reported this issue on Friday (26th April 2024) - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/1956

robmarcer commented 2 months ago

I'm setting up a test based on my devices demo here - https://app.flowfuse.com/instance/cbdcbf3a-da70-468c-941e-c333ea1a0e43/overview

The demo consists of 65 devices (~6000 miles away from FF Cloud) which reply to a 'ping' from a NR instance running on FF Cloud. The pings are sent every 5 seconds. I will update the demo to alert me if any of the pings do not make it back to the instance on FF Cloud.

robmarcer commented 2 months ago

I am seeing evidence of occasional missing messages. This API returns each ping and a count of the devices which responded. If the count is less than 64, something failed - https://hmi-development.flowfuse.cloud/export

I think it would be worth someone validating how I'm producing this data; it's totally possible there is a bug in my flows.
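
For anyone validating that data, here is a minimal sketch of the check in Python. It assumes the export can be reduced to a list of sent ping ids plus (ping id, device id) reply pairs. That shape, the function name and the `EXPECTED_DEVICES` constant are hypothetical and not taken from the actual flows.

```python
EXPECTED_DEVICES = 64  # assumption: the threshold quoted above; adjust to the real fleet size


def find_incomplete_pings(sent_ping_ids, replies, expected=EXPECTED_DEVICES):
    """Return {ping_id: missing_count} for pings with too few distinct responders.

    sent_ping_ids: ids of every ping that was sent (hypothetical shape).
    replies: iterable of (ping_id, device_id) pairs (hypothetical shape).
    """
    # Start from the pings that were sent so a ping with zero replies is still reported.
    responders = {ping_id: set() for ping_id in sent_ping_ids}
    for ping_id, device_id in replies:
        # A set ignores duplicate replies from the same device.
        responders.setdefault(ping_id, set()).add(device_id)
    return {
        ping_id: expected - len(devices)
        for ping_id, devices in responders.items()
        if len(devices) < expected
    }
```

Counting distinct device ids rather than raw replies means a duplicated reply cannot mask a missing one.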

knolleary commented 2 months ago

We need to correlate any message drops with the underlying connectivity of the nodes. My theory is that the nodes are having their ws-mqtt connection bounce, and that the node doesn't do any store/forward whilst disconnected. I'm not sure that's something easy for you to do with the nodes as-is. I'll have a think on how we can debug this.

knolleary commented 2 months ago

I was mistaken about my theory - the mqtt library we use does do store/forward by default. Did a quick local test where I dropped the device's mqtt connection whilst continuing to send messages from it. Once it reconnected, the messages were forwarded on without any being dropped.

In this test, the project nodes were sending to a hosted instance rather than another device. Next I'm going to look at the receive side of the equation - do the messages get discarded if the subscriber goes offline?
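
The publish-side store/forward behaviour can be illustrated with a toy model in Python. This is only a sketch of the idea, not the mqtt library's code; the class and method names are made up for illustration.

```python
from collections import deque


class StoreAndForwardPublisher:
    """Toy publisher that buffers messages while offline (illustrative only)."""

    def __init__(self, send):
        self._send = send        # callable that delivers one message
        self._pending = deque()  # messages waiting for a connection
        self.connected = False

    def publish(self, message):
        if self.connected:
            self._send(message)
        else:
            # Store the message instead of dropping it.
            self._pending.append(message)

    def on_connect(self):
        self.connected = True
        # Forward stored messages in the order they were published.
        while self._pending:
            self._send(self._pending.popleft())

    def on_disconnect(self):
        self.connected = False


delivered = []
pub = StoreAndForwardPublisher(delivered.append)
pub.on_connect()
pub.publish("m1")
pub.on_disconnect()
pub.publish("m2")  # buffered while offline
pub.publish("m3")  # buffered while offline
pub.on_connect()
print(delivered)  # ['m1', 'm2', 'm3']
```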

knolleary commented 2 months ago

Can confirm the messages are discarded if the subscriber is not connected. Need to pick through our connection settings here. Any changes we make will potentially mean the broker has to start storing messages indefinitely for offline devices - that can become unmanageable. Will need to look at both the project node connection settings (clean session/qos etc) as well as broker configuration around persistent state and queue depths etc.
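
To make the trade-off concrete, here is a toy broker model in Python. It is a sketch under simplifying assumptions: a single topic that every client is subscribed to, a hypothetical `max_queued` limit per offline client, and a drop-oldest policy when that limit is hit. It does not describe how the real broker is configured.

```python
from collections import deque


class ToyBroker:
    """Single-topic broker model: every connected client is a subscriber."""

    def __init__(self, max_queued=3):
        self.max_queued = max_queued  # hypothetical per-client queue depth limit
        self.sessions = {}            # client_id -> session state
        self.delivered = {}           # client_id -> messages handed to the client

    def connect(self, client_id, clean_session, qos):
        old = self.sessions.get(client_id)
        # A clean session (or a first connection) starts with an empty queue.
        queue = deque() if clean_session or old is None else old["queue"]
        self.sessions[client_id] = {
            "online": True, "clean": clean_session, "qos": qos, "queue": queue,
        }
        inbox = self.delivered.setdefault(client_id, [])
        while queue:  # hand over anything stored while the client was offline
            inbox.append(queue.popleft())

    def disconnect(self, client_id):
        session = self.sessions[client_id]
        if session["clean"]:
            del self.sessions[client_id]  # subscription and state are forgotten
        else:
            session["online"] = False     # session is kept for later

    def publish(self, message, qos=1):
        for client_id, session in self.sessions.items():
            if session["online"]:
                self.delivered[client_id].append(message)
            elif min(qos, session["qos"]) >= 1:
                if len(session["queue"]) >= self.max_queued:
                    session["queue"].popleft()  # assumed policy: drop the oldest
                session["queue"].append(message)
            # QoS 0 messages for an offline client are simply dropped here.


broker = ToyBroker(max_queued=2)
broker.connect("clean", clean_session=True, qos=1)
broker.connect("persistent", clean_session=False, qos=1)
broker.disconnect("clean")
broker.disconnect("persistent")
for n in range(1, 4):
    broker.publish(f"m{n}")
broker.connect("clean", clean_session=True, qos=1)
broker.connect("persistent", clean_session=False, qos=1)
print(broker.delivered["clean"])       # []
print(broker.delivered["persistent"])  # ['m2', 'm3']
```

The clean-session client loses everything published while it was away. The persistent client gets messages back, but only as many as the queue depth allows, which is the storage cost mentioned above.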

robmarcer commented 2 months ago

This has also been reported by https://app-eu1.hubspot.com/contacts/26586079/record/0-1/3995301

SynoUser-NL commented 2 months ago

Hi,

We are using a Project Call node (one at this time, though we plan to have several) to send messages to a Node-RED instance running on a Windows server under the FlowFuse Agent. A flow on the Windows server instance runs PowerShell scripts, whose output is sent back to the calling flow over the Project Out return.

We have been experiencing a problem of return messages suddenly not being delivered for some time now. It has happened across multiple versions of Node-RED, multiple versions of the Project nodes, and multiple versions of Node.js (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or some number of messages (?) through the Project Call node. I'm hesitant to say this, though, because I've experienced a stall after just 7 messages, while I've also seen it do 60+ without a problem. A (manual) restart of the calling instance (where the Project Call node is) solves the problem. Obviously, this isn't desirable in a production environment.

We can see that the PowerShell scripts appear to run from start to finish (according to the logging), but the return message sent after a script is done is not received by the Project Call node once a stall has happened. The output of the node that starts the PowerShell script is connected directly (both stdout and stderr) to the Project Out return node. The Project Call node times out when no return message is received. Other than that, there is no indication that anything is wrong.

Last week I implemented a test flow on the instance running on the Windows server. On a (manual) inject in the NR instance from which all project calls are made, it sets a timestamp, sends it over a Project Call node to the Windows server instance, sets a timestamp there, and returns to the calling project node, where the time difference is calculated. When the Project Call node that is responsible for running the PowerShell scripts appears to stall, this test flow keeps returning messages. So it appears that not all project connections are affected when one stalls. This also means there is no connection problem (MQTT or otherwise) at the time of a stall.
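
The bookkeeping behind a test like this can be sketched in Python. This is a rough illustration only: the `CallTracker` class, the call ids and the timeout value are hypothetical and are not how the Project Call node works internally.

```python
class CallTracker:
    """Track outstanding calls so round trips and missing returns can be logged."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.pending = {}  # call_id -> time the call was sent

    def sent(self, call_id, now):
        self.pending[call_id] = now

    def returned(self, call_id, now):
        """Return the round-trip time in seconds, or None if the call is not pending."""
        started = self.pending.pop(call_id, None)
        return None if started is None else now - started

    def expire(self, now):
        """Return and forget the ids of calls with no reply within the timeout."""
        late = [cid for cid, t in self.pending.items() if now - t >= self.timeout_s]
        for cid in late:
            del self.pending[cid]
        return late
```

Keeping one tracker per Project Call node would show which link stops returning messages while the others continue to report round-trip times.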

Last Friday, I updated the Node.js version on the server we're running the FlowFuse Agent on (and where the Project In and Out nodes live). It is now running Node.js 20.12.2 and NR 3.1.9, and we have not seen any stalls yet. I'm also restarting the calling instance every morning at 6.00 hrs to (hopefully) prevent any issues.

It is too early to say anything definitive, though, because usage has been low due to the holidays. And to be clear: messages sent to the Windows instance over the Project Call node are always received; it's only the returns we occasionally have problems with.

Yesterday evening I saw there was an update to Project Nodes (version 0.6.4) which I installed on all instances (which I did have to do twice on all instances, strangely enough..).

Hope all this helps with troubleshooting. I'm not sure if there is anything more I can do, but if I can be of any assistance with further information please let me know.

Thanks!

knolleary commented 2 months ago

@SynoUser-NL Thanks for the information. The 0.6.4 release included the fix for a specific issue where messages would not be queued up for the nodes if they had temporarily dropped their connection.

From your description, that doesn't quite feel the same symptom - unless the nodes are disconnecting under the covers; would be good to check the Node-RED logs for any suggestion of a disconnect.

Let us know how you get on with 0.6.4 - if the problem persists we'll get a new issue raised to focus on your scenario.

SynoUser-NL commented 2 months ago

> check the Node-RED logs for any suggestion of a disconnect.

@knolleary You're welcome, of course. The logs show no signs of a disconnect on either end.

I agree, this doesn't quite feel like the same issue. Next week will be a lot busier again, so I hope to be able to give some more information on this then.

Thanks!

SynoUser-NL commented 1 month ago

Hi,

I'm sorry to say it appears we're still experiencing problems: Project node replies sometimes stop coming through. The only remedy when that happens is to restart the layer from which the project calls originate. Meanwhile, messages sent via a second project call (to the same instance as the one where the other call stalls) keep working perfectly.

How would we proceed from here to find a permanent solution?

Thanks, Den
