FlowFuse / nr-project-nodes

A set of Node-RED nodes for inter-project communication within the FlowFuse platform
Apache License 2.0

Project Call nodes stalling #74

Open knolleary opened 1 month ago

knolleary commented 1 month ago

Current Behavior

Reported by a self-hosted customer - originally here: https://github.com/FlowFuse/nr-project-nodes/issues/68#issuecomment-2099960179

We have been experiencing a problem of return messages suddenly not being delivered for some time now. Multiple versions of NodeRED, multiple versions of Project nodes, multiple versions of NodeJS (on the Windows server). We are unable to reliably recreate the problem. It appears to surface after some time of usage or # of messages (?) of the Project Call node. But hesitant to say this because I've experienced a stall after just 7 messages, while I've also seen it do 60+ without a problem. A (manual) restart of the calling instance (where the Project Call node is) solves the problem. Obviously, this isn't desirable in a production environment.

With an update from the end of last week:

It does appear we are getting a time-out from the project call node when no return is received. The problem also appears to rear its head more when multiple people are working with the Dashboard front-end (triggering flows that use the project calls to start Powershell scripts on the Windows based NR instance), and in the case of yesterday he was constantly starting actions. This leads us to believe the number of messages sent over a project call node plays a role here.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

Linked Customers

Steve-Mcl commented 1 month ago

For context, see discussion originally posted here: https://github.com/FlowFuse/nr-project-nodes/issues/68#issuecomment-2099960179

We are using a (1 at this time, we plan to have several) Project Call node to send msg's to a NodeRED instance that is running on a Windows server, running FlowFuse Agent. A flow on the Windows server instance is used to run Powershell scripts, which also have an output that is sent back to the calling flow over the Project Out return.

We have been experiencing a problem of return messages suddenly not being delivered for some time now

And the only remedy when that happens is to restart the layer from which the project calls originate, while messages sent via a second project call (to the same instance as where the other one stalls) keep working perfectly.

@SynoUser-NL

Would you be able to share demo flows from the projectlink-call and the subroutine (link-in~...~link-return) project?

Assuming it is not a huge amount, please include all nodes leading up to the projectlink-call and beyond AND all nodes between the link-in~...~link-return nodes. Please also include any debug nodes you have added that you use for verifying the call was sent/received/returned (though do be sure to sanitise or obfuscate anything sensitive).

Thanks, Steve.

knolleary commented 1 month ago

Adding another update provided by the customer:

We are now able to see which message was sent last (using a queue) to the Project Call node that times out. I've also enabled a Dashboard on the Windows instance that shows us the last message sent to the Project Out\Return node. The time-out flow now triggers a message to a Project Test call (to the same Windows instance) and adds the message return time. When the main Project Call node stops responding, the test Project Call remains in working order.

It appears to us at this time that the Project Call node sometimes does not "catch" the return message sent by a Project Out return node. Thus failing to resume the flow that is built.
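To make the "missed return" hypothesis concrete, here is a minimal sketch of how a call/return pair is commonly correlated: each outgoing call gets an ID, and a return is only delivered if a call with that ID is still waiting. This is an illustration under assumptions, not the nr-project-nodes implementation; the `PendingCalls` class and its method names are hypothetical.

```javascript
// Hypothetical sketch of call/return correlation with a timeout.
// NOT the actual nr-project-nodes code; all names here are invented.
class PendingCalls {
  constructor() {
    // correlation ID -> { resolve, timer } for calls still awaiting a return
    this.pending = new Map();
    this.nextId = 1;
  }

  // Register an outgoing call. Returns the correlation ID to send with the
  // message and a promise that settles on return or on timeout.
  start(timeoutMs) {
    const id = String(this.nextId++);
    const promise = new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`call ${id} timed out after ${timeoutMs} ms`));
      }, timeoutMs);
      this.pending.set(id, { resolve, timer });
    });
    return { id, promise };
  }

  // Deliver a return message. Returns false when no call is waiting for
  // this ID (late, duplicate, or unknown return), so the caller can log it.
  complete(id, payload) {
    const entry = this.pending.get(id);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    entry.resolve(payload);
    return true;
  }
}
```

In a scheme like this, logging every `complete` call that returns false would show whether returns are arriving late (after the timeout) or not arriving at all, which is the distinction the reports above cannot yet make.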

robmarcer commented 1 day ago

Some feedback from a FlowFuse user - https://app-eu1.hubspot.com/contacts/26586079/record/0-1/8977201

Hi Support,

I am confident this is the same as an existing customer reported problem here: "Project Call nodes stalling #74" - https://github.com/orgs/FlowFuse/projects/1?pane=issue&itemId=64911130

I just wanted to register an interest in the resolution. Also perhaps I can add some extra information, though you decide.

I've traced this through from logs on the ff-agent device and logs on the ultimate endpoint.

Here's a snippet to illustrate, from the agent hosted node-red that makes the link call:

2024/07/03 10:44:34Z WARN FlowFuse: Disconnected
2024/07/03 10:44:35Z INFO FlowFuse: Connected
2024/07/03 10:45:30Z WARN flowfuse server not answering
2024/07/03 10:45:30Z INFO Stored telemetry for replay 2024-07-03T10:45:00.000Z
2024/07/03 10:49:30Z INFO Flowfuse server: We'll try again...
2024/07/03 10:49:30Z INFO Re-submitted telemetry for 2024-07-03T10:45:00.000Z

I left the link call with a rather generous 30 second timeout. The 10:45:30Z WARN entry above marks when the link code returned to the flow with a timeout that I catch. The posting would have occurred at 10:45:00 (ish).

I can tell you that in this case ff-cloud delivered the original posting to the endpoint at 10:45:14 - this in itself is unusual, as normally that data passes through without meaningful delay (within the same second; on the endpoint I do not have access to finer grained times). Unfortunately I need to do some work on the server instance to better log when something occurs that is not to plan, so I cannot yet tell you if there was a long delay in the posting hitting the link-in node.

The 10:49:30 posting completed in normal time, no delays.

These are not isolated incidents, and they seem more common than a couple of months back (though I may not have been looking closely enough, so I am not sure).

I updated the project nodes packages to 0.7.0. NR is v3.1.10 on ff-cloud and, in the example above, v3.1.9 on the agent.

Please let me know if you find a resolution for this.

The double postings are not too problematic right now, but ultimately I need to stop them. So the link response disappearing means I cannot know what was or was not actioned.
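Until the lost responses are resolved, one possible way to make the double postings harmless is to deduplicate on the receiving side. The sketch below assumes each posting carries a stable identifier (for example, the telemetry timestamp shown in the log above); `createDeduplicator` and `isFirstDelivery` are hypothetical names, not part of FlowFuse or Node-RED.

```javascript
// Hypothetical helper: remembers recently seen posting IDs so that a
// re-submitted posting is actioned only once. Not a FlowFuse API.
function createDeduplicator(maxEntries = 1000) {
  // A Map iterates in insertion order, so the first key is the oldest.
  const seen = new Map();
  return function isFirstDelivery(id, now = Date.now()) {
    if (seen.has(id)) return false; // already actioned: skip the retry
    seen.set(id, now);
    if (seen.size > maxEntries) {
      // Evict the oldest ID to keep memory bounded.
      seen.delete(seen.keys().next().value);
    }
    return true;
  };
}

// Usage: action the posting only on its first delivery.
const isFirstDelivery = createDeduplicator();
for (const id of ['2024-07-03T10:45:00.000Z', '2024-07-03T10:45:00.000Z']) {
  console.log(id, isFirstDelivery(id) ? 'actioned' : 'duplicate, skipped');
}
```

This does not explain or fix the stalled return; it only means a retry after a timeout cannot cause the same posting to be actioned twice.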