Closed by knolleary 3 months ago
@robmarcer could you please share your workflow? For example, were you using devices? App-bound or instance-bound devices? Was this on staging or production?
Would you still have this environment somewhere?
So far, I am unable to recreate this in a local env.
For anyone reading along.
In local dev, I have tried with multiple projects, with project links not firing, with links firing fast/small/large messages continuously while restarting, reloading, deploying (at both sides of project links) many many times over.
I have thus far failed to recreate this.
This led me to the thought that perhaps a device (or instance) was firing up 2 identical Node-RED instances and that was the cause of the multiple simultaneous connections. However, comments on https://github.com/FlowFuse/device-agent/pull/236 effectively rule out devices, and potentially instances running localfs (since port number reuse prevents a 2nd instance).
This leaves me with instances in docker/k8s firing up duplicates, either by way of an nr-launcher issue or some other timing issue, OR a tricky bug in the project nodes themselves (one which, as mentioned above, I have so far failed to recreate).
To avoid spending much more time on this, I will wait for Rob's feedback on what he can remember about the situation.
(unless anyone else can see what I am not?)
This is a link to the Application in @robmarcer 's team that was showing the problem.
https://app.flowfuse.com/team/ff-demo-2024/applications
I can't remember off the top of my head if it was the device or the cloud application that had the problem. I think it was the cloud instance called HMI-Development.
Let me know if you need me to use the admin user to invite you to that team
Hello
The error shows on the cloud instance. It tends to be ok for an hour or two after I restart it (via the FF interface). If anyone needs access to the project please ask and I'll add you to the team.
Please add me @robmarcer
Can you clarify where in the flows this occurred, and whether it was witnessed instance-to-instance, or will I need to set up devices etc?
Any basic step-by-step instructions you can add will help narrow it down, Rob (including the type of deploy - full/flow/node, approximately how frequently changes/deploys were made, and any other specific conditions).
Thanks.
I've added you.
The issue happens here - https://hmi-development.flowfuse.cloud/?#flow/975f5ba20d719fd3
After a while, the project out and in nodes at the top of that flow disconnect. I think it actually breaks during a deploy, but I can't be sure that's the case.
Since around 9:30 I have prodded and poked this instance and D1 (one of the attached devices) and cannot get this to fail :(
Signing off for the day soon, will come back to this tomorrow.
If anyone knows how we can get additional instance logs (I checked production loki logs but am not seeing anything relevant) for the times between 05:25 ~ 05:32 on 31-01-2024, please ping me.
@Steve-Mcl did you get a chance to revisit this at all?
I did, but I simply cannot make it occur, which makes me think this issue is the result of external happenings or it has been fixed.
If Rob can remember further details or can recreate it, we can jump back on it.
If I see the issue again I'll leave everything in a broken state so you can take a look.
Will assign to Rob and get this off @Steve-Mcl's desk for now.
There were 2 more recent customer support tickets raised that match this issue. As a result, I tracked this down and a PR is incoming.
After littering the code with logging, adding UUIDs to the connection objects, adding test code & delays, forced network outages, timely MQTT restarts and DOS, I managed to re-create this issue reasonably reliably (~50% of the time).
Quite simple (and understandable to a degree) in the end:
An alternative to this was:
Due to the many moving parts (a deploy, a network outage/delay, an MQTT restart, an MQTT auth delay, the MQTT client reconnect timing, etc.) I added 2 means of containment:

1. Whenever a `connect`/`disconnect`/`reconnect` event occurs, scan the array and force-end any that are NOT the current connection (i.e. `!==` to the client stored in the internal mqtt singleton client member property).
2. Once `end()` has been called (successfully or otherwise), there now also is a final call to `client.end(force)` - it makes no difference and will 99.9% of the time have ended cleanly, but in the event of internal state issues with the client connection this final call absolutely ensures the client never tries to reconnect.

With logging (and a fair bit of lucky timing), I can catch both 1 & 2, and (so far) the containment they provide rectifies the condition to a single connection.
Verification
@hardillb, Any idea how to test this on staging? As it is, staging loads (for example) "@flowfuse/nr-project-nodes": "0.6.3-dacd505-202403201005.0"
Locally, these have been hammered to death as per above info
The version with the fix is 0.6.2, staging increments the patch level by one when it does a build, so that should have the fix.
Can't really say how to get it to fail specifically on staging
Not sure how to verify this.
External customer with similar problem has reported they are no longer seeing the problem.
@robmarcer have you tested your environment with this recently?
0.6.2 has been in the wild for around 3 weeks and has been downloaded > 1500 times.
Coupled with the positive feedback and the extremely tricky timing needed to catch/recreate the original issue, I think we can safely move this into the "done" column @hardillb?
@Steve-Mcl yes
I've not observed this problem for weeks even though I'm still using the same demo which first triggered it.
Current Behavior
Reported by @robmarcer, whilst developing flows and deploying frequently, the Project nodes started getting into a connect/disconnect cycle.
Checking the broker logs shows the client is attempting to make multiple connections to the broker - causing the older connection to be kicked off, which then reconnects and kicks the other connection off and so the cycle continues.
Restarting the instance 'fixes' it.
There must be a timing window around the connection handling where we aren't closing a connection properly.
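The kick-off cycle seen in the broker logs can be illustrated with a toy simulation. The `Broker` class and client IDs below are purely hypothetical, not the FlowFuse broker; the cycle is capped so the demo terminates.

```javascript
// Toy simulation of the symptom: two live clients sharing one MQTT client
// ID repeatedly kick each other off the broker.
class Broker {
    constructor () {
        this.sessions = new Map()
    }

    connect (clientId, client) {
        // MQTT session takeover: a new connection with the same client ID
        // causes the broker to disconnect the existing session
        const existing = this.sessions.get(clientId)
        this.sessions.set(clientId, client)
        if (existing && existing !== client) existing.onKicked()
    }
}

let kicks = 0
const CAP = 6   // stop the otherwise-endless cycle for demonstration

function makeClient (broker, id) {
    const client = {
        onKicked () {
            kicks++
            // the kicked client auto-reconnects, kicking the other in turn
            if (kicks < CAP) broker.connect(id, client)
        }
    }
    return client
}

const broker = new Broker()
const a = makeClient(broker, 'project-node-1')
const b = makeClient(broker, 'project-node-1')
broker.connect('project-node-1', a)
broker.connect('project-node-1', b)   // starts the kick/reconnect cycle

console.log(kicks)  // 6 - the cycle only stops because we capped it
```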
Expected Behavior
No response
Steps To Reproduce
No response
Environment