Closed by knolleary 3 months ago
@robmarcer could you please share your workflow? For example, were you using devices? App-bound or instance-bound devices? Was this on staging or production?
Would you still have this environment somewhere?
So far, I am unable to recreate this in a local env.
For anyone reading along.
In local dev, I have tried with multiple projects, with project links not firing, with links firing fast/small/large messages continuously while restarting, reloading, deploying (at both sides of project links) many many times over.
I have thus far failed to recreate this.
This led me to the thought that perhaps a device (or instance) was firing up 2 identical Node-RED instances and that was the cause of the multiple simultaneous connections. However, comments on https://github.com/FlowFuse/device-agent/pull/236 effectively rule out devices, and potentially instances running localfs (since port number reuse prevents a 2nd instance).
This leaves me with instances in docker/k8s firing up duplicates, either by way of an nr-launcher issue or some other timing issue, OR a tricky bug in the project nodes themselves (one which, as mentioned above, I have so far failed to recreate).
To avoid spending much more time on this, I will wait for Rob's feedback on what he can remember about the situation.
(unless anyone else can see what I am not?)
This is a link to the Application in @robmarcer 's team that was showing the problem.
https://app.flowfuse.com/team/ff-demo-2024/applications
I can't remember off the top of my head if it was the device or the cloud application that had the problem. I think it was the cloud instance called HMI-Development.
Let me know if you need me to use the admin user to invite you to that team
Hello
The error shows on the cloud instance. It tends to be ok for an hour or two after I restart it (via the FF interface). If anyone needs access to the project please ask and I'll add you to the team.
Please add me @robmarcer
Can you clarify where in the flows this occurred, and whether it was witnessed instance-to-instance, or will I need to set up devices etc?
Any basic step-by-step instructions you can add will help narrow it down, Rob (including the type of deploy - full/flow/node, approximately how frequently changes/deploys were made, and any other specific conditions).
Thanks.
I've added you.
The issue happens here - https://hmi-development.flowfuse.cloud/?#flow/975f5ba20d719fd3
After a while, the project out and in nodes at the top of that flow disconnect. I think it actually breaks during a deploy, but I can't be sure that's the case.
Since around 9:30 I have prodded and poked this instance and D1 (one of the attached devices) and cannot get this to fail :(
Signing off for the day soon, will come back to this tomorrow.
If anyone knows how we can get additional instance logs (I checked production loki logs but am not seeing anything relevant) for the times between 05:25 ~ 05:32 on 31-01-2024, please ping me.
@Steve-Mcl did you get a chance to revisit this at all?
I did, but I simply cannot make it occur, which makes me think this issue is the result of external happenings or it has been fixed.
If Rob can remember further details or can recreate it, we can jump back on it.
If I see the issue again I'll leave everything in a broken state so you can take a look.
Will assign to Rob and get this off @Steve-Mcl's desk for now.
There were 2 more recent customer support tickets raised that match this issue. As a result, I tracked this down and a PR is incoming.
After littering the code with logging, adding UUIDs to the connection objects, adding test code & delays, forced network outages, timely MQTT restarts and DOS, I managed to re-create this issue reasonably reliably (~50% of the time).
Quite simple (and understandable to a degree) in the end:
An alternative to this was:
Due to the many moving parts (a deploy, a network outage/delay, an MQTT restart, an MQTT auth delay, the MQTT client reconnect timing, etc.) I added 2 means of containment:

1. Whenever a `connect`/`disconnect`/`reconnect` event occurs, scan the array and force-end any that are NOT the current connection (i.e. `!==` to the client stored in the internal mqtt singleton client member property).
2. Once `end()` has been called (successfully or otherwise), there now also is a final call to `client.end(force)` - it makes no difference and will 99.9% of the time have ended cleanly, but in the event of internal state issues with the client connection this final call absolutely ensures the client never tries to reconnect.

With logging (and a fair bit of lucky timing), I can catch both 1 & 2, and (so far) the containment they provide rectifies the condition to a single connection.
Verification
@hardillb, Any idea how to test this on staging? As it is, staging loads (for example) "@flowfuse/nr-project-nodes": "0.6.3-dacd505-202403201005.0"
Locally, these have been hammered to death as per above info
The version with the fix is 0.6.2, staging increments the patch level by one when it does a build, so that should have the fix.
Can't really say how to get it to fail specifically on staging
Not sure how to verify this.
External customer with similar problem has reported they are no longer seeing the problem.
@robmarcer have you tested your environment with this recently?
0.6.2 has been in the wild for around 3 weeks and has been downloaded > 1500 times.
Coupled with the positive feedback and the extremely tricky timing needed to catch/recreate the original issue, I think we can safely move this into the "done" column @hardillb?
@Steve-Mcl yes
I've not observed this problem for weeks even though I'm still using the same demo which first triggered it.
Current Behavior
Reported by @robmarcer, whilst developing flows and deploying frequently, the Project nodes started getting into a connect/disconnect cycle.
Checking the broker logs shows the client is attempting to make multiple connections to the broker - causing the older connection to be kicked off, which then reconnects and kicks the other connection off and so the cycle continues.
Restarting the instance 'fixes' it.
There must be a timing window around the connection handling where we aren't closing a connection properly.
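The kick-off cycle seen in the broker logs can be illustrated with a toy simulation. The `Broker` class and client IDs below are purely hypothetical, not the FlowFuse broker; the cycle is capped so the demo terminates.

```javascript
// Toy simulation of the symptom: two live clients sharing one MQTT client
// ID repeatedly kick each other off the broker.
class Broker {
    constructor () {
        this.sessions = new Map()
    }

    connect (clientId, client) {
        // MQTT session takeover: a new connection with the same client ID
        // causes the broker to disconnect the existing session
        const existing = this.sessions.get(clientId)
        this.sessions.set(clientId, client)
        if (existing && existing !== client) existing.onKicked()
    }
}

let kicks = 0
const CAP = 6   // stop the otherwise-endless cycle for demonstration

function makeClient (broker, id) {
    const client = {
        onKicked () {
            kicks++
            // the kicked client auto-reconnects, kicking the other in turn
            if (kicks < CAP) broker.connect(id, client)
        }
    }
    return client
}

const broker = new Broker()
const a = makeClient(broker, 'project-node-1')
const b = makeClient(broker, 'project-node-1')
broker.connect('project-node-1', a)
broker.connect('project-node-1', b)   // starts the kick/reconnect cycle

console.log(kicks)  // 6 - the cycle only stops because we capped it
```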
Expected Behavior
No response
Steps To Reproduce
No response
Environment