Closed sentry-io[bot] closed 9 months ago
@Steve-Mcl this might be one you can shed some light on, we occasionally get updates on the devices comms handler where the status message contains a null status.status
.
We need to track down why this keeps happening. Perhaps a first step is to handle status being null and log extra information to help us debug?
I suspect this is a timing thing coupled with the fact we don't sanity check the payload before sending status
IMO, the question becomes: In the event of the agent being busy, should we:
a) defer the status publish until the "update" is done and a stable state is available?
b) guard and not transmit null
c) interpret null
as "updating" state
d) split out the logic from getState
to permit MQTT comms to get actual state (regardless of it being transitional)
(note: the reason getState
returns null
is due to other internal logic, so we cannot alter that without repercussions)
Additional Note:
There are 2 other places in mqtt.js where publish(statusTopic,...)
is called. These ARE guarded / sanity checked.
Sentry issue: FLOWFUSE-BACKEND-2D
This issue is now occurring on average 75 times a day and over 5k times a month. Flagging for re-prioritisation, though note it's not causing any problems we can currently see... @knolleary
This exception has continued to climb in frequency, now up to 632 a day and 25k per month @MarianRaphael @knolleary
While is it not causing problems, it does feel like a state that should be got to the bottom on and to confirm @Steve-Mcl's proposals above!
We have a PR from @Steve-Mcl (#3356) that turns this error into a log message. However I believe that is just hiding the issue - we shouldn't be getting to that point of the code for a device in the state we believe it to be. I have asked Steve to look again.
Separately, we have identified a number of orphaned devices that we can delete - however due to the way the broker acl checks work, those devices won't get kicked off, so we still need to ensure the platform properly rejects/ignores such devices.
Sorry, I got my device/sentry issues mixed up. The PR from Steve was for a different issue - one we haven't seen in production for over a month.
We still need to do something about this one.
Regardless of what changes might be made to the device agent, that misses the more immediate issue that we're getting 100s of uncaught exceptions every day.
We need to handle a null
status message properly in the app. I'll look at that now.
Sentry Issue: FLOWFUSE-BACKEND-1W