FlowFuse / flowfuse

Connect, collect, transform, visualise, and interact with your Industrial Data in a single platform. Use FlowFuse to manage, scale and secure your Node-RED solutions.
https://flowfuse.com
Other
279 stars 63 forks source link

Device Status update without state #3156

Closed sentry-io[bot] closed 9 months ago

sentry-io[bot] commented 11 months ago

Sentry Issue: FLOWFUSE-BACKEND-1W

TypeError: Cannot read properties of null (reading 'state')
  File "/usr/src/forge/app/node_modules/@flowforge/flowforge/forge/db/controllers/Device.js", line 14, in Object.updateState
    if (state.state) {
  File "/usr/src/forge/app/node_modules/@flowforge/flowforge/forge/comms/devices.js", line 75, in DeviceCommsHandler.handleStatus
    await this.app.db.controllers.Device.updateState(device, payload)
  File "node:internal/process/task_queues", line 95, in process.processTicksAndRejections
Pezmc commented 11 months ago

@Steve-Mcl this might be one you can shed some light on, we occasionally get updates on the devices comms handler where the status message contains a null status.status.

https://github.com/FlowFuse/flowfuse/blob/4b67a6d19501a1bcd28b0ae622f0aedcef076b09/forge/comms/devices.js#L74-L75

We need to track down why this keeps happening. Perhaps a first step is to handle status being null and log extra information to help us debug?

Steve-Mcl commented 11 months ago

I suspect this is a timing thing coupled with the fact we don't sanity check the payload before sending status

  1. upon MQTT connection, client publishes state:
  2. get state is called but it might be busy, so returns null

IMO, the question becomes: In the event of the agent being busy, should we:

a) defer the status publish until the "update" is done and a stable state is available?

b) guard and not transmit null

c) interpret null as "updating" state

d) split out the logic from getState to permit MQTT comms to get actual state (regardless of it being transitional)

(note: the reason getState returns null is due to other internal logic, so we cannot alter that without repercussions)


Additional Note:

There are 2 other places in mqtt.js where publish(statusTopic,...) is called. These ARE guarded / sanity checked.

sentry-io[bot] commented 10 months ago

Sentry issue: FLOWFUSE-BACKEND-2D

Pezmc commented 10 months ago

This issue is now occurring on average 75 times a day and over 5k times a month. Flagging for re-prioritisation, though note it's not causing any problems we can currently see... @knolleary

Pezmc commented 9 months ago

This exception has continued to climb in frequency, now up to 632 a day and 25k per month @MarianRaphael @knolleary

While is it not causing problems, it does feel like a state that should be got to the bottom on and to confirm @Steve-Mcl's proposals above!

knolleary commented 9 months ago

We have a PR from @Steve-Mcl (#3356) that turns this error into a log message. However I believe that is just hiding the issue - we shouldn't be getting to that point of the code for a device in the state we believe it to be. I have asked Steve to look again.

Separately, we have identified a number of orphaned devices that we can delete - however due to the way the broker acl checks work, those devices won't get kicked off, so we still need to ensure the platform properly rejects/ignores such devices.

knolleary commented 9 months ago

Sorry, I got my device/sentry issues mixed up. The PR from Steve was for a different issue - one we haven't seen in production for over a month.

We still need to do something about this one.

knolleary commented 9 months ago

Regardless of what changes might be made to the device agent, that misses the more immediate issue that we're getting 100s of uncaught exceptions every day.

We need to handle a null status message properly in the app. I'll look at that now.