andrei-tatar / node-red-contrib-smartnora

Node Red Google Home integration
https://smart-nora.eu/
107 stars 23 forks source link

Devices not syncing properly #198

Closed normanr closed 1 year ago

normanr commented 1 year ago

I'm using Deploy "Modified Flows", and getting a weird behavior where ~most devices are removed and only the changed device remains:

12 Jul 03:27:13 - [info] nora: <default> - disconnected
12 Jul 03:27:14 - [info] nora: <default> - connected
12 Jul 03:27:19 - [info] nora: <default> - synced 16 device(s)
12 Jul 03:40:26 - [info] nora: <default> - disconnected
12 Jul 03:40:27 - [info] nora: <default> - connected
12 Jul 03:40:31 - [info] nora: <default> - no device changes
12 Jul 04:42:27 - [info] nora: <default> - disconnected
12 Jul 04:42:27 - [info] nora: <default> - connected
12 Jul 04:42:32 - [info] nora: <default> - no device changes
[...]
12 Jul 08:14:15 - [info] Stopping modified flows
12 Jul 08:14:15 - [info] Stopped modified flows
12 Jul 08:14:15 - [info] Starting modified flows
12 Jul 08:14:15 - [info] Started modified flows
12 Jul 08:14:16 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
12 Jul 08:14:16 - [info] nora: <default> - disconnected
12 Jul 08:14:16 - [info] nora: <default> - connected
12 Jul 08:14:22 - [info] nora: <default> - synced 13 device(s)            
12 Jul 08:14:26 - [info] nora: <default> - synced 1 device(s)

This didn't happen ~months ago (when I was doing my initial setup), and is something new. I only have a single noraf-config node.

normanr commented 1 year ago

This was running version 1.13.0. I see 1.13.8 is available, but I don't see anything in the release notes (for 1.13.0 --> 1.13.8) that would address this.

I think this may have started when I imported the async example flow (and then deleted the example config node, and changed the example node to use my config), not sure if that could cause it or not.

normanr commented 1 year ago

I had to restart the Node-RED service to upgrade to 1.13.8, and I can't reproduce the issue. It's also the first time the service was restarted since I imported the async example flow, so I suspect that there's an edge case with importing a different config that might trigger the issue (and that restarting Node-RED cleans out the confusion).

andrei-tatar commented 1 year ago

So a single noraf-config node resulting in 2 separate sync calls for the same group name. Interesting. I'll check if there might've been some bug. I don't I fixed something related to this since 1.13.0

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

normanr commented 1 year ago

This happened again. I wonder if something in the disconnect/connect triggers the issue, because I can't reproduce it after restarting Node-Red manually. Does the FirebaseSync class instance get re-initialized, and then only a subset of devices re-register? (and then gets re-initialized again and the other devices register?)

# journalctl -b -u nodered.service -g 'graphical|nora' | cut -d' ' -f 6-
2021-10-30 05:43:13 PDT, ends at Wed 2023-03-01 18:32:37 PST. --
Started Node-RED graphical event wiring tool.
21 Feb 15:18:19 - [error] nora: Error: getaddrinfo EAI_AGAIN api.smart-nora.eu
21 Feb 15:18:19 - [warn] nora: trying again in 86.5 sec
21 Feb 15:20:06 - [error] nora: Error: getaddrinfo EAI_AGAIN api.smart-nora.eu
21 Feb 15:20:06 - [warn] nora: trying again in 84.5 sec
21 Feb 22:16:07 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
21 Feb 22:16:07 - [info] nora: <default> - disconnected
21 Feb 22:16:08 - [info] nora: <default> - connected
21 Feb 22:16:13 - [info] nora: <default> - synced 18 device(s)
23 Feb 13:13:09 - [info] nora: <default> - disconnected
23 Feb 13:13:09 - [info] nora: <default> - connected
23 Feb 13:13:14 - [info] nora: <default> - no device changes
25 Feb 14:17:27 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
25 Feb 14:17:27 - [info] nora: <default> - disconnected
25 Feb 14:17:27 - [info] nora: <default> - connected
25 Feb 14:17:37 - [info] nora: <default> - synced 19 device(s)
25 Feb 15:21:47 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
25 Feb 15:21:47 - [info] nora: <default> - disconnected
25 Feb 15:21:47 - [info] nora: <default> - connected
25 Feb 15:21:51 - [info] nora: <default> - synced 18 device(s)
25 Feb 15:21:57 - [info] nora: <default> - synced 1 device(s)
25 Feb 15:22:07 - [info] nora: <default> - connected
25 Feb 15:22:18 - [info] nora: <default> - synced 19 device(s)
27 Feb 07:04:35 - [info] nora: <default> - disconnected
27 Feb 07:04:36 - [info] nora: <default> - connected
27 Feb 07:04:40 - [info] nora: <default> - no device changes
1 Mar 01:24:42 - [info] nora: <default> - disconnected
1 Mar 01:24:43 - [info] nora: <default> - connected
1 Mar 01:24:46 - [info] nora: <default> - no device changes
1 Mar 11:57:30 - [info] nora: <default> - disconnected
1 Mar 11:57:30 - [info] nora: <default> - connected
1 Mar 11:57:37 - [info] nora: <default> - no device changes
1 Mar 17:59:11 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
1 Mar 17:59:11 - [info] nora: <default> - disconnected
1 Mar 17:59:12 - [info] nora: <default> - connected
1 Mar 17:59:14 - [info] nora: <default> - synced 16 device(s)
1 Mar 17:59:17 - [info] nora: <default> - synced 3 device(s)
1 Mar 18:00:57 - [info] nora: <default> - connected
1 Mar 18:01:03 - [info] nora: <default> - synced 19 device(s)

I wonder if it's caused by the reconnect, and then the partial deploy?

andrei-tatar commented 1 year ago

@normanr I don't think it is. Was there any redeploy of the flows in between? Do you have multiple NORA config nodes that use the same group name? Do you use subflows?

andrei-tatar commented 1 year ago

A bit more detail:

normanr commented 1 year ago

Was there any redeploy of the flows in between?

Do you mean redeploy of non-nora nodes? Possibly. Looking at the logs deploying nora nodes causes a sync, but I don't see log entries for deploys.

One think I noticed is that after the weird behavior, there's no disconnect event before the next connect event. There's always an authenticated event just before (but that's the same for "good" partial deployments too).

Do you have multiple NORA config nodes that use the same group name?

no. I have a single config node with no group name, and a single Node-Red instance.

Do you use subflows?

yes, but I keep all nora nodes outside of subflows, because they didn't seem to like being in a subflow.

after the registry stabilizes (some few hundreds msec without changes) the http call is done to actually sync all the devices

The logs indicate several seconds between the two bad sync events. If things were slow I'd expect the 2nd sync to have all the devices, not the devices that were omitted from the first sync.

fwiw, I'm running Node-Red on a Pi 4 with 8G ram. It's not heavily used (only used for Node-RED, zigbee2mqtt, plenty of free ram, etc), so I don't think this is caused by the hardware being slow/overloaded.

Is there a way to get additional debugging? Something like per-device registration/deregistration with the FirebaseSync instance?

andrei-tatar commented 1 year ago

The logs indicate several seconds between the two bad sync events. If things were slow I'd expect the 2nd sync to have all the devices, not the devices that were omitted from the first sync.

Yeah, that's the really weird part.

fwiw, I'm running Node-Red on a Pi 4 with 8G ram. It's not heavily used (only used for Node-RED, zigbee2mqtt, plenty of free ram, etc), so I don't think this is caused by the hardware being slow/overloaded.

I'm also running on Pi4, pretty much same stuff as you (node-red, nora, zigbee2mqtt, etc.) and never saw this. I do not have any subflows however.

Is there a way to get additional debugging? Something like per-device registration/deregistration with the FirebaseSync instance?

I guess you can add some extra log calls. If you clone nora on your rpi, you could npm link it to the node-red instance so it will run with the changes you make. Let me know if you need any guidance.

The main places are in sync.ts, lines 136, 137: image

diff --git a/src/nora/sync.ts b/src/nora/sync.ts
index 2b64999..2ed7391 100644
--- a/src/nora/sync.ts
+++ b/src/nora/sync.ts
@@ -134,7 +134,11 @@ export class FirebaseSync {
                     : new FirebaseDevice<T>(cloudId, this, device, this.logger, disableValidationErrors);
             observer.next(firebaseDevice);
             this.devices$.next(this.devices$.value.concat(firebaseDevice as any));
-            return () => this.devices$.next(this.devices$.value.filter(d => d !== firebaseDevice));
+            this.logger?.info(`device ${device.id} registered`);
+            return () => {
+                this.devices$.next(this.devices$.value.filter(d => d !== firebaseDevice));
+                this.logger?.info(`device ${device.id} unregistered`);
+            };
         }).pipe(
             switchMap(d => ctx
                 ? merge(
normanr commented 1 year ago

I managed to patch the logging change (I cloned and built the project locally, and diffed the built files and applied the patch on my Pi - still running NR 3.0.2 and SN 1.17.2 because I didn't want to pick up any recent changes that might make it impossible to reproduce). I couldn't reproduce the issue anyway (before or after the patch), so I guess now it's a waiting game to see what happens.

Other observation about the logs:

I wonder if a delay on the Node-Red side might be causing this, something like:

I've also added app.name to the logging to see if this is the case.

normanr commented 1 year ago

Here's the log with precise times from the 1 Mar event:

17:59:10.708772 - [info] Stopping modified nodes
17:59:10.715247 - [info] Stopped modified nodes
17:59:10.715994 - [info] Updated flows
17:59:10.733892 - [info] Starting modified nodes
17:59:10.764673 - [info] Started modified nodes
17:59:11.774473 - [info] nora: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
17:59:11.786257 - [info] nora: <default> - disconnected
17:59:12.194697 - [info] nora: <default> - connected
17:59:14.955812 - [info] nora: <default> - synced 16 device(s)
17:59:17.103968 - [info] nora: <default> - synced 3 device(s)

I don't see any delay on the Node-RED side :-(

andrei-tatar commented 1 year ago

I think if it were a delay problem, it would say

synced 16 devices
synced 19 devices

But for you, it seems like 16 devices get removed until the 2nd sync.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

normanr commented 1 year ago

aaaand an update. After ~most of a day offline due to ISP maintenance (back online around 19:33), I triggered a partial deploy of 3 nora nodes at 20:11:

-- Boot bd3c67d4ec12432f97ec104acab066d8 --
16 Mar 18:25:59 - [info] nora: cache miss in fromConfig
16 Mar 18:25:59 - [info] nora: cache miss in getAppFromConfig
16 Mar 18:25:59 - [info] nora: app-1679016359031: authenticating...
16 Mar 18:26:19 - [error] nora: Error: getaddrinfo EAI_AGAIN api.smart-nora.eu
16 Mar 18:26:19 - [warn] nora: trying again in 40.5 sec
16 Mar 18:26:19 - [info] nora: undefined: delete from configs
16 Mar 18:26:19 - [info] nora: undefined: delete from apps
16 Mar 18:27:00 - [info] nora: app-1679016420188: authenticating...
16 Mar 18:49:11 - [info] nora: app-1679016420188: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
16 Mar 18:49:12 - [info] nora: app-1679016420188: <default> - disconnected
[... devices register, connects, syncs devices reconnects, no changes, multiple times...]
24 Mar 09:13:09 - [info] nora: app-1679016420188: <default> - connected
24 Mar 09:13:14 - [info] nora: app-1679016420188: <default> - no device changes
29 Mar 07:30:21 - [error] [noraf-light:931a8a4ba2a27285] Error: getaddrinfo EAI_AGAIN api.smart-nora.eu
29 Mar 07:36:34 - [info] nora: app-1679016420188: <default> - disconnected
29 Mar 19:32:59 - [info] nora: app-1679016420188: <default> - connected
29 Mar 19:33:03 - [info] nora: app-1679016420188: <default> - no device changes
29 Mar 20:11:22 - [info] Stopping modified nodes
29 Mar 20:11:22 - [info] nora: app-1679016420188: device 7fbcbb9caa6ce10e unregistered
29 Mar 20:11:22 - [info] nora: app-1679016420188: device 952b526abea35074 unregistered
29 Mar 20:11:22 - [info] nora: app-1679016420188: device ceb4eff01e3469cd unregistered
29 Mar 20:11:22 - [info] Stopped modified nodes
29 Mar 20:11:22 - [info] Updated flows
29 Mar 20:11:22 - [info] Starting modified nodes
29 Mar 20:11:22 - [info] nora: cache miss in fromConfig
29 Mar 20:11:22 - [info] nora: cache miss in getAppFromConfig
29 Mar 20:11:22 - [info] nora: app-1680145882397: authenticating...
29 Mar 20:11:22 - [info] Started modified nodes
29 Mar 20:11:23 - [info] nora: app-1680145882397: authenticated, uid: UVOhdwwyJDf5YXcMecDNesvd4NA2
29 Mar 20:11:23 - [info] nora: app-1680145882397: <default> - disconnected
29 Mar 20:11:23 - [info] nora: app-1680145882397: device 7fbcbb9caa6ce10e registered
29 Mar 20:11:23 - [info] nora: app-1680145882397: device 952b526abea35074 registered
29 Mar 20:11:23 - [info] nora: app-1680145882397: device ceb4eff01e3469cd registered
29 Mar 20:11:24 - [info] nora: app-1680145882397: <default> - connected
29 Mar 20:11:29 - [info] nora: app-1680145882397: <default> - synced 3 device(s)
29 Mar 20:11:31 - [info] nora: app-1679016420188: <default> - synced 16 device(s)
andrei-tatar commented 1 year ago

@normanr hmm, interesting. Is there a way you could export your entire flow?

normanr commented 1 year ago

Sure is there a way I can send it to you privately? I guess there's no real sensitive data, but I'd prefer to not make it public.

I went back and updated the above log to include some earlier events (that might be relevant). As a comparison here's the exact same deployment today:

31 Mar 13:28:43 - [info] nora: app-1680225730500: <default> - connected
31 Mar 13:28:53 - [info] nora: app-1680225730500: <default> - no device changes
1 Apr 15:47:44 - [info] Stopping modified nodes
1 Apr 15:47:44 - [info] nora: app-1680225730500: device 7fbcbb9caa6ce10e unregistered
1 Apr 15:47:44 - [info] nora: app-1680225730500: device 952b526abea35074 unregistered
1 Apr 15:47:44 - [info] nora: app-1680225730500: device ceb4eff01e3469cd unregistered
1 Apr 15:47:44 - [info] Stopped modified nodes
1 Apr 15:47:44 - [info] Updated flows
1 Apr 15:47:44 - [info] Starting modified nodes
1 Apr 15:47:44 - [info] nora: app-1680225730500: device 7fbcbb9caa6ce10e registered
1 Apr 15:47:44 - [info] nora: app-1680225730500: device 952b526abea35074 registered
1 Apr 15:47:44 - [info] nora: app-1680225730500: device ceb4eff01e3469cd registered
1 Apr 15:47:44 - [info] Started modified nodes
1 Apr 15:47:47 - [info] nora: app-1680225730500: <default> - synced 19 device(s)
andrei-tatar commented 1 year ago

@normanr you can send it over on my email address: tatar andrei (in one word) at gmail dot com

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

andrei-tatar commented 1 year ago

@normanr Did this happen lately? Did you manage to find anything relevant? I didn't receive the flow yet.

I actually have an idea of what the problem might be. I'll try to reproduce it.

13 May 10:59:03 - [info] Starting flows
13 May 10:59:03 - [info] Started flows
13 May 10:59:03 - [error] nora: Error: getaddrinfo ENOTFOUND api.smart-nora.eu
13 May 10:59:03 - [warn] nora: trying again in 65 sec
13 May 11:00:09 - [info] nora: authenticated, uid: bIWnK58ZkwWj8oaHA6jCIXeaPJJ3
13 May 11:00:09 - [info] nora: f1e538c1:0315f8 - disconnected
13 May 11:00:09 - [info] nora: f1e538c1:0315f8 - connected
13 May 11:00:17 - [info] nora: f1e538c1:0315f8 - synced 3 device(s)
13 May 11:00:24 - [info] Stopping modified nodes
13 May 11:00:24 - [info] Stopped modified nodes
13 May 11:00:24 - [info] Updated flows
13 May 11:00:24 - [info] Starting modified nodes
13 May 11:00:24 - [info] Started modified nodes
13 May 11:00:24 - [info] nora: authenticated, uid: bIWnK58ZkwWj8oaHA6jCIXeaPJJ3
13 May 11:00:24 - [info] nora: f1e538c1:0315f8 - disconnected
13 May 11:00:24 - [info] nora: f1e538c1:0315f8 - connected
13 May 11:00:33 - [info] nora: f1e538c1:0315f8 - synced 2 device(s)
13 May 11:00:35 - [info] nora: f1e538c1:0315f8 - synced 1 device(s)

Success! Will push a fix shortly.

andrei-tatar commented 1 year ago

@normanr published version 1.19.0 which fixes the issue.

normanr commented 1 year ago

I assume it was moving finalize after retry that is the core fix. I'll try and find some time in the next week or so to upgrade and test the fix.

andrei-tatar commented 1 year ago

Yes, that was it!

Repro steps are:

  1. Disable network
  2. Start node-red and wait until you see the nora, connection failed retry in x sec
  3. Enable network
  4. Change 1 node and deploy only the modified nodes
  5. Observe

Because finalize was before retry, on the 1st failure the connection was removed from the cache but it still remained active and retrying (because of the retry), while being used by the nodes. When at least one node would change it would result in that node being removed from that orphaned connection and assigned to a new one. Now both connections would sync their devices (1st connection with x-1 devices and 2nd connection with 1 device) overriding each other.