balena-io / open-balena-api

The core API of openBalena
https://balena.io/open
GNU Affero General Public License v3.0
59 stars 28 forks source link

API heartbeat error #1349

Open dcaputo-harmoni opened 1 year ago

dcaputo-harmoni commented 1 year ago

I'm receiving the message below for a group of devices exactly every 30 seconds when running open-balena-api 11.8.3. Upgraded from an older version (0.194.0) so am wondering if could be caused by items that existed in the device-online-state redis queue at the time of cutover that were entered using the older api. I believe previously this queue used uuid to identify devices, and that was changed over to id at some point between my last version and the current one. I'm assuming the 30 second frequency is driven by RSMQ_READ_TIMEOUT being set to 30 seconds.

And separately, does anyone know if I can address this by just flushing the device state queue in redis? If so, how would I go about doing this? Presumably I could just run FLUSHALL on my redis instance but not sure if that would have any negative effects elsewhere.

May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]: DeviceStateManager: Error updating the API state of device undefined to timeout. If the id property is set it must be non-null Error: If the id property is set it must be non-null
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.compile (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1421:12)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.request (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1512:32)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.patch (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1194:15)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at DeviceOnlineStateManager.updateDeviceModel (/usr/src/app/src/features/device-heartbeat/index.ts:251:20)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at /usr/src/app/src/features/device-heartbeat/index.ts:303:19
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at processTicksAndRejections (node:internal/process/task_queues:95:5)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]: An error occurred trying to process an API heartbeat event. If the id property is set it must be non-null Error: If the id property is set it must be non-null
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.compile (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1421:12)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.request (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1512:32)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at PinejsClient.patch (/usr/src/app/node_modules/pinejs-client-core/src/index.ts:1194:15)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at DeviceOnlineStateManager.updateDeviceModel (/usr/src/app/src/features/device-heartbeat/index.ts:251:20)
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at /usr/src/app/src/features/device-heartbeat/index.ts:303:19
May 13 20:02:12 openbalena-api-78d9b9466-xqvlz api[971]:     at processTicksAndRejections (node:internal/process/task_queues:95:5)

And one further note - when I view the device-online-state:expired:Q hash in redis I get the following (truncated - but more below with uuid).

127.0.0.1:6379> hgetall device-online-state:expired:Q
  1) "gkxl808qmpd3Q09ZU5HOH7O3ZkpA3deZ"
  2) "{\"deviceId\":66,\"nextState\":\"timeout\"}"
  3) "gkvm3r7i83BPsSYBJHw8Zj8F9fAJ7zCd:fr"
  4) "1683861304212"
  5) "gkvm4ef4vabtiFjXfeIOEjUJQlfuPFMn:fr"
  6) "1683861304248"
  7) "gkvm1yk0yq5VpBcil8I9ASWEInJAevHZ:fr"
  8) "1683861304200"
  9) "gkvm1jg53eTW74Ye79n5Esmoi86Kxjef:fr"
 10) "1683861304195"
 11) "gkvm3yxck2ZMISogzEeFE3QRL3J021xk:fr"
 12) "1683861304220"
 13) "gkvluxxc96XeK4qUYO6mus29Y65zfBUp"
 14) "{\"uuid\":\"9050ca7f113caf47396d4f1493cc945f\",\"nextState\":\"timeout\"}"
 15) "gkvm3yxck2ZMISogzEeFE3QRL3J021xk:rc"
 16) "5026"
 17) "gkvlz9rltlg7SuWT99sueEBbgkf1zXYh"
 18) "{\"uuid\":\"70429306a30c6c747f88ba64739218c0\",\"nextState\":\"timeout\"}"
...
thgreasi commented 1 year ago

Hi, We indeed changed the state manager's object and switched it from uuids to ids as you mentioned to make the implementation slightly more performant and using less space in Redis at the same time. In previous API version the heartbeat mechanism was quite forgiving for such migration errors, but we recently revamped it so that to avoid having the DB & Redis device heartbeat states go out of sync, after which point we changed it fails for such unexpected cases, so that we can notice them if they are occurring.

In general we do suggest upgrading one major at a time and I think in this case it would have helped avoiding this issue. A way that you can use to get your device state in sync between the DB & Redis after v11.7.0 is setting the API_HEARTBEAT_STATE_ONLINE_UPDATE_CACHE_TIMEOUT env var to any non-negative integer value (milliseconds). I do see that you already have a PR for this though, so I guess that you might no longer be in this inconsistent state so the API_HEARTBEAT_STATE_ONLINE_UPDATE_CACHE_TIMEOUT is only a suggestion in case you face a similar issue in the future.