balena-io / open-balena-vpn

openBalena VPN
GNU Affero General Public License v3.0
58 stars 20 forks source link

VPN link to device down for hours or days following "dropping oos disconnect event" warning in backend logs #221

Closed pdcastro closed 2 years ago

pdcastro commented 2 years ago

Update 2022-08-16 This issue likely refers to the same underlying cause as #287.

Device logs shared by a user (see linked thread):

 2022-03-18 21:46:05.399237+00 | openvpn                                    | Fri Mar 18 21:46:05 2022 OpenVPN 2.4.7 x86_64-poky-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on Mar 15 2021
...
 2022-03-18 21:46:12.228184+00 | openvpn                                    | Fri Mar 18 21:46:12 2022 Initialization Sequence Completed

 2022-03-19 10:05:07.108315+00 | systemd-tmpfiles                           | [/etc/tmpfiles.d/openvpn.conf:1] Line references path below legacy directory /var/run/, updating /var/run/openvpn → /run/openvpn; please update the tmpfiles.d/ drop-in file accordingly.
 2022-03-20 10:05:07.11587+00  | systemd-tmpfiles                           | [/etc/tmpfiles.d/openvpn.conf:1] Line references path below legacy directory /var/run/, updating /var/run/openvpn → /run/openvpn; please update the tmpfiles.d/ drop-in file accordingly.

 2022-03-20 13:50:36.516331+00 | openvpn                                    | Sun Mar 20 13:50:36 2022 [vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting

In the logs above, the VPN link was down for around 40 hours (as reported by the user in the linked support thread) between timestamps 2022-03-18 21:46 (Initialization Sequence Completed) and 2022-03-20 13:50 (Inactivity timeout (--ping-restart), restarting).

Correlating those device logs with the 'resin-vpn' backend logs from logentries.com, I see that the outage begins with the following logged warning message (in backend logs):

18 Mar 2022 21:46:56.625open-balena-vpn-5f67b5dd87-pns9p vpn[517]: warning: [vpn-125520] dropping oos disconnect event for uuid=dadc266427f2de52562a5fde78ee0359 worker=7 (expected=3)

Logentries link: https://logentries.com/app/5915e005#/search/log/5b01a592?log_q=where(dadc266427f2de52562a5fde78ee0359)&f=1647637200000&t=1647810000000

The 'dropping oos disconnect event' warning message comes from the following lines of code (VPN backend service): https://github.com/balena-io/open-balena-vpn/blob/v11.2.4/src/api.ts#L121-L132

        if (workerMap[uuid] !== workerId) {
            logger.warning(
                `dropping oos disconnect event for uuid=${uuid} worker=${workerId} (expected=${workerMap[uuid]})`,
            );
            captureException(
                new Error('Out of Sync OpenVPN Client Event Received'),
                'openvpn-oos-event',
                { tags: { uuid }, req },
            );
            return res.sendStatus(400);
        }

So 'oos' stands for 'out of sync'. Note also "worker=7 (expected=3)" in the warning message. Could it be that, for some reason, the backend worker assigned to a device (if there is such as thing, not sure that workers are assigned to devices) changed and this somehow caused the the openvpn service on the device to become idle? The backend sends code 400 (Bad Request) to the device. Long-shot theory, and it is particularly puzzling that openvpn would tolerate 40 hours of radio silence following error code 400. What's going on?

Device running balenaOS v2.95.1, supervisor v12.11.36.

jellyfish-bot commented 2 years ago

[pdcastro] This issue has attached support thread https://jel.ly.fish/0fe20dc6-7fdb-48b7-b39e-75c2a1a569f6

jellyfish-bot commented 2 years ago

[jaomaloy] This has attached https://jel.ly.fish/57808faf-3017-4de8-98de-9a2a26642547

jellyfish-bot commented 2 years ago

[kb2ma] Should be fixed by #285.