Closed pdcastro closed 2 years ago
[pdcastro] This issue has attached support thread https://jel.ly.fish/0fe20dc6-7fdb-48b7-b39e-75c2a1a569f6
[jaomaloy] This has attached https://jel.ly.fish/57808faf-3017-4de8-98de-9a2a26642547
[kb2ma] Should be fixed by #285.
Update 2022-08-16 This issue likely refers to the same underlying cause as #287.
Device logs shared by a user (see linked thread):
In the logs above, the VPN link was down for around 40 hours (as reported by the user in the linked support thread) between timestamps 2022-03-18 21:46 (Initialization Sequence Completed) and 2022-03-20 13:50 (Inactivity timeout (--ping-restart), restarting).
Correlating those device logs with the 'resin-vpn' backend logs from logentries.com, I see that the outage begins with the following logged warning message (in backend logs):
Logentries link: https://logentries.com/app/5915e005#/search/log/5b01a592?log_q=where(dadc266427f2de52562a5fde78ee0359)&f=1647637200000&t=1647810000000
The
'dropping oos disconnect event'
warning message comes from the following lines of code (VPN backend service): https://github.com/balena-io/open-balena-vpn/blob/v11.2.4/src/api.ts#L121-L132So 'oos' stands for 'out of sync'. Note also "worker=7 (expected=3)" in the warning message. Could it be that, for some reason, the backend worker assigned to a device (if there is such as thing, not sure that workers are assigned to devices) changed and this somehow caused the the openvpn service on the device to become idle? The backend sends code 400 (Bad Request) to the device. Long-shot theory, and it is particularly puzzling that openvpn would tolerate 40 hours of radio silence following error code 400. What's going on?
Device running balenaOS v2.95.1, supervisor v12.11.36.