balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Resource starvation may corrupt database file without a way to recover #2196

Open pipex opened 1 year ago

pipex commented 1 year ago

We have seen multiple instances of resource-starved devices, either because the hardware is limited (like the Pi Zero) or because the applications are resource-demanding, where the supervisor experiences write errors and broken pipes. One of the issues in these cases is the supervisor failing to access the database, with KnexTimeoutErrors like the one below:

Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]   Scheduling another update attempt in 64000ms due to failure:  KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]         at Client_SQLite3.acquireConnection (/usr/src/app/dist/app.js:2:1646253)
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]       at async Runner.ensureConnection (/usr/src/app/dist/app.js:2:1799705)
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]       at async Runner.run (/usr/src/app/dist/app.js:2:1796342)
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]       at async Promise.all (index 1)
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]       at async Object.exports.getTarget (/usr/src/app/dist/app.js:6:213402)
Sep 04 10:51:09 3ecd818 balena-supervisor[5381]: [error]       at async fn (/usr/src/app/dist/app.js:10:1793)

In some instances this leaves the database in a corrupted state that sends the supervisor into a dead loop of KnexTimeoutErrors. The database is readable with the sqlite3 tool, but knex doesn't seem to be able to open it. The only way to recover is to stop the supervisor, delete the database, and let the supervisor re-create it.

While there is nothing we can do about the error itself, we are trying to figure out if there is a way to recover automatically from the corrupted state.
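As a starting point for discussion, here is a minimal sketch of what an automatic recovery step could look like. It is not the supervisor's implementation: the supervisor talks to SQLite through knex in Node, and Python is used here only to keep the example self-contained. The function names `probe_database` and `quarantine_if_broken` are hypothetical. The sketch assumes a failed probe is enough to justify moving the file aside, so it can be re-created, rather than deleting it. Since the database stays readable with the sqlite3 tool in the cases above, `PRAGMA integrity_check` may well pass on an affected file, so a real fix would likely need a probe closer to what knex actually does.

```python
import os
import sqlite3
import time


def probe_database(path, timeout_s=5.0):
    """Return True if the file opens and passes SQLite's integrity check."""
    try:
        conn = sqlite3.connect(path, timeout=timeout_s)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.Error:
        # Covers "file is not a database", malformed images, lock timeouts, etc.
        return False
    # A healthy database reports a single row containing "ok".
    return rows == [("ok",)]


def quarantine_if_broken(path, timeout_s=5.0):
    """Move a failing database aside; return the backup path, or None if healthy.

    Hypothetical recovery policy: the caller is expected to re-create the
    database after the file has been moved out of the way.
    """
    if not os.path.exists(path):
        return None  # nothing to recover; avoid creating an empty file
    if probe_database(path, timeout_s):
        return None
    backup = f"{path}.corrupt-{int(time.time())}"
    os.replace(path, backup)  # keep the broken file for later inspection
    return backup
```

Such a check would have to run before the database pool is opened and while no other process holds the file. Any `-journal` or `-wal` files left next to the database would need the same treatment, otherwise SQLite could try to replay them against the freshly created file.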

One easy way to replicate this is to start a process that runs the following bash script:

(while true; do date; done)

jellyfish-bot commented 1 year ago

[pipex] This has attached https://jel.ly.fish/7f2ba6be-1b8e-417c-99ae-50771b23a547