OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0
1.19k stars 473 forks source link

Catastrophic data loss because `VirtualMachinePool::calculate_showback()` can leave the SQLite database connection in a transaction forever #6592

Open nomis opened 1 month ago

nomis commented 1 month ago

Description The daily run of VirtualMachinePool::calculate_showback() executes BEGIN TRANSACTION; REPLACE INTO ... (repeated); COMMIT but if one of the REPLACE INTO fails because the database is locked, the COMMIT will not be executed and the single global database connection will remain in the transaction forever.

This went unnoticed for 15 weeks until an unrelated out of memory issue meant that oned was restarted - immediately losing all database changes for the previous 15 weeks. Another 15 weeks later (the timing is a coincidence) the overwriting of an active VM was noticed because they share the same incrementing ID and the VMs that were created in the earlier period where database writes were lost were still running in libvirt.

The initial log messages when this happens are: Fri Oct 20 00:00:21 2023 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: database is locked Fri Oct 20 00:00:21 2023 [Z0][ReM][E]: Req:6896 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..

The regular daily log messages indicating that this has happened are: Fri Feb 9 00:00:19 2024 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: cannot start a transaction within a transaction Fri Feb 9 00:00:19 2024 [Z0][ReM][E]: Req:528 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..

To Reproduce I don't know why the database was locked at the time this tried to run so I don't know how to reproduce it other than obtaining an exclusive transaction with another connection.

This issue won't occur on other database types because a transaction isn't used (SqlFeature::MULTIPLE_VALUE is true).

Expected behavior The SQLite database transaction must be rolled back if there's an error executing SQL commands.

Details

Additional context If a sequence of SQL commands is executed and one of them fails, the rest of them will not be executed. For example, the INSERT below fails so the COMMIT is never executed. The connection remains in the transaction. Because this is the only OpenNebula component that uses transactions, COMMIT or ROLLBACK are never executed on their own to fix this and none of the other SQL commands realise that they're executing inside a long transaction.

Process 1 (keep running):

sqlite> .open test.sqlite
sqlite> create table test(a int);
sqlite> begin exclusive;

Process 2:

sqlite> .open test.sqlite
sqlite> begin transaction; insert into a values(42); commit;
Error: database is locked
sqlite> begin transaction;
Error: cannot start a transaction within a transaction

If this has happened in a running instance, the only recovery process I can think of would be to attach to oned with gdb and execute a COMMIT on the database connection (assuming SQLite still has the writes but hasn't committed them and the database isn't still locked). Do not shutdown or restart the service because all the uncommitted writes to be lost.

Progress Status