Description
The daily run of VirtualMachinePool::calculate_showback() executes BEGIN TRANSACTION; REPLACE INTO ... (repeated); COMMIT but if one of the REPLACE INTO fails because the database is locked, the COMMIT will not be executed and the single global database connection will remain in the transaction forever.
This went unnoticed for 15 weeks until an unrelated out of memory issue meant that oned was restarted - immediately losing all database changes for the previous 15 weeks. Another 15 weeks later (the timing is a coincidence) the overwriting of an active VM was noticed because they share the same incrementing ID and the VMs that were created in the earlier period where database writes were lost were still running in libvirt.
The initial log messages when this happens are:
Fri Oct 20 00:00:21 2023 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: database is lockedFri Oct 20 00:00:21 2023 [Z0][ReM][E]: Req:6896 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..
The regular daily log messages indicating that this has happened are:
Fri Feb 9 00:00:19 2024 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: cannot start a transaction within a transactionFri Feb 9 00:00:19 2024 [Z0][ReM][E]: Req:528 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..
To Reproduce
I don't know why the database was locked at the time this tried to run so I don't know how to reproduce it other than obtaining an exclusive transaction with another connection.
This issue won't occur on other database types because a transaction isn't used (SqlFeature::MULTIPLE_VALUE is true).
Expected behavior
The SQLite database transaction must be rolled back if there's an error executing SQL commands.
Details
Hypervisor: KVM
Version: 6.0.0.3 (yes, this is very old but the implementation of this part of VirtualMachinePool hasn't changed)
Additional context
If a sequence of SQL commands is executed and one of them fails, the rest of them will not be executed. For example, the INSERT below fails so the COMMIT is never executed. The connection remains in the transaction. Because this is the only OpenNebula component that uses transactions, COMMIT or ROLLBACK are never executed on their own to fix this and none of the other SQL commands realise that they're executing inside a long transaction.
sqlite> .open test.sqlite
sqlite> begin transaction; insert into a values(42); commit;
Error: database is locked
sqlite> begin transaction;
Error: cannot start a transaction within a transaction
If this has happened in a running instance, the only recovery process I can think of would be to attach to oned with gdb and execute a COMMIT on the database connection (assuming SQLite still has the writes but hasn't committed them and the database isn't still locked). Do not shutdown or restart the service because all the uncommitted writes to be lost.
Description The daily run of
VirtualMachinePool::calculate_showback()
executesBEGIN TRANSACTION; REPLACE INTO ... (repeated); COMMIT
but if one of theREPLACE INTO
fails because the database is locked, theCOMMIT
will not be executed and the single global database connection will remain in the transaction forever.This went unnoticed for 15 weeks until an unrelated out of memory issue meant that
oned
was restarted - immediately losing all database changes for the previous 15 weeks. Another 15 weeks later (the timing is a coincidence) the overwriting of an active VM was noticed because they share the same incrementing ID and the VMs that were created in the earlier period where database writes were lost were still running in libvirt.The initial log messages when this happens are:
Fri Oct 20 00:00:21 2023 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: database is locked
Fri Oct 20 00:00:21 2023 [Z0][ReM][E]: Req:6896 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..
The regular daily log messages indicating that this has happened are:
Fri Feb 9 00:00:19 2024 [Z0][ONE][E]: SQL command was: BEGIN TRANSACTION; REPLACE INTO vm_showback ...; COMMIT, error: cannot start a transaction within a transaction
Fri Feb 9 00:00:19 2024 [Z0][ReM][E]: Req:528 UID:0 one.vmpool.calculateshowback result FAILURE [one.vmpool.calculateshowback] User [0] : Error writing to DB..
To Reproduce I don't know why the database was locked at the time this tried to run so I don't know how to reproduce it other than obtaining an exclusive transaction with another connection.
This issue won't occur on other database types because a transaction isn't used (
SqlFeature::MULTIPLE_VALUE
istrue
).Expected behavior The SQLite database transaction must be rolled back if there's an error executing SQL commands.
Details
Additional context If a sequence of SQL commands is executed and one of them fails, the rest of them will not be executed. For example, the
INSERT
below fails so theCOMMIT
is never executed. The connection remains in the transaction. Because this is the only OpenNebula component that uses transactions,COMMIT
orROLLBACK
are never executed on their own to fix this and none of the other SQL commands realise that they're executing inside a long transaction.Process 1 (keep running):
Process 2:
If this has happened in a running instance, the only recovery process I can think of would be to attach to
oned
withgdb
and execute aCOMMIT
on the database connection (assuming SQLite still has the writes but hasn't committed them and the database isn't still locked). Do not shutdown or restart the service because all the uncommitted writes to be lost.Progress Status