OpenNebula / one

The open source Cloud & Edge Computing Platform bringing real freedom to your Enterprise Cloud 🚀
http://opennebula.io
Apache License 2.0

Backup live "write lock" error and workaround for it #6199

Closed Franco-Sparrow closed 1 year ago

Franco-Sparrow commented 1 year ago

Description

Live backups can't be done because of a write lock on the qcow2 disk.

To Reproduce

Backup options for VM 84:

BACKUP_CONFIG=[
  BACKUP_VOLATILE="NO",
  FS_FREEZE="AGENT",
  INCREMENTAL_BACKUP_ID="-1",
  KEEP_LAST="7",
  LAST_INCREMENT_ID="-1",
  MODE="INCREMENT" ]

System datastore:

DATASTORE 0 INFORMATION
ID             : 0
NAME           : system
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 0
TYPE           : SYSTEM
DS_MAD         : -
TM_MAD         : shared
BASE PATH      : /var/lib/one//datastores/0
DISK_TYPE      : FILE
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 759G
FREE:          : 491.4G
USED:          : 267.6G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="FORMAT"
DISK_TYPE="FILE"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="shared"
TYPE="SYSTEM_DS"

IMAGES

Images datastore:

DATASTORE 1 INFORMATION
ID             : 1
NAME           : images
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 0
TYPE           : IMAGE
DS_MAD         : fs
TM_MAD         : shared
BASE PATH      : /var/lib/one//datastores/1
DISK_TYPE      : FILE
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 759G
FREE:          : 491.4G
USED:          : 267.6G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="FORMAT"
CLONE_TARGET="SYSTEM"
CLONE_TARGET_SSH="SYSTEM"
DISK_TYPE="FILE"
DISK_TYPE_SSH="FILE"
DS_MAD="fs"
LN_TARGET="NONE"
LN_TARGET_SSH="SYSTEM"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
TM_MAD="shared"
TM_MAD_SYSTEM="ssh"
TYPE="IMAGE_DS"

IMAGES
0
1
2
3
4
5
6
8
20
26
29
31
35
38
41
42
43
52
53
55
56
57
58
59
61
62
63
64

Gluster volume info:

gluster volume info vol1

Volume Name: vol1
Type: Distributed-Disperse
Volume ID: 6a665cb2-82e5-41c7-84e2-c6a7acae8bca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: storage170:/data/glusterfs/vol1/brick0/data
Brick2: storage170:/data/glusterfs/vol1/brick1/data
Brick3: storage170:/data/glusterfs/vol1/brick2/data
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
storage.owner-uid: 9869
storage.owner-gid: 9869
server.allow-insecure: on
network.remote-dio: enable
cluster.eager-lock: enable
performance.io-cache: off
performance.quick-read: off
performance.read-ahead: off
user.cifs: off
features.shard: enable
features.shard-block-size: 5GB

Trying to make a live backup with --reset:

onevm backup --reset -d 100 84

The following is a snippet from /var/log/one/84.log after the backup action failed:

# [...]
# --------------------------------------
# Create backup live
# --------------------------------------
/var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84

Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
+ '[' -d /var/lib/one//datastores/0/84/tmp ']'
+ '[' -d /var/lib/one//datastores/0/84/backup ']'
+ rm -rf /var/lib/one//datastores/0/84/backup
+ mkdir -p /var/lib/one//datastores/0/84/tmp
+ mkdir -p /var/lib/one//datastores/0/84/backup
+ echo # [...]

Trying to make a live backup without --reset:

onevm backup -d 100 84
# [...]
# --------------------------------------
# Create backup live
# --------------------------------------
/var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84

Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
+ '[' -d /var/lib/one//datastores/0/84/tmp ']'
+ '[' -d /var/lib/one//datastores/0/84/backup ']'
+ mkdir -p /var/lib/one//datastores/0/84/tmp
+ mkdir -p /var/lib/one//datastores/0/84/backup
+ echo [...]

In both cases the key error is: Failed to get shared "write" lock. Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?

Expected behavior

Backup without the "write lock" error.

Proposed solution

Thanks to nacho for the help here :)

Edit the file /var/lib/one/remotes/tm/lib/backup_qcow2.rb on all the orchestrators:

sudo -u oneadmin nano /var/lib/one/remotes/tm/lib/backup_qcow2.rb

Leave it as follows (line 24 edited):

# [...]
19 require 'json'
20 require 'open3'
21 require 'rexml/document'
22 require 'base64'
23 require 'getoptlong'
24 require 'timeout'
25
26 require_relative './kvm'
# [...]

Add the following code from line 563 to 573 (the waitState definition), right before the backup_full_live definition:

# [...]
563
564    def waitState(srcDiskFile)
565        out = `qemu-img info #{srcDiskFile}`
566        # log("info","out = #{out}, success = #{$?.success?}")
567        unless $?.success?
568           sleep(5)
569           return waitState(srcDiskFile)
570        else
571           return true
572        end
573    end
574
575    def backup_full_live(disks_s)
# [...]

Add the following code just after line 628 (new lines 629 to 637):

628        qdisk.each do |did, disk|
629            begin
630                log("[BCK]: Waiting for image release")
631                status = Timeout::timeout(120) {
632                    waitState("#{@vm_dir}/disk.#{did}")
633                }
634            rescue Timeout::Error => e
635                log("[BCK]: Failed to release the image for write, Timeout exceeded")
636                log("[BCK]: ERR: #{e}")
637            end
638            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2')
639        end
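
For reference, here is a minimal standalone sketch of the same polling idea, written with Open3 (which backup_qcow2.rb already requires) instead of backticks. It is only an illustration under the same assumptions as the workaround above, not part of the patch itself; the helper name wait_writable and the disk path are examples:

require 'open3'
require 'timeout'

# Poll 'qemu-img info' until the image can be opened again, i.e. the qemu
# "write" lock has been released, sleeping 5 s between attempts.
def wait_writable(disk_path, interval = 5)
  loop do
    _out, status = Open3.capture2e('qemu-img', 'info', disk_path)
    return true if status.success?

    sleep(interval)
  end
end

# Bound the wait to 120 s, mirroring the Timeout guard added after line 628
# (example path; qemu-img must be installed for this to run).
Timeout.timeout(120) { wait_writable('/var/lib/one/datastores/0/84/disk.0') }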

NOTE
Remember to copy this edited file to the rest of the orchestrators (frontends).

Update the hosts:

sudo -u oneadmin onehost sync -f

The output is as follows:

* Adding ON-C2-N2-kvm to upgrade
* Adding ON-C2-N3-kvm to upgrade
* Adding ON-C2-N1-kvm to upgrade
[========================================] 3/3 ON-C2-N1-kvm
All hosts updated successfully.

NOTE
Do the following before trying the backup again.

Shut down the VM:

onevm poweroff 84

Delete the following directories:

rm -rf /var/lib/one/datastores/0/84/backup/ && \
rm -rf /var/lib/one/datastores/0/84/tmp/

Get the host where the VM is running:

onevm show 84 | grep HOST | grep kvm | awk '{print $NF}'

In my case, it was ON-C2-N2-kvm.

Remove any existing checkpoints:

ssh ON-C2-N2-kvm "rm -rf /var/lib/libvirt/qemu/domain-*-one-84/"

Resume the VM:

onevm resume 84

Run the backup again:

onevm backup --reset -d 100 84

The VM logfile /var/log/one/84.log will show the success of the FULL backup:

# [...]
Thu May  4 09:56:43 2023 [Z0][VM][I]: New LCM state is RUNNING
Thu May  4 09:56:43 2023 [Z0][VMM][I]: ExitCode: 0
Thu May  4 09:56:44 2023 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Thu May  4 09:56:44 2023 [Z0][VMM][I]: Successfully execute network driver operation: post.
Thu May  4 09:57:05 2023 [Z0][VM][I]: New LCM state is BACKUP
Thu May  4 09:57:20 2023 [Z0][VMM][I]: Successfully execute transfer manager driver operation: prebackup_live.
Thu May  4 09:57:25 2023 [Z0][VMM][I]: Successfully execute  operation: backup.
Thu May  4 09:57:27 2023 [Z0][VMM][I]: Successfully execute transfer manager driver operation: postbackup_live.
Thu May  4 09:57:27 2023 [Z0][VMM][I]: VM backup successfully created.
Thu May  4 09:57:27 2023 [Z0][VM][I]: New LCM state is RUNNING

Run an incremental backup now:

onevm backup -d 100 84

The VM logfile /var/log/one/84.log will show the success of the INCREMENTAL backup:

# [...]
Thu May  4 09:59:01 2023 [Z0][VM][I]: New LCM state is BACKUP
Thu May  4 09:59:04 2023 [Z0][VMM][I]: Successfully execute transfer manager driver operation: prebackup_live.
Thu May  4 09:59:06 2023 [Z0][VMM][I]: Successfully execute  operation: backup.
Thu May  4 09:59:07 2023 [Z0][VMM][I]: Successfully execute transfer manager driver operation: postbackup_live.
Thu May  4 09:59:07 2023 [Z0][VMM][I]: VM backup successfully created.
Thu May  4 09:59:07 2023 [Z0][VM][I]: New LCM state is RUNNING

The workaround makes OpenNebula wait for the file (the VM qcow2 disk) to become writable, up to 120 s (2 min). If the file is write-locked, it waits 5 s before trying again. If after 24 periods (120 s) the file has not been released for writing, the backup fails, just as it did before the workaround.

This was done on a cluster with external storage (Gluster Distributed-Disperse), shared over a Gigabit Ethernet network, on SATA SSD drives.


rsmontero commented 1 year ago

Is this happening to you only after using the reset flag?

Franco-Sparrow commented 1 year ago

Is this happening to you only after using the reset flag?

Hi Sir, thank you for the quick reply.

It happens always; it doesn't matter whether the reset flag was specified or not. We made a workaround for it and fixed it. Please take a look at the code we added and the details.

rsmontero commented 1 year ago

Please include full VM log files so we can verify configuration issues; I think it could be related to your Gluster setup.

rsmontero commented 1 year ago

That is already checked, as we make a snapshot before converting the disk. No, the issue is this one:

Thu May  4 09:49:38 2023 [Z0][VMM][I]: Error: Error executing 'virsh --connect qemu:///system snapshot-create-as --name one-84-backup --disk-only --atomic --diskspec vda,file=/var/lib/one//datastores/0/84/tmp/overlay_0.qcow2 2a727440-613a-40c3-9b13-26c57517727f':
Thu May  4 09:49:38 2023 [Z0][VMM][I]:
Thu May  4 09:49:38 2023 [Z0][VMM][I]: error: operation failed: domain moment one-84-backup already exists

You can either delete that manually or try a reset to clean the backup chain.

nachowork90 commented 1 year ago

I got the same issue, but I'm not able to reproduce it in different scenarios; the only common denominator is the storage backend's lock release!

Franco-Sparrow commented 1 year ago

That is already checked, as we make a snapshot before converting the disk. No, the issue is this one:

Thu May  4 09:49:38 2023 [Z0][VMM][I]: Error: Error executing 'virsh --connect qemu:///system snapshot-create-as --name one-84-backup --disk-only --atomic --diskspec vda,file=/var/lib/one//datastores/0/84/tmp/overlay_0.qcow2 2a727440-613a-40c3-9b13-26c57517727f':
Thu May  4 09:49:38 2023 [Z0][VMM][I]:
Thu May  4 09:49:38 2023 [Z0][VMM][I]: error: operation failed: domain moment one-84-backup already exists

You can either delete that manually or try a reset to clean the backup chain.

Wrong log, sorry for that. I got that error after many tries; a snapshot was created during the tests. But that error is not the one from the main post here. Here are the lines from the first test:

Thu May  4 08:50:58 2023 [Z0][VM][I]: New LCM state is BACKUP
Thu May  4 08:51:00 2023 [Z0][VMM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/shared/prebackup_live ON-C2-N2-kvm:/var/lib/one//datastores/0/84 0: 2a727440-613a-40c3-9b13-26c57517727f 84 0
Thu May  4 08:51:00 2023 [Z0][VMM][E]: prebackup_live: Command failed:
Thu May  4 08:51:00 2023 [Z0][VMM][I]: export LANG=C
Thu May  4 08:51:00 2023 [Z0][VMM][I]: export LC_ALL=C
Thu May  4 08:51:00 2023 [Z0][VMM][I]: set -ex -o pipefail
Thu May  4 08:51:01 2023 [Z0][VMM][I]:
Thu May  4 08:51:01 2023 [Z0][VMM][I]: # ----------------------------------
Thu May  4 08:51:01 2023 [Z0][VMM][I]: # Prepare the tmp and backup folders
Thu May  4 08:51:01 2023 [Z0][VMM][I]: # ----------------------------------
Thu May  4 08:51:01 2023 [Z0][VMM][I]: [ -d /var/lib/one//datastores/0/84/tmp ] && rm -rf /var/lib/one//datastores/0/84/tmp
Thu May  4 08:51:01 2023 [Z0][VMM][I]:
Thu May  4 08:51:01 2023 [Z0][VMM][I]: [ -d /var/lib/one//datastores/0/84/backup ] && rm -rf /var/lib/one//datastores/0/84/backup
Thu May  4 08:51:01 2023 [Z0][VMM][I]:
Thu May  4 08:51:01 2023 [Z0][VMM][I]: mkdir -p /var/lib/one//datastores/0/84/tmp
Thu May  4 08:51:01 2023 [Z0][VMM][I]:
Thu May  4 08:51:01 2023 [Z0][VMM][I]: mkdir -p /var/lib/one//datastores/0/84/backup
Thu May  4 08:51:01 2023 [Z0][VMM][I]:
Thu May  4 08:51:01 2023 [Z0][VMM][I]: echo "PFZNPjxJRD [...]
Thu May  4 08:51:04 2023 [Z0][VMM][I]: Uz48SUQ+NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4=
Thu May  4 08:51:04 2023 [Z0][VMM][I]: " > /var/lib/one//datastores/0/84/backup/vm.xml
Thu May  4 08:51:04 2023 [Z0][VMM][I]:
Thu May  4 08:51:04 2023 [Z0][VMM][I]: # --------------------------------------
Thu May  4 08:51:04 2023 [Z0][VMM][I]: # Create backup live
Thu May  4 08:51:04 2023 [Z0][VMM][I]: # --------------------------------------
Thu May  4 08:51:04 2023 [Z0][VMM][I]: /var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May  4 08:51:04 2023 [Z0][VMM][I]:
Thu May  4 08:51:04 2023 [Z0][VMM][I]: Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
Thu May  4 08:51:04 2023 [Z0][VMM][I]: qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Thu May  4 08:51:04 2023 [Z0][VMM][I]: Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Thu May  4 08:51:04 2023 [Z0][VMM][I]: Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
Thu May  4 08:51:04 2023 [Z0][VMM][I]: qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Thu May  4 08:51:04 2023 [Z0][VMM][I]: Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Thu May  4 08:51:04 2023 [Z0][VMM][I]: + '[' -d /var/lib/one//datastores/0/84/tmp ']'
Thu May  4 08:51:04 2023 [Z0][VMM][I]: + '[' -d /var/lib/one//datastores/0/84/backup ']'
Thu May  4 08:51:04 2023 [Z0][VMM][I]: + mkdir -p /var/lib/one//datastores/0/84/tmp
Thu May  4 08:51:04 2023 [Z0][VMM][I]: + mkdir -p /var/lib/one//datastores/0/84/backup
Thu May  4 08:51:04 2023 [Z0][VMM][I]: + echo 'PFZNPjxJRD44NDwvSUQ+PFVJR [...]
Thu May  4 08:51:08 2023 [Z0][VMM][I]: Uz48SUQ+NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4=
Thu May  4 08:51:08 2023 [Z0][VMM][I]: '
Thu May  4 08:51:08 2023 [Z0][VMM][I]: + /var/tmp/one/tm/lib/backup_qcow2.rb -l -d 0: -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May  4 08:51:08 2023 [Z0][VMM][I]: Failed to execute transfer manager driver operation: prebackup_live.
Thu May  4 08:51:08 2023 [Z0][VMM][E]: BACKUP: ERROR: prebackup_live: Command failed:
export LANG=C
export LC_ALL=C
set -ex -o pipefail
# ----------------------------------
# Prepare the tmp and backup folders
# ----------------------------------
[ -d /var/lib/one//datastores/0/84/tmp ] && rm -rf /var/lib/one//datastores/0/84/tmp
[ -d /var/lib/one//datastores/0/84/backup ] && rm -rf /var/lib/one//datastores/0/84/backup
mkdir -p /var/lib/one//datastores/0/84/tmp
mkdir -p /var/lib/one//datastores/0/84/backup
echo "PFZNPjxJRD44 [...] " > /var/lib/one//datastores/0/84/backup/vm.xml
# --------------------------------------
# Create backup live
# --------------------------------------
/var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
+ '[' -d /var/lib/one//datastores/0/84/tmp ']'
+ '[' -d /var/lib/one//datastores/0/84/backup ']'
+ mkdir -p /var/lib/one//datastores/0/84/tmp
+ mkdir -p /var/lib/one//datastores/0/84/backup
+ echo 'PFZNPjxJRD4 [...] NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4= '
+ /var/tmp/one/tm/lib/backup_qcow2.rb -l -d 0: -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May  4 08:51:08 2023 [Z0][VM][I]: New LCM state is RUNNING

Sir, don't you think it would be wise for OpenNebula to check whether the file (qcow2 disk) is writable before executing a backup, regardless of whether Gluster is at fault in this situation?

nachowork90 commented 1 year ago

Using an I/O-intensive VM for testing we can get the same behavior, independently of the storage backend quality!

rsmontero commented 1 year ago

@Franco-Sparrow Note that we are NOT writing to the original disk, as that would overwrite the VM disk. I think what is happening in your case is that Gluster needs to flush some write operations on the disk, so the actual qemu lock is not released before the next command is issued. That would be in line with @nachowork90's comment.

Could you add the following? (I haven't tried it myself...) The idea is to not request the write lock:

diff --git a/src/tm_mad/lib/backup_qcow2.rb b/src/tm_mad/lib/backup_qcow2.rb
index 9d0a2deffd..d8ff48a328 100755
--- a/src/tm_mad/lib/backup_qcow2.rb
+++ b/src/tm_mad/lib/backup_qcow2.rb
@@ -614,7 +614,7 @@ class KVMDomain
         fsthaw

         qdisk.each do |did, disk|
-            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2')
+            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2', :U => '')
         end

         log("[BCK]: Full backup done in #{Time.now - init}s")
@@ -706,7 +706,7 @@ class KVMDomain
             sdisk = QemuImg.new("#{@vm_dir}/disk.#{did}")
             ddisk = "#{@bck_dir}/disk.#{did}.0"

-            sdisk.convert(ddisk, :m => '4', :O => 'qcow2')
+            sdisk.convert(ddisk, :m => '4', :O => 'qcow2', :U => '')

             if @checkpoint
                 sdisk.bitmaps.each {|bm| sdisk.bitmap(bm['name'], :remove => '') }

This should add -U (force-share) to the convert operation.
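
For illustration, a small Ruby sketch (a hypothetical helper, not the actual QemuImg class) of how an options hash such as :m => '4', :O => 'qcow2', :U => '' could be rendered into qemu-img flags, so the empty-valued :U key ends up as a bare -U on the command line:

# Hypothetical rendering of an options hash into qemu-img flags; an empty
# value produces a flag with no argument (e.g. :U => '' becomes -U).
def qemu_img_flags(opts)
  opts.map do |flag, value|
    value.to_s.empty? ? "-#{flag}" : "-#{flag} #{value}"
  end.join(' ')
end

puts "qemu-img convert #{qemu_img_flags(:m => '4', :O => 'qcow2', :U => '')} src.qcow2 dst.qcow2"
# => qemu-img convert -m 4 -O qcow2 -U src.qcow2 dst.qcow2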

Franco-Sparrow commented 1 year ago

@Franco-Sparrow Note that we are NOT writing to the original disk, as that would overwrite the VM disk. I think what is happening in your case is that Gluster needs to flush some write operations on the disk, so the actual qemu lock is not released before the next command is issued. That would be in line with @nachowork90's comment.

Could you add the following? (I haven't tried it myself...) The idea is to not request the write lock:

diff --git a/src/tm_mad/lib/backup_qcow2.rb b/src/tm_mad/lib/backup_qcow2.rb
index 9d0a2deffd..d8ff48a328 100755
--- a/src/tm_mad/lib/backup_qcow2.rb
+++ b/src/tm_mad/lib/backup_qcow2.rb
@@ -614,7 +614,7 @@ class KVMDomain
         fsthaw

         qdisk.each do |did, disk|
-            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2')
+            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2', :U => '')
         end

         log("[BCK]: Full backup done in #{Time.now - init}s")
@@ -706,7 +706,7 @@ class KVMDomain
             sdisk = QemuImg.new("#{@vm_dir}/disk.#{did}")
             ddisk = "#{@bck_dir}/disk.#{did}.0"

-            sdisk.convert(ddisk, :m => '4', :O => 'qcow2')
+            sdisk.convert(ddisk, :m => '4', :O => 'qcow2', :U => '')

             if @checkpoint
                 sdisk.bitmaps.each {|bm| sdisk.bitmap(bm['name'], :remove => '') }

This should add -U (force-share) to the convert operation.

I will try it on the original file and will let you know the results.

nachowork90 commented 1 year ago

@Franco-Sparrow Note that we are NOT writing to the original disk, as that would overwrite the VM disk. I think what is happening in your case is that Gluster needs to flush some write operations on the disk, so the actual qemu lock is not released before the next command is issued. That would be in line with @nachowork90's comment.

Could you add the following? (I haven't tried it myself...) The idea is to not request the write lock:

diff --git a/src/tm_mad/lib/backup_qcow2.rb b/src/tm_mad/lib/backup_qcow2.rb
index 9d0a2deffd..d8ff48a328 100755
--- a/src/tm_mad/lib/backup_qcow2.rb
+++ b/src/tm_mad/lib/backup_qcow2.rb
@@ -614,7 +614,7 @@ class KVMDomain
         fsthaw

         qdisk.each do |did, disk|
-            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2')
+            disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2', :U => '')
         end

         log("[BCK]: Full backup done in #{Time.now - init}s")
@@ -706,7 +706,7 @@ class KVMDomain
             sdisk = QemuImg.new("#{@vm_dir}/disk.#{did}")
             ddisk = "#{@bck_dir}/disk.#{did}.0"

-            sdisk.convert(ddisk, :m => '4', :O => 'qcow2')
+            sdisk.convert(ddisk, :m => '4', :O => 'qcow2', :U => '')

             if @checkpoint
                 sdisk.bitmaps.each {|bm| sdisk.bitmap(bm['name'], :remove => '') }

This should add -U (force-share) to the convert operation.

Perfect, this solves the problem; with this option, the issue is gone!!

Thanks Team

Franco-Sparrow commented 1 year ago

I confirm from my side too. Thanks @rsmontero, no need for our workaround; yours is enough to make it work.

You may push the commit to the repo and close this issue if you need. It's solved now.

Let's work now on the other issue :)

Franco-Sparrow commented 1 year ago

@rsmontero The solution for this issue is to apply the change above to the following file:

sudo -u oneadmin nano /var/lib/one/remotes/tm/lib/backup_qcow2.rb

Thank you for your help.

PS: Remember to label this issue and commit the changes once they have been verified by the team. This is an improvement to the OpenNebula backup code, not an issue of the storage backend used in the cloud. Your storage backend could become slower over time under heavy load or due to old drives with bad I/O performance; the OpenNebula code must be ready to perform the backups even in these circumstances. With these changes OpenNebula is able to back up a VM without the "write lock" error, from the RUNNING state and from the POWEROFF state.

rsmontero commented 1 year ago

Change has been merged as per 78910010d35683864d3a479a2f0c6eb17c95229d

Thanks again for your valuable and detailed feedback :)