Closed: Franco-Sparrow closed this issue 1 year ago.
Is this happening to you only after using the reset flag?
Hi Sir, thank you for the quick reply.
It happens always; it doesn't matter whether or not the reset flag was specified. We made a workaround for it and fixed it. Please take a look at the code we added and the details.
Please include the full VM log files so we can verify configuration issues; I think it could be related to your Gluster setup.
That is already checked, as we make a snapshot before converting the disk. No, the issue is this one:
Thu May 4 09:49:38 2023 [Z0][VMM][I]: Error: Error executing 'virsh --connect qemu:///system snapshot-create-as --name one-84-backup --disk-only --atomic --diskspec vda,file=/var/lib/one//datastores/0/84/tmp/overlay_0.qcow2 2a727440-613a-40c3-9b13-26c57517727f':
Thu May 4 09:49:38 2023 [Z0][VMM][I]:
Thu May 4 09:49:38 2023 [Z0][VMM][I]: error: operation failed: domain moment one-84-backup already exists
You can either delete that manually or try a reset to clean the backup chain.
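For reference, manual cleanup of the leftover snapshot metadata could look like this (assuming the libvirt domain is named one-84, matching the log above; since the snapshot was created with --disk-only, removing only its metadata is enough to clear the "already exists" error):

virsh --connect qemu:///system snapshot-delete one-84 one-84-backup --metadata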
I got the same issue, but I'm not able to reproduce it in different scenarios; the only common denominator is the storage backend's lock release!
Wrong log, sorry for that. I got that error after many tries; a snapshot was left over during the tests. But that error is not the one from the main post here. Here are the lines from the first test:
Thu May 4 08:50:58 2023 [Z0][VM][I]: New LCM state is BACKUP
Thu May 4 08:51:00 2023 [Z0][VMM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/shared/prebackup_live ON-C2-N2-kvm:/var/lib/one//datastores/0/84 0: 2a727440-613a-40c3-9b13-26c57517727f 84 0
Thu May 4 08:51:00 2023 [Z0][VMM][E]: prebackup_live: Command failed:
Thu May 4 08:51:00 2023 [Z0][VMM][I]: export LANG=C
Thu May 4 08:51:00 2023 [Z0][VMM][I]: export LC_ALL=C
Thu May 4 08:51:00 2023 [Z0][VMM][I]: set -ex -o pipefail
Thu May 4 08:51:01 2023 [Z0][VMM][I]:
Thu May 4 08:51:01 2023 [Z0][VMM][I]: # ----------------------------------
Thu May 4 08:51:01 2023 [Z0][VMM][I]: # Prepare the tmp and backup folders
Thu May 4 08:51:01 2023 [Z0][VMM][I]: # ----------------------------------
Thu May 4 08:51:01 2023 [Z0][VMM][I]: [ -d /var/lib/one//datastores/0/84/tmp ] && rm -rf /var/lib/one//datastores/0/84/tmp
Thu May 4 08:51:01 2023 [Z0][VMM][I]:
Thu May 4 08:51:01 2023 [Z0][VMM][I]: [ -d /var/lib/one//datastores/0/84/backup ] && rm -rf /var/lib/one//datastores/0/84/backup
Thu May 4 08:51:01 2023 [Z0][VMM][I]:
Thu May 4 08:51:01 2023 [Z0][VMM][I]: mkdir -p /var/lib/one//datastores/0/84/tmp
Thu May 4 08:51:01 2023 [Z0][VMM][I]:
Thu May 4 08:51:01 2023 [Z0][VMM][I]: mkdir -p /var/lib/one//datastores/0/84/backup
Thu May 4 08:51:01 2023 [Z0][VMM][I]:
Thu May 4 08:51:01 2023 [Z0][VMM][I]: echo "PFZNPjxJRD [...]
Thu May 4 08:51:04 2023 [Z0][VMM][I]: Uz48SUQ+NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4=
Thu May 4 08:51:04 2023 [Z0][VMM][I]: " > /var/lib/one//datastores/0/84/backup/vm.xml
Thu May 4 08:51:04 2023 [Z0][VMM][I]:
Thu May 4 08:51:04 2023 [Z0][VMM][I]: # --------------------------------------
Thu May 4 08:51:04 2023 [Z0][VMM][I]: # Create backup live
Thu May 4 08:51:04 2023 [Z0][VMM][I]: # --------------------------------------
Thu May 4 08:51:04 2023 [Z0][VMM][I]: /var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May 4 08:51:04 2023 [Z0][VMM][I]:
Thu May 4 08:51:04 2023 [Z0][VMM][I]: Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
Thu May 4 08:51:04 2023 [Z0][VMM][I]: qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Thu May 4 08:51:04 2023 [Z0][VMM][I]: Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Thu May 4 08:51:04 2023 [Z0][VMM][I]: Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0':
Thu May 4 08:51:04 2023 [Z0][VMM][I]: qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock
Thu May 4 08:51:04 2023 [Z0][VMM][I]: Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Thu May 4 08:51:04 2023 [Z0][VMM][I]: + '[' -d /var/lib/one//datastores/0/84/tmp ']'
Thu May 4 08:51:04 2023 [Z0][VMM][I]: + '[' -d /var/lib/one//datastores/0/84/backup ']'
Thu May 4 08:51:04 2023 [Z0][VMM][I]: + mkdir -p /var/lib/one//datastores/0/84/tmp
Thu May 4 08:51:04 2023 [Z0][VMM][I]: + mkdir -p /var/lib/one//datastores/0/84/backup
Thu May 4 08:51:04 2023 [Z0][VMM][I]: + echo 'PFZNPjxJRD44NDwvSUQ+PFVJR [...]
Thu May 4 08:51:08 2023 [Z0][VMM][I]: Uz48SUQ+NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4=
Thu May 4 08:51:08 2023 [Z0][VMM][I]: '
Thu May 4 08:51:08 2023 [Z0][VMM][I]: + /var/tmp/one/tm/lib/backup_qcow2.rb -l -d 0: -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May 4 08:51:08 2023 [Z0][VMM][I]: Failed to execute transfer manager driver operation: prebackup_live.
Thu May 4 08:51:08 2023 [Z0][VMM][E]: BACKUP: ERROR: prebackup_live: Command failed: export LANG=C export LC_ALL=C set -ex -o pipefail # ---------------------------------- # Prepare the tmp and backup folders # ---------------------------------- [ -d /var/lib/one//datastores/0/84/tmp ] && rm -rf /var/lib/one//datastores/0/84/tmp [ -d /var/lib/one//datastores/0/84/backup ] && rm -rf /var/lib/one//datastores/0/84/backup mkdir -p /var/lib/one//datastores/0/84/tmp mkdir -p /var/lib/one//datastores/0/84/backup echo "PFZNPjxJRD44 [...] " > /var/lib/one//datastores/0/84/backup/vm.xml # -------------------------------------- # Create backup live # -------------------------------------- /var/tmp/one/tm/lib/backup_qcow2.rb -l -d "0:" -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84 Error: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0': qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]? Error preparing disk files: Error executing 'qemu-img convert -m 4 -O qcow2 /var/lib/datastores/gluster_mount/0/84/disk.0.snap/0 /var/lib/one//datastores/0/84/backup/disk.0.0': qemu-img: Could not open '/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0': Failed to get shared "write" lock Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]? + '[' -d /var/lib/one//datastores/0/84/tmp ']' + '[' -d /var/lib/one//datastores/0/84/backup ']' + mkdir -p /var/lib/one//datastores/0/84/tmp + mkdir -p /var/lib/one//datastores/0/84/backup + echo 'PFZNPjxJRD4 [...] NjU8L0lEPjwvQkFDS1VQX0lEUz48L0JBQ0tVUFM+PC9WTT4= ' + /var/tmp/one/tm/lib/backup_qcow2.rb -l -d 0: -x /var/lib/one//datastores/0/84/backup/vm.xml -p /var/lib/one//datastores/0/84
Thu May 4 08:51:08 2023 [Z0][VM][I]: New LCM state is RUNNING
Sir, don't you think it would be wise for OpenNebula to check whether the file (the qcow2 disk) is writable before executing a backup, regardless of whether Gluster is guilty or not in this situation?
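For illustration, such a pre-check could probe the image with qemu-img info, which fails with the same lock error while another process holds the image (a hypothetical helper sketch, not existing OpenNebula code):

# Hypothetical helper, not part of backup_qcow2.rb: qemu-img requests the
# image lock when opening, so a plain `qemu-img info` fails with
# "Failed to get shared 'write' lock" while the lock is still held.
def disk_lock_released?(path)
  system('qemu-img', 'info', path, :out => File::NULL, :err => File::NULL)
end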
Using an I/O-intensive VM for the test, we can reproduce the same behavior, independently of the storage backend's quality!
@Franco-Sparrow Note that we are NOT writing to the original disk, as that would overwrite the VM disk. I think what is happening in your case is that Gluster needs to flush some write operations on the disk, so the actual qemu lock is not released before the next command is issued. That would be aligned with @nachowork90's comment.
Could you add the following? (I haven't tried it myself...) The idea is to not request the write lock:
diff --git a/src/tm_mad/lib/backup_qcow2.rb b/src/tm_mad/lib/backup_qcow2.rb
index 9d0a2deffd..d8ff48a328 100755
--- a/src/tm_mad/lib/backup_qcow2.rb
+++ b/src/tm_mad/lib/backup_qcow2.rb
@@ -614,7 +614,7 @@ class KVMDomain
fsthaw
qdisk.each do |did, disk|
- disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2')
+ disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2', :U => '')
end
log("[BCK]: Full backup done in #{Time.now - init}s")
@@ -706,7 +706,7 @@ class KVMDomain
sdisk = QemuImg.new("#{@vm_dir}/disk.#{did}")
ddisk = "#{@bck_dir}/disk.#{did}.0"
- sdisk.convert(ddisk, :m => '4', :O => 'qcow2')
+ sdisk.convert(ddisk, :m => '4', :O => 'qcow2', :U => '')
if @checkpoint
sdisk.bitmaps.each {|bm| sdisk.bitmap(bm['name'], :remove => '') }
This should add -U (force share) to the convert operation.
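For clarity, with this change the copy is issued as qemu-img convert -m 4 -O qcow2 -U <src> <dst>, so qemu-img does not try to acquire the image lock. As a minimal illustration of how such an option hash could map to command-line flags (the actual QemuImg class may build the command differently):

# Illustration only, not the real QemuImg implementation: each
# :flag => value pair becomes "-flag value", or a bare "-flag" when the
# value is the empty string, so :U => '' yields "-U".
def convert_cmd(src, dst, opts = {})
  flags = opts.map {|k, v| v.to_s.empty? ? "-#{k}" : "-#{k} #{v}" }
  "qemu-img convert #{flags.join(' ')} #{src} #{dst}"
end

# convert_cmd('disk.0', 'backup/disk.0.0', :m => '4', :O => 'qcow2', :U => '')
# => "qemu-img convert -m 4 -O qcow2 -U disk.0 backup/disk.0.0"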
I will try it on the original file and will let you know the results.
Perfect, this solves the problem. With this option the issue is gone!!
Thanks Team
I confirm from my side too. Thanks @rsmontero, no need for our workaround; yours is enough to make it work.
You may push the commit to the repo and close this issue if you want. It's solved now.
Let's work now on the other issue :)
@rsmontero The solution for this issue is the following:
sudo -u oneadmin nano /var/lib/one/remotes/tm/lib/backup_qcow2.rb
Press Alt + Shift + 3 to show line numbers.
Replace the old lines with the following ones:
# [...]
132
133 sleep(1)
134 end
# [...]
619 disk.convert("#{@bck_dir}/disk.#{did}.0", :m => '4', :O => 'qcow2', :U => '')
# [...]
711 sdisk.convert(ddisk, :m => '4', :O => 'qcow2', :U => '')
# [...]
Update the hosts:
sudo -u oneadmin onehost sync -f
Thank you for your help.
PS: Remember to label this issue and commit the changes once they have been verified by the team. This is an improvement to the OpenNebula backup code, not an issue of the storage backend used in the cloud. Your storage backend can become slower over time with heavy load or due to the use of old drives with bad I/O performance; the OpenNebula code must be ready to perform the backups even under these circumstances. With these changes OpenNebula is able to back up a VM without the "write lock" error, both from the RUNNING state and from the POWEROFF state.
The change has been merged as 78910010d35683864d3a479a2f0c6eb17c95229d
Thanks again for your valuable and detailed feedback :)
Description
Live backups can't be done because of a write lock on the qcow2 disk.
To Reproduce
Backup options for VM 84:
System datastore:
Images datastore:
Gluster volume info:
Trying to make a live backup with --reset:
The following is a snippet from /var/log/one/84.log after the backup action failed:
Trying to make a live backup without --reset:
In both cases the main concern is this one:
Failed to get shared "write" lock Is another process using the image [/var/lib/datastores/gluster_mount/0/84/disk.0.snap/0]?
Expected behavior
Backup without the "write lock" error.
Proposed solution
Thanks to nacho for the help here :)
Edit the file /var/lib/one/remotes/tm/lib/backup_qcow2.rb on all the orchestrators:
Leave this as follows (line 24 edited):
Add the following code from line 563 to 573 (definition of waitState), right before the definition of backup_full_live:
Add the following code just after line 628 (new lines added from line 629 to 637):
Update the hosts:
The output is as follows:
Run the backup again:
The VM logfile /var/log/one/84.log will show the success of the FULL backup:
Run an incremental backup now:
The VM logfile /var/log/one/84.log will show the success of the INCREMENTAL backup:
The workaround reconfigures OpenNebula to wait for the file (the VM qcow2 disk) to become writable, for up to 120s (2 min). If the file is write-locked, it waits a period of 5s before trying again. If after 24 periods (120s) the file has not been released for writing, the backup fails, just as it failed before the workaround.
This was done for a cluster with external storage (Gluster Distributed-Disperse), shared over a Gigabit Ethernet network on SSD SATA drives.
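As an illustration, a minimal Ruby sketch of the described wait loop (hypothetical helper names; the actual lines added to backup_qcow2.rb may differ):

# Hypothetical sketch of the described retry behavior, not the exact
# patch: probe every 5 seconds, at most 24 times (120s total), then
# fail the backup just as it failed before the workaround.
RETRY_PERIOD   = 5  # seconds between probes
RETRY_ATTEMPTS = 24 # 24 * 5s = 120s

def disk_lock_released?(path)
  # qemu-img cannot open the image while another process holds its lock
  system('qemu-img', 'info', path, :out => File::NULL, :err => File::NULL)
end

def wait_until_writable(path)
  RETRY_ATTEMPTS.times do
    return if disk_lock_released?(path)
    sleep(RETRY_PERIOD)
  end

  raise "#{path} still write-locked after #{RETRY_PERIOD * RETRY_ATTEMPTS}s"
end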