apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Failed Volume Snapshot shows State as BackedUp #8946

Open nischalnischal2020 opened 5 months ago

nischalnischal2020 commented 5 months ago
ISSUE TYPE
COMPONENT NAME
Snapshot
CLOUDSTACK VERSION
Failed snapshot due to hardware 
CONFIGURATION
ACS 4.17.2 with Ceph storage
OS / ENVIRONMENT

ACS 4.17.2 with Ceph storage and the global setting "snapshot.backup.to.secondary = False"

SUMMARY

An error occurred on the NFS server while the snapshot was being taken: the NFS server rebooted during the snapshot process. The issue was that the state of the snapshot was shown as "Creating".

STEPS TO REPRODUCE
  1. Take a volume snapshot
  2. When the snapshot is in progress, reboot the NFS VM or disconnect the NFS server
  3. Ideally the snapshot should enter the Error state, but its state was shown as BackedUp
EXPECTED RESULTS
We expected the snapshot to get into Error state, but it went into BackedUp state
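
The expected behaviour can be written down as a small state transition. This is a simplified illustration only, not CloudStack's actual snapshot state machine: the `SnapshotState` enum and the `nextState` helper are hypothetical names, and only the states mentioned in this report are modelled.

```java
// Hypothetical, simplified model of the states mentioned in this report.
enum SnapshotState { CREATING, BACKED_UP, ERROR }

final class SnapshotOutcome {
    private SnapshotOutcome() {}

    // Assumption: a snapshot that is being created ends in BACKED_UP only if
    // the operation reported success; any failure should end in ERROR.
    static SnapshotState nextState(SnapshotState current, boolean operationSucceeded) {
        if (current != SnapshotState.CREATING) {
            return current; // states other than CREATING are left unchanged in this sketch
        }
        return operationSucceeded ? SnapshotState.BACKED_UP : SnapshotState.ERROR;
    }
}
```

Under this model, the report describes a failed operation ending in `BACKED_UP` where `ERROR` was expected.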
ACTUAL RESULTS
![image](https://github.com/apache/cloudstack/assets/60923541/c293e8c5-a717-4aaa-9c15-28368f774dfd)

DaanHoogland commented 5 months ago

@nischalnischal2020 did you check the NFS share? Was the file there, and if so, what state was it in? Did you find any errors in the management server logs or in the agent logs for the SSVM and the host involved?

nischalnischal2020 commented 5 months ago

Hi @DaanHoogland

There was no file in the secondary storage. Besides this, we use the global setting "snapshot.backup.to.secondary = False", so snapshot files remain in primary storage.

The logs show the error as

2024-03-22 15:47:49,643 DEBUG c.c.a.t.Request (logid:) Seq 19-8222165544694979585: Processing: { Ans: , MgmtId: 195808829246451, via: 19, Ver: v1, Flags: 10, [{"org.apache.cloudstack.storage.command.CopyCmdAnswer":{"result":"false","details":"org.apache.cloudstack.utils.qemu.QemuImgException: qemu-img: Could not open 'rbd:arch-int-vpc-prim/fe23b82c-e8c4-4a14-a4dc-6ea3d54a6c55@db6e00a4-882d-4ad5-b827-b1db5f1bb9e6:mon_host=172.20.202.10:auth_supported=cephx:id=stackusr:key=AQBGk7Fi8IrUDBAA2qvfs+QVVYJ0Ri8jAk7Hiw==:rbd_default_format=2:client_mount_timeout=30': error reading header from fe23b82c-e8c4-4a14-a4dc-6ea3d54a6c55: No such file or directory","wait":"0","bypassHostMaintenance":"false"}}] }
2024-03-22 15:47:49,643 DEBUG [c.c.a.t.Request] (API-Job-Executor-73:ctx-b37e560d job-63965 ctx-7185241a) (logid:606af370) Seq 19-8222165544694979585: Received: { Ans: , MgmtId: 195808829246451, via: 19(SBARCLD-INT-VPC5), Ver: v1, Flags: 10, { CopyCmdAnswer } }
2024-03-22 15:47:49,644 DEBUG [o.a.c.s.s.SnapshotServiceImpl] (API-Job-Executor-73:ctx-b37e560d job-63965 ctx-7185241a) (logid:606af370) Failed to copy snapshot
java.lang.RuntimeException: InvocationTargetException when invoking RPC callback for command: copySnapshotAsyncCallback
    at org.apache.cloudstack.framework.async.AsyncCallbackDispatcher.dispatch(AsyncCallbackDispatcher.java:154)
    at org.apache.cloudstack.framework.async.InplaceAsyncCallbackDriver.performCompletionCallback(InplaceAsyncCallbackDriver.java:25)
    at org.apache.cloudstack.framework.async.AsyncCallbackDispatcher.complete(AsyncCallbackDispatcher.java:126)
    at org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:534)
    at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:84)
    at org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:106)
    at org.apache.cloudstack.storage.snapshot.SnapshotServiceImpl.backupSnapshot(SnapshotServiceImpl.java:283)
    at org.apache.cloudstack.storage.snapshot.DefaultSnapshotStrategy.backupSnapshot(DefaultSnapshotStrategy.java:177)
    at org.apache.cloudstack.snapshot.SnapshotHelper.backupSnapshotToSecondaryStorageIfNotExists(SnapshotHelper.java:134)
    at com.cloud.template.TemplateManagerImpl.createPrivateTemplate(TemplateManagerImpl.java:1644)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
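
When triaging logs like the excerpt above, it can help to pull out only the failed `CopyCmdAnswer` entries. The following is a minimal sketch, assuming the answer fragment always has the shape seen in this excerpt (`"result"` immediately followed by an optional `"details"` string with no embedded double quotes). `CopyAnswerScanner` and `failureDetails` are hypothetical names for illustration, not part of CloudStack.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CopyAnswerScanner {
    private CopyAnswerScanner() {}

    // Assumption: matches the CopyCmdAnswer fragment as it appears in the
    // log excerpt above; other log layouts are not handled.
    private static final Pattern ANSWER = Pattern.compile(
        "\"org\\.apache\\.cloudstack\\.storage\\.command\\.CopyCmdAnswer\":\\{"
        + "\"result\":\"(true|false)\""
        + "(?:,\"details\":\"([^\"]*)\")?");

    // Returns the failure details if the line carries a CopyCmdAnswer whose
    // result is "false"; returns an empty Optional otherwise.
    static Optional<String> failureDetails(String logLine) {
        Matcher m = ANSWER.matcher(logLine);
        if (m.find() && "false".equals(m.group(1))) {
            String details = m.group(2);
            return Optional.of(details == null ? "" : details);
        }
        return Optional.empty();
    }
}
```

Applied to the first log line above, this would return the `QemuImgException` message, while the abbreviated `{ CopyCmdAnswer }` line yields an empty result because it carries no result field.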

DaanHoogland commented 5 months ago

And was there a file on primary storage?

Also, your stack trace shows TemplateManagerImpl.createPrivateTemplate. If a template is to be created from a snapshot or a volume, the snapshot always has to be copied to secondary storage first.

harikrishna-patnala commented 3 months ago

@nischalnischal2020 please check the newly created PR #9239 to address an issue which I've observed while checking your issue here.

The issue I observed occurs not while taking the snapshot but while creating the template from the snapshot (the stack trace also points to this).

I could not reproduce the original issue of a failed snapshot showing the BackedUp state rather than Error (it might already have been fixed after 4.17.2), but I saw another serious issue.

The issue is that whenever a snapshot is used to create a template or volume and backing up the snapshot to the secondary store fails, the MS, as part of handling that failure, deletes the snapshot in primary storage itself.

These changes were introduced as part of PR https://github.com/apache/cloudstack/pull/5297

  1. Create a snapshot of a volume (set snapshot.backup.to.secondary = False)
  2. Create a template from that snapshot
  3. As part of the creation, the MS first tries to back up the snapshot to the secondary storage
  4. I made this backup fail
  5. The MS recognized the failure and, as part of handling it, deleted the snapshot on the primary storage (also marking the snapshot_store_ref entry for the primary store role as "Destroyed")
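
To make the intended behaviour concrete, here is a minimal sketch of failure handling that touches only the secondary-store reference and leaves the primary copy alone. It is not the code from PR #9239 or from CloudStack itself: `StoreRole`, `StoreRef`, and `cleanUpAfterFailedBackup` are hypothetical stand-ins for the store roles and `snapshot_store_ref` rows, and the "Ready" state used for the primary copy is an assumption.

```java
import java.util.List;

final class BackupFailureCleanup {
    private BackupFailureCleanup() {}

    // Hypothetical store roles: PRIMARY holds the original snapshot,
    // IMAGE stands for the secondary-storage copy.
    enum StoreRole { PRIMARY, IMAGE }

    // Hypothetical stand-in for one snapshot_store_ref row.
    static final class StoreRef {
        final StoreRole role;
        String state;

        StoreRef(StoreRole role, String state) {
            this.role = role;
            this.state = state;
        }
    }

    // After a failed backup, mark only the secondary (IMAGE) reference as
    // "Destroyed"; references on primary storage are deliberately untouched.
    static void cleanUpAfterFailedBackup(List<StoreRef> refs) {
        for (StoreRef ref : refs) {
            if (ref.role == StoreRole.IMAGE) {
                ref.state = "Destroyed";
            }
        }
    }
}
```

The design point is simply that cleanup is scoped by store role, so a failed copy to secondary storage cannot cost the user the only remaining copy of the snapshot.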

rohityadavcloud commented 2 weeks ago

Partly addressed in https://github.com/apache/cloudstack/pull/9239; please check and close the ticket. cc @nischalnischal2020 @harikrishna-patnala