apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0
1.97k stars 1.09k forks source link

After using Migrate Data from Image Store, new snapshots on moved disks failing #9430

Open nxsbi opened 1 month ago

nxsbi commented 1 month ago
ISSUE TYPE
COMPONENT NAME
UI
CLOUDSTACK VERSION
4.19
CONFIGURATION

XCP-NG

OS / ENVIRONMENT

NA

SUMMARY

After adding new secondary store, I moved (balanced) from existing secondary store to new secondary store. The move completed successfully. However, new snapshots are all failing for the snapshots that moved over to the new secondary store. In logs, It appears Cloudstack is trying to mount the old NFS server's snapshot folder path. However the cloud.snapshot_store_ref shows the correct new path for all snapshots (on new NFS server).

Note that the new NFS share is from Truenas, while old one was NFS share from CentOS 7 linux - not that it should make any difference.

STEPS TO REPRODUCE

Have existing recurring snapshots (Daily). Add a new secondary storage, move to balance. The volumes that moved to new storage fail
EXPECTED RESULTS
Snapshots continue to function
ACTUAL RESULTS
Snapshot job fails. Below is a sample error

```
2024-07-22 09:38:55,499 DEBUG [c.c.s.r.StorageSubsystemCommandHandlerBase] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Executing command CopyCommand: [{"srcTO":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"path":"360da96e-e989-40b1-ac6d-0b7d45019d9e","volume":{"uuid":"ff1a290f-5bbc-4477-be99-9b7b83d09fed","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","name":"Primary Storage CargoHold3","id":13,"poolType":"NetworkFilesystem","host":"XX.XX.XX.31","path":"/mnt/data/vmdata3/prod/primary","port":2049,"url":"NetworkFilesystem://XX.XX.XX.31/mnt/data/vmdata3/prod/primary/?ROLE=Primary&STOREUUID=0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","isManaged":false}},"name":"DATA-999","size":536870912000,"path":"371ca470-c808-4cec-99b8-fd43bdbc73fd","volumeId":424,"vmName":"i-100-176-VM","accountId":10,"format":"VHD","provisioningType":"THIN","poolId":13,"id":424,"deviceId":7,"cacheMode":"NONE","hypervisorType":"XenServer","directDownload":false,"deployAsIs":false,"followRedirects":true},"parentSnapshotPath":"687d708d-2dec-45ac-a20c-cc697c32b82a","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","name":"Primary Storage CargoHold3","id":13,"poolType":"NetworkFilesystem","host":"XX.XX.XX.31","path":"/mnt/data/vmdata3/prod/primary","port":2049,"url":"NetworkFilesystem://XX.XX.XX.31/mnt/data/vmdata3/prod/primary/?ROLE=Primary&STOREUUID=0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","isManaged":false}},"vmName":"i-100-176-VM","name":"VMTest_DATA-999_20240722163619","hypervisorType":"XenServer","id":28689,"quiescevm":false,"parents":["687d708d-2dec-45ac-a20c-cc697c32b82a"],"physicalSize":0,"accountId":10,"followRedirects":false}},"destTO":{"org.apache.cloudstack.storage.to.SnapshotObjectTO":{"path":"snapshots/10/424","volume":{"uuid":"ff1a290f-5bbc-4477-be99-9b7b83d09fed","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","name":"Primary Storage CargoHold3","id":13,"poolType":"NetworkFilesystem","host":"XX.XX.XX.31","path":"/mnt/data/vmdata3/prod/primary","port":2049,"url":"NetworkFilesystem://XX.XX.XX.31/mnt/data/vmdata3/prod/primary/?ROLE=Primary&STOREUUID=0dbbc58c-fc4c-31a6-9787-5ce215ade6f9","isManaged":false}},"name":"DATA-999","size":536870912000,"path":"371ca470-c808-4cec-99b8-fd43bdbc73fd","volumeId":424,"vmName":"i-100-176-VM","accountId":10,"format":"VHD","provisioningType":"THIN","poolId":13,"id":424,"deviceId":7,"cacheMode":"NONE","hypervisorType":"XenServer","directDownload":false,"deployAsIs":false,"followRedirects":true},"dataStore":{"com.cloud.agent.api.to.NfsTO":{"_url":"nfs://XX.XX.XX.31/mnt/data/vmdata3/prod/secondary","_role":"Image"}},"vmName":"i-100-176-VM","name":"VMTest_DATA-999_20240722163619","hypervisorType":"XenServer","id":28689,"quiescevm":false,"physicalSize":0,"accountId":10,"followRedirects":false}},"executeInSequence":true,"options":{"snapshot.backup.to.secondary":"true","fullSnapshot":"false"},"options2":{},"wait":43200,"bypassHostMaintenance":false}].
2024-07-22 09:38:56,598 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) No file SR found for path: /var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424
2024-07-22 09:38:56,599 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Creating file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:57,391 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Unpluging PBD [f4a04899-a9e9-99ad-06d4-c7b9443afb11] of SR [f6063a1e-bfc8-3bb0-b62b-9932e5edfaf0] as it is not working properly.
2024-07-22 09:38:57,411 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Forgetting SR [f6063a1e-bfc8-3bb0-b62b-9932e5edfaf0] as it is not working properly.
2024-07-22 09:38:57,472 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Could not create file SR [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] on host [327366b9-a80c-496c-9fd2-1547c4c4465a].
2024-07-22 09:38:57,474 DEBUG [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:57,474 WARN  [c.c.h.x.r.XenServerStorageProcessor] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) BackupSnapshot Failed due to Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:58,438 DEBUG [c.c.a.m.DirectAgentAttache] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Seq 11-6261129381952313782: Response Received:
2024-07-22 09:38:58,439 DEBUG [c.c.a.t.Request] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Seq 11-6261129381952313782: Processing:  { Ans: , MgmtId: 196482732657985, via: 11(xc-host100), Ver: v1, Flags: 110, [{"org.apache.cloudstack.storage.command.CopyCmdAnswer":{"result":"false","details":"BackupSnapshot Failed due to Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]","wait":"0","bypassHostMaintenance":"false"}}] }
2024-07-22 09:38:58,439 DEBUG [c.c.a.m.AgentAttache] (DirectAgent-476:ctx-2f182f84) (logid:66082af1) Seq 11-6261129381952313782: No more commands found
2024-07-22 09:38:58,439 DEBUG [c.c.a.t.Request] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Seq 11-6261129381952313782: Received:  { Ans: , MgmtId: 196482732657985, via: 11(xc-host100), Ver: v1, Flags: 110, { CopyCmdAnswer } }
2024-07-22 09:38:58,552 DEBUG [c.c.s.s.SnapshotManagerImpl] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Failed to create snapshotBackupSnapshot Failed due to Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:58,553 DEBUG [c.c.r.ResourceLimitManagerImpl] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Updating resource Type = snapshot count for Account = 10 Operation = decreasing Amount = 1
2024-07-22 09:38:58,618 DEBUG [c.c.r.ResourceLimitManagerImpl] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Updating resource Type = secondary_storage count for Account = 10 Operation = decreasing Amount = (500.00 GB) 536870912000
2024-07-22 09:38:58,656 ERROR [o.a.c.s.v.VolumeServiceImpl] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Take snapshot: 424 failed
2024-07-22 09:38:58,661 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: BackupSnapshot Failed due to Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:58,661 INFO  [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706 ctx-8efd6070) (logid:66082af1) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: BackupSnapshot Failed due to Exception in backupsnapshot stage due to com.cloud.utils.exception.CloudRuntimeException: Could not retrieve an already used file SR for path [/var/cloud_mount/5926ce99-6aec-3e14-9737-3cf5a2e1c3e1/snapshots/10/424] or create a new file SR on host [327366b9-a80c-496c-9fd2-1547c4c4465a]
2024-07-22 09:38:58,661 DEBUG [c.c.v.VmWorkJobDispatcher] (Work-Job-Executor-152:ctx-f87ab687 job-71705/job-71706) (logid:66082af1) Done with run of VM work job: com.cloud.vm.VmWorkTakeVolumeSnapshot for VM 176, job origin: 71705

```
Pearl1594 commented 1 month ago

@nxsbi I gather this is a Xen environment. Are there a chain of snapshots present for the given volume? If yes, have all the snapshots belonging to the chain been migrated to the newly added secondary store? the destination storage where the new volume snapshot is backed up to is decided based on the location of the previous snapshot: https://github.com/apache/cloudstack/blob/4.19.1.0/engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/snapshot/DefaultSnapshotStrategy.java#L151.

nxsbi commented 1 month ago

@nxsbi I gather this is a Xen environment. Are there a chain of snapshots present for the given volume? If yes, have all the snapshots belonging to the chain been migrated to the newly added secondary store? the destination storage where the new volume snapshot is backed up to is decided based on the location of the previous snapshot: https://github.com/apache/cloudstack/blob/4.19.1.0/engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/snapshot/DefaultSnapshotStrategy.java#L151.

Hello, Yes, all the snapshots have moved over - there is no more folder for volume Id under the snapshots/account/ in the old storage. All of them are in the new storage.

One thing I noticed is that the records in the snapshot_store_ref for these volumes show parent_snapshot_id as 0 - which is usually only when it’s the first full snapshot, and second these records show a Physical_size =0.

I went through the code, but I couldn’t determine what query is actually getting executed to get the store id. Can you provide what query is actually getting run ? I can try running that and figure out what records are coming back and see if I can troubleshoot this

Pearl1594 commented 1 month ago

Based on the observation you made, it seems like the following piece of code didn't run: https://github.com/apache/cloudstack/blob/4.19.1.0/engine/storage/image/src/main/java/org/apache/cloudstack/storage/image/SecondaryStorageServiceImpl.java#L223-L227 . Did the migration operation complete successfully? Have the original records of the snapshot (in snapshot_store_ref) been deleted? The store chosen during creation of a snapshot is determined from the parent snapshot - this is the code snip that helps in getting the secondary store for backing up the snapshot https://github.com/apache/cloudstack/blob/4.19.1.0/engine/storage/snapshot/src/main/java/org/apache/cloudstack/storage/snapshot/SnapshotServiceImpl.java#L298