apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Xenserver/XCP-NG Volume Migration - Non-managed (NFS) <--> Managed (Solidfire) storage #5915

Open tsinik-dw opened 2 years ago

tsinik-dw commented 2 years ago
ISSUE TYPE
COMPONENT NAME
API, UI
CLOUDSTACK VERSION
4.16.0
CONFIGURATION

ACS 4.16.0, 1 Zone
Cluster A: Two XenServer 7.0 hosts
Cluster B: Two XCP-NG 8.2 hosts

Each cluster has its own NFS primary storage (non-managed storage). There is a zone-wide Solidfire storage (managed storage).

OS / ENVIRONMENT

VM-A1, on Cluster A, has 1 ROOT disk and 1 DATA disk; both disks on NFS primary storage.
VM-A2, on Cluster A, has 1 ROOT disk and 1 DATA disk; both disks on Solidfire storage.
VM-B1, on Cluster B, has 1 ROOT disk and 1 DATA disk; both disks on NFS primary storage.
VM-B2, on Cluster B, has 1 ROOT disk and 1 DATA disk; both disks on Solidfire storage.

SUMMARY

We want to migrate VM DATA volumes between NFS primary storage (non-managed storage) and Solidfire storage (managed storage), both ways.

  1. Trying to migrate VM-A1, VM-B1 Volumes from non-managed --> Managed (TRIED WITH VMs IN RUNNING AND STOPPED STATE, SAME RESULT)

    The UI does not offer any available storage choice and we get the following message:

    No primary storage pools available for migration
  2. Trying to migrate VM-A2, VM-B2 Volumes from Managed --> Unmanaged: (TRIED WITH VMs IN RUNNING AND STOPPED STATE, SAME RESULT)

    We get the following message:

    Migrating volume failed
    Resource [StoragePool:1] is unreachable: Migrate volume failed: com.cloud.utils.exception.CloudRuntimeException: 
    Migration operation failed in 'StorageSystemDataMotionStrategy.handleVolumeMigrationFromManagedStorageToNonManagedStorage': 
    Currently, only the KVM hypervisor type is supported for the migration of a volume from managed storage to non-managed storage.

    It turns out that this feature is only supported on KVM.

STEPS TO REPRODUCE
1. Create VM1 with DATA volume on Solidfire storage
2. Create VM2 with DATA volume on NFS primary storage
3. Try to migrate VM1 DATA volume from  Solidfire to NFS Storage
4. Try to migrate VM2 DATA volume from NFS to Solidfire Storage
EXPECTED RESULTS
DATA volume should be migrated
ACTUAL RESULTS
Same behavior as described in the SUMMARY above, with the VMs in both running and stopped state: migration from non-managed to managed storage offers no storage pools in the UI ("No primary storage pools available for migration"), and migration from managed to non-managed storage fails with "Migrating volume failed ... Currently, only the KVM hypervisor type is supported for the migration of a volume from managed storage to non-managed storage."
tsinik-dw commented 2 years ago

I should also mention that volume migration from non-managed to managed storage was functional in ACS 4.13.1, following the steps described by Mike Tutkowski in https://youtu.be/lkVMb6elvz4 (the actual migration is performed at 31:25).

nvazquez commented 2 years ago

Hi @tsinik-dw. For case number 1), can you try the migrate volume API, specifying the target storage pool UUID as a parameter, instead of going through the UI? (This looks like a UI bug.) For case number 2), it is not supported.
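If it helps, one way to find the UUID to pass as the storageid parameter is to read it from the storage_pool table (a sketch, assuming direct DB access; listStoragePools via the API/UI shows the same UUID):

-- List candidate target pools and their UUIDs (the SolidFire pool's uuid goes into storageid).
SELECT id, name, uuid, pool_type, scope
FROM storage_pool
WHERE removed IS NULL;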

tsinik-dw commented 2 years ago

Hi @nvazquez,

I just tried the volume migration for case number 1 using cmk, but it didn't work.

The cmk command and output are:

(noc-dev) 🐱 > migrate volume storageid=2514b65e-b231-4b2e-932c-c897f2df7c79 volumeid=09243386-b5f2-4920-afc3-3505d2ee311c livemigrate=true newdiskofferingid=5b764ddb-ea60-40b1-8ff5-586953266e92
{
  "accountid": "d0987ed7-8031-11ec-9ad0-ba21ccf13580",
  "cmd": "org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin",
  "completed": "2022-03-03T10:08:59+0200",
  "created": "2022-03-03T10:07:25+0200",
  "jobid": "d1d2f226-d195-4a8b-970a-1570173a1d76",
  "jobprocstatus": 0,
  "jobresult": {
    "errorcode": 530,
    "errortext": "Resource [StoragePool:2] is unreachable: Migrate volume failed: com.cloud.utils.exception.CloudRuntimeException: Migration operation failed in 'StorageSystemDataMotionStrategy.handleVolumeMigrationFromNonManagedStorageToManagedStorage': Failed to migrate volume with ID 132 to storage pool with ID 2"
  },
  "jobresultcode": 530,
  "jobresulttype": "object",
  "jobstatus": 2,
  "userid": "d09ca276-8031-11ec-9ad0-ba21ccf13580"
}

πŸ™ˆ Error: async API failed for job d1d2f226-d195-4a8b-970a-1570173a1d76

I also attach the management log and storage_pool table (in CSV): nv_vol_migr_to_managed_cmk.txt storage_pool.txt

nvazquez commented 2 years ago

Thanks for the logs @tsinik-dw! It seems the pool is simply out of space, according to the error thrown:

2022-03-03 10:08:57,148 ERROR [c.c.h.x.r.w.x.XenServer610MigrateVolumeCommandWrapper] (DirectAgent-242:ctx-0c966ba9) (logid:d1d2f226) Caught exception com.xensource.xenapi.Types$BadAsyncResult due to the following: Task failed! Task record:                 uuid: 0d12e02c-7f8f-6c11-603c-f04e3b8e1dc1 
           nameLabel: Async.VDI.pool_migrate 
     nameDescription:  
   allowedOperations: [] 
   currentOperations: {} 
             created: Thu Mar 03 10:08:15 EET 2022 
            finished: Thu Mar 03 10:08:43 EET 2022 
              status: failure 
          residentOn: com.xensource.xenapi.Host@6634ea40 
            progress: 1.0 
                type: <none/> 
              result:  
           errorInfo: [SR_BACKEND_FAILURE_44, , There is insufficient space] 
         otherConfig: {} 
           subtaskOf: com.xensource.xenapi.Task@aaf13f6f 
            subtasks: [] 

tsinik-dw commented 2 years ago

Hi @nvazquez,

This error message is a little weird. After repeating the same test today with a 2 GB DATA volume and digging deeper into the logs, I came across the following error in the SMlog of the pool master:

Mar  4 13:39:30 xen8-c1 SM: [5541] vdi_create {'sr_uuid': 'e71497cb-a0b7-ac0d-f836-f363811663b6', 'subtask_of': 'DummyRef:|4cb983a2-be5a-0b1b-296e-08bbb8e53a57|VDI.create', 'vdi_type': 'user', 'args': ['2147483648', 'DATA-1815', '', '', 'false', '19700101T00:00:00Z', '', 'false'], 'host_ref': 'OpaqueRef:517abf46-0e48-9ed9-ba7a-188f8260a820', 'session_ref': 'OpaqueRef:607e2408-0bc7-8a3e-0306-8693e6fc0657', 'device_config': {'target': '192.168.70.233', 'multihomelist': '192.168.70.233:3260', 'targetIQN': 'iqn.2010-01.com.solidfire:slwz.data-1815.331', 'SRmaster': 'true', 'device': '/dev/disk/mpInuse/36f47acc100000000736c777a0000014b', 'SCSIid': '36f47acc100000000736c777a0000014b'}, 'command': 'vdi_create', 'sr_ref': 'OpaqueRef:9b3d9125-9a98-ebf1-6677-944554ad2c71', 'vdi_sm_config': {'base_mirror': '07d9936d-1b5f-e16b-080e-f41a33d452d2/06c8e955-2404-4fd7-86e1-8d65fba194f0'}}
     Mar  4 13:39:30 xen8-c1 SM: [5541] LVHDVDI.create for 0bb897ed-de6a-4c70-8fd3-aaaa23b4548c
     Mar  4 13:39:30 xen8-c1 SM: [5541] LVHDVDI.create: type = vhd, /dev/VG_XenStorage-e71497cb-a0b7-ac0d-f836-f363811663b6/VHD-0bb897ed-de6a-4c70-8fd3-aaaa23b4548c (size=2147483648)
     Mar  4 13:39:30 xen8-c1 SM: [5541] ['/sbin/vgs', '--noheadings', '--nosuffix', '--units', 'b', 'VG_XenStorage-e71497cb-a0b7-ac0d-f836-f363811663b6']
     Mar  4 13:39:30 xen8-c1 SM: [5541]   pread SUCCESS
     Mar  4 13:39:30 xen8-c1 SM: [5541] Not enough space! free space: 167772160, need: 2160066560
     Mar  4 13:39:30 xen8-c1 SM: [5541] Raising exception [44, There is insufficient space]
     Mar  4 13:39:30 xen8-c1 SM: [5541] lock: released /var/lock/sm/e71497cb-a0b7-ac0d-f836-f363811663b6/sr
     Mar  4 13:39:30 xen8-c1 SM: [5541] ***** generic exception: vdi_create: EXCEPTION <class 'SR.SROSError'>, There is insufficient space

The SR created on the destination pool was 2.2 GB. It seems that it does not respect the hv_ss_reserve value on the disk offering, so with this size it cannot create a snapshot of the VDI for live migration. The hv_ss_reserve value in our case is 200.

nvazquez commented 2 years ago

@tsinik-dw thanks, I could validate that in the logs:

2022-03-03 10:07:25,351 DEBUG [c.c.s.StorageManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Destination pool id: 2 
2022-03-03 10:07:25,360 DEBUG [c.c.s.StorageManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Pool ID for the volume with ID 132 is 1 
2022-03-03 10:07:25,365 DEBUG [c.c.s.StorageManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Found storage pool SOLIDFIRE of type Iscsi 
2022-03-03 10:07:25,366 DEBUG [c.c.s.StorageManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Total capacity of the pool SOLIDFIRE with ID 2 is (60.00 GB) 64424476455 
2022-03-03 10:07:25,370 DEBUG [c.c.s.StorageManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Checking pool: 2 for storage allocation , maxSize : (60.00 GB) 64424476455, totalAllocatedSize : (23.00 GB) 24696061952, askingSize : (2.20 GB) 2362232064, allocated disable threshold: 0.85 
2022-03-03 10:07:25,433 DEBUG [c.c.u.AccountManagerImpl] (API-Job-Executor-34:ctx-95b16032 job-473 ctx-9a984c53) (logid:d1d2f226) Access granted to Acct[d0987ed7-8031-11ec-9ad0-ba21ccf13580-admin] to com.cloud.storage.DiskOfferingVO$$EnhancerByCGLIB$$39e596b2@3ec58fe5 by AffinityGroupAccessChecker 

I checked the calculation: the asking size on the SolidFire provider uses the volume size and the volume's hv_ss_reserve value from the volumes table. To calculate the asking size, CS adds volume size * (hv_ss_reserve / 100) to the volume size in bytes. If in your case hv_ss_reserve = 200, then CS would ask for 3 times the volume size; since the asking size logged above is about 2.2 GB, is the volume size around 700 MB? Reference: https://github.com/apache/cloudstack/blob/4.16/plugins/storage/volume/solidfire/src/main/java/org/apache/cloudstack/storage/datastore/driver/SolidFirePrimaryDataStoreDriver.java#L449
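A minimal SQL sketch of that check (assuming direct read access to the cloud database and the formula above; column names are those of the volumes table):

-- Rough estimate of the space CloudStack will ask for on the target pool
-- (assumption: asking size = size + size * hv_ss_reserve / 100, per the driver reference above).
SELECT id,
       size AS size_bytes,
       hv_ss_reserve,
       size + size * (hv_ss_reserve / 100) AS approx_asking_bytes
FROM volumes
WHERE id = 132;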

Can you please share the DB output of the following queries?

select * from volumes where id = 132;
select * from disk_offering where uuid = "5b764ddb-ea60-40b1-8ff5-586953266e92";

tsinik-dw commented 2 years ago

Hi @nvazquez,

here is the output of the SQL queries:

select * from volumes where id = 132;

"id","account_id","domain_id","pool_id","last_pool_id","instance_id","device_id","name","uuid","size","folder","path","pod_id","data_center_id","iscsi_name","host_ip","volume_type","pool_type","disk_offering_id","template_id","first_snapshot_backup_uuid","recreatable","created","attached","updated","removed","state","chain_info","update_count","disk_type","vm_snapshot_chain_size","iso_id","display_volume","format","min_iops","max_iops","hv_ss_reserve","provisioning_type"
132,2,1,1,,,,DATAVOL-1,"09243386-b5f2-4920-afc3-3505d2ee311c",2147483648,,"0f7024ca-e21f-4398-a236-16536f978755",,1,,,DATADISK,IscsiLUN,21,,,0,2022-03-03 07:52:32,,2022-03-03 12:32:20,2022-03-03 12:32:20,Expunged,,14,,,,1,VHD,,,0,thin

Please note that this volume is now in the Expunged state because I tried to delete it after the unsuccessful migration and it didn't work.

Also, the disk offering output is currently the following, but there have been manual changes to the values in several fields: select * from disk_offering where uuid = "5b764ddb-ea60-40b1-8ff5-586953266e92";

"id","name","uuid","display_text","disk_size","type","tags","recreatable","use_local_storage","unique_name","system_use","customized","removed","created","sort_key","display_offering","customized_iops","min_iops","max_iops","bytes_read_rate","bytes_read_rate_max","bytes_read_rate_max_length","bytes_write_rate","bytes_write_rate_max","bytes_write_rate_max_length","iops_read_rate","iops_read_rate_max","iops_read_rate_max_length","iops_write_rate","iops_write_rate_max","iops_write_rate_max_length","state","hv_ss_reserve","cache_mode","provisioning_type"
22,SF DO 2 (2 GB) 2222-4444,"5b764ddb-ea60-40b1-8ff5-586953266e92",SF DO 2 (2 GB) 2222-4444,2147483648,Disk,sf,0,0,,0,0,,2022-03-03 08:01:30,0,1,0,2222,4444,,,,,,,,,,,,,Active,100,none,thin

However, I did try similar volume migrations with the following volume and disk offering records. As you can see, the volume size is 2 GB and the disk offering is 12 GB (I tried with 7 GB too).

VOLUME

"id","account_id","domain_id","pool_id","last_pool_id","instance_id","device_id","name","uuid","size","folder","path","pod_id","data_center_id","iscsi_name","host_ip","volume_type","pool_type","disk_offering_id","template_id","first_snapshot_backup_uuid","recreatable","created","attached","updated","removed","state","chain_info","update_count","disk_type","vm_snapshot_chain_size","iso_id","display_volume","format","min_iops","max_iops","hv_ss_reserve","provisioning_type"
134,2,1,1,,131,1,DATAVOL-2,feab49ab-7381-4d55-a86a-6d2c51faa8dc,2147483648,,"03394dc4-d95a-47cc-8fb3-9950e92a2b44",,1,,,DATADISK,IscsiLUN,21,,,0,2022-03-03 12:32:59,2022-03-03 12:34:34,2022-03-03 12:41:32,,Ready,,8,,,,1,VHD,,,0,thin

DISK OFFERING

"id","name","uuid","display_text","disk_size","type","tags","recreatable","use_local_storage","unique_name","system_use","customized","removed","created","sort_key","display_offering","customized_iops","min_iops","max_iops","bytes_read_rate","bytes_read_rate_max","bytes_read_rate_max_length","bytes_write_rate","bytes_write_rate_max","bytes_write_rate_max_length","iops_read_rate","iops_read_rate_max","iops_read_rate_max_length","iops_write_rate","iops_write_rate_max","iops_write_rate_max_length","state","hv_ss_reserve","cache_mode","provisioning_type"
25,SF DO 12 GB,deea8a00-3024-4ccd-ae95-0df023c4e76d,SF DO 12 GB,12884901888,Disk,sf,0,0,,0,0,,2022-03-03 12:31:32,0,1,0,1111,2222,,,,,,,,,,,,,Active,200,none,thin
nvazquez commented 2 years ago

Thanks @tsinik-dw. Unfortunately I cannot test this, but I do see that the volume has hv_ss_reserve = 0, which may be the reason why it's not reserving more space. Have you tried without specifying the new disk offering on the API method? Do you get the same error? For the sake of testing, can you manually update the hv_ss_reserve value for a test volume and attempt the migration again?
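For reference, a rough sketch of such a manual tweak (an assumption-laden example, not an official procedure: it edits the cloud database directly and treats volume id 134 from the dump above as a throwaway test volume):

-- Testing only: force a snapshot reserve on one test volume so the XenServer
-- live migration has headroom (volume id 134 is assumed to be disposable).
UPDATE volumes
SET hv_ss_reserve = 200
WHERE id = 134
  AND removed IS NULL;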

tsinik-dw commented 2 years ago

Hi @nvazquez,

it worked! I ran multiple tests and got several error messages, which I list at the end of this report. Please excuse my gigantic post :-), but the information may be helpful.

First, the steps that led to a successful volume migration.

  1. Create Compute Offering (CO 1) without tags
  2. Create a VM using CO 1
  3. Alter the CO 1 DB entry in the disk_offering table (set min_iops, max_iops, hv_ss_reserve)
  4. Alter the VM's volume DB entry in the volumes table (set min_iops, max_iops, hv_ss_reserve); a SQL sketch of these DB edits follows after the output below. At this point the UI still does not offer the Solidfire storage as an available option for volume migration, so
  5. in cmk I execute the following: migrate volume storageid=2514b65e-b231-4b2e-932c-c897f2df7c79 volumeid=83bc0063-b73c-4581-8bd0-c7e59d34263e livemigrate=true

and the volume gets migrated, giving the following output:

{
  "volume": {
    "account": "admin",
    "created": "2022-03-09T10:14:51+0200",
    "destroyed": false,
    "deviceid": 0,
    "diskioread": 0,
    "diskiowrite": 0,
    "diskkbsread": 0,
    "diskkbswrite": 0,
    "displayvolume": true,
    "domain": "ROOT",
    "domainid": "747b2d33-8031-11ec-9ad0-ba21ccf13580",
    "hypervisor": "XenServer",
    "id": "83bc0063-b73c-4581-8bd0-c7e59d34263e",
    "isextractable": false,
    "maxiops": 4000,
    "miniops": 1000,
    "name": "ROOT-133",
    "path": "c71023ba-d338-4290-8b1c-8a314a662bc2",
    "provisioningtype": "thin",
    "quiescevm": false,
    "serviceofferingdisplaytext": "CO 1 Desc",
    "serviceofferingid": "a1c6f819-c653-406e-8a86-4b39ed4b744a",
    "serviceofferingname": "CO 1",
    "size": 5368709120,
    "state": "Ready",
    "storage": "SOLIDFIRE",
    "storageid": "2514b65e-b231-4b2e-932c-c897f2df7c79",
    "storagetype": "shared",
    "tags": [],
    "templatedisplaytext": "Centos 7",
    "templateid": "7bdc9edc-3afd-42e3-a23c-150ec4b58afa",
    "templatename": "Centos 7",
    "type": "ROOT",
    "virtualmachineid": "3e28031a-5019-4459-a19b-6fa0b73c4373",
    "vmdisplayname": "VM-XCP-2",
    "vmname": "VM-XCP-2",
    "vmstate": "Running",
    "zoneid": "97299276-7257-4b79-a1df-51cf89c402e2",
    "zonename": "ZONE1"
  }
}

For the sake of completeness, I attach the corresponding management log entries: nv_vol_migr_to_managed_cmk_success.txt
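A minimal SQL sketch of the manual edits in steps 3-4 (assumptions: direct access to the cloud database; the disk_offering id is a placeholder for CO 1's row, the volume uuid is the ROOT volume from the output above, and the IOPS values simply mirror that output):

-- Step 3 (sketch): set IOPS limits and a snapshot reserve on CO 1's disk_offering row.
-- The id 26 is a placeholder; look up the real id for CO 1 first.
UPDATE disk_offering
SET min_iops = 1000, max_iops = 4000, hv_ss_reserve = 200
WHERE id = 26;

-- Step 4 (sketch): mirror the same values on the VM's ROOT volume so the SolidFire driver
-- reserves enough space for the live migration snapshot.
UPDATE volumes
SET min_iops = 1000, max_iops = 4000, hv_ss_reserve = 200
WHERE uuid = '83bc0063-b73c-4581-8bd0-c7e59d34263e'
  AND removed IS NULL;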

Now, some remarks and errors:

  1. If I use the newdiskofferingid option, the migration does not work, even if the old and new disk offerings are exactly the same and have no tags. Error message:

(noc-dev) 🐱 > migrate volume storageid=2514b65e-b231-4b2e-932c-c897f2df7c79 volumeid=58a087e5-eab5-4588-acd2-b48d27895e8b livemigrate=true newdiskofferingid=b89a1d02-c775-470c-bd13-2a655d74cd49
{
  "accountid": "d0987ed7-8031-11ec-9ad0-ba21ccf13580",
  "cmd": "org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin",
  "completed": "2022-03-09T11:37:41+0200",
  "created": "2022-03-09T11:37:41+0200",
  "jobid": "3dc0fe30-8e9a-4911-8bcc-a428adda4fcf",
  "jobprocstatus": 0,
  "jobresult": {
    "errorcode": 431,
    "errortext": "The disk offering informed is not valid [id=b89a1d02-c775-470c-bd13-2a655d74cd49]."
  },
  "jobresultcode": 530,
  "jobresulttype": "object",
  "jobstatus": 2,
  "userid": "d09ca276-8031-11ec-9ad0-ba21ccf13580"
}

🙈 Error: async API failed for job 3dc0fe30-8e9a-4911-8bcc-a428adda4fcf


2. No matter the values and tag combinations in the disk offerings, if the `newdiskofferingid` option is used, the error message is the same as above
3. If the initial compute offering uses a disk offering with a tag, then I have to manually add this tag to the managed storage in `storage_pool_tags` (see the SQL sketch after this list). Otherwise, I get the following:

(noc-dev) 🐱 > migrate volume storageid=2514b65e-b231-4b2e-932c-c897f2df7c79 volumeid=c925d774-1d1a-40b9-af62-ba21ad001f08 livemigrate=true
{
  "accountid": "d0987ed7-8031-11ec-9ad0-ba21ccf13580",
  "cmd": "org.apache.cloudstack.api.command.admin.volume.MigrateVolumeCmdByAdmin",
  "completed": "2022-03-09T11:56:59+0200",
  "created": "2022-03-09T11:56:59+0200",
  "jobid": "b8535cbb-e996-452d-bd37-97e1057653b1",
  "jobprocstatus": 0,
  "jobresult": {
    "errorcode": 530,
    "errortext": "Migration target pool [null, tags:sf,solidfire] has no matching tags for volume [ROOT-135, uuid:c925d774-1d1a-40b9-af62-ba21ad001f08, tags:nfsPrimaryXCP]"
  },
  "jobresultcode": 530,
  "jobresulttype": "object",
  "jobstatus": 2,
  "userid": "d09ca276-8031-11ec-9ad0-ba21ccf13580"
}

🙈 Error: async API failed for job b8535cbb-e996-452d-bd37-97e1057653b1


I guess that, in a production environment, a `proxy` compute offering with e.g. a `migration` tag must be used, so as not to change the normal initial compute offering, and this same tag must be added to the destination storage.

4. Finally, after migration, I changed the Compute Offering to the correct one (with the appropriate tag) and everything went smoothly.
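A sketch of the tag workaround from point 3 (assumptions: direct DB access, the storage_pool_tags columns are pool_id and tag, pool id 2 is the SolidFire pool, and 'nfsPrimaryXCP' is the tag carried by the volume's current offering; updating the pool's tags through the UI/API is the more usual route):

-- Workaround sketch: add the volume's storage tag to the managed (SolidFire) pool
-- so the tag check during migration passes.
INSERT INTO storage_pool_tags (pool_id, tag)
VALUES (2, 'nfsPrimaryXCP');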
nvazquez commented 2 years ago

Great @tsinik-dw - we're working on a fix for the volume hv_ss_reserve value, but it's not ready yet; once it's produced, I would ask if you could test it. I've also been checking the failures you shared when setting the newdiskofferingid parameter; it seems like the offering passed is removed?
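One quick way to check that (a sketch, assuming direct DB access) is to look at the state/removed columns of the offering that was passed:

-- Was the offering passed as newdiskofferingid soft-deleted or inactive?
-- A non-NULL removed, or state <> 'Active', would explain the "not valid" error.
SELECT id, uuid, name, state, removed
FROM disk_offering
WHERE uuid = 'b89a1d02-c775-470c-bd13-2a655d74cd49';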

tsinik-dw commented 2 years ago

Hi @nvazquez,

I would be happy to test the fix when ready.

Regarding the "The disk offering informed is not valid" error message, it was hard for me to trace the root cause, but I am sure that all the service offerings mentioned exist in my setup. Following are the management log of one such failure and the API log of several such messages from my tests.

nv_mgmtlog_migr_vol_offering_not_valid.txt nv_apilog_migr_vol_offering_not_valid.txt

rohityadavcloud commented 2 years ago

cc @pdion891 - requires your input as it's related to xs/solidfire. cc @shwstppr