LINBIT / linstor-server

High-performance software-defined block storage for containers, cloud and virtualisation. Fully integrated with Docker, Kubernetes, OpenStack, Proxmox, etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0

ZFS: create volumes with more than 8k blocksize #128

Closed: ggzengel closed this issue 2 years ago

ggzengel commented 4 years ago

https://github.com/LINBIT/linstor-server/blob/ddc51b7bd790236eab2d85c8bd4c9ee49d504ff9/satellite/src/main/java/com/linbit/linstor/storage/utils/ZfsCommands.java#L69

If you have a ZFS RAIDZ pool with ashift=12 and more than 3 data HDDs (plus parity), the volblocksize should be larger than 8k. Here is a thread that describes the problem: https://forum.proxmox.com/threads/zfs-replica-2x-larger-than-original.49801/

So please add -o volblocksize= while creating the volume. If you have x data HDDs plus parity, then blocksize = 2^floor(log2(x)) * 2^ashift.

For example, with 16 disks in RAIDZ3 and ashift=12: x = (16-3) = 13 => floor(log2(13)) = 3 => blocksize = 2^3 * 2^12 = 32k.
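
A minimal sketch of that calculation in Java (illustration only, not LINSTOR code; the method name recommendedVolBlockSize is made up for the example):

// Recommended ZFS volblocksize for a RAIDZ pool, following the formula above:
// 2^floor(log2(dataDisks)) * 2^ashift, where dataDisks = total disks minus parity disks.
static long recommendedVolBlockSize(int dataDisks, int ashift) {
    int exponent = 31 - Integer.numberOfLeadingZeros(dataDisks); // floor(log2(dataDisks)), dataDisks >= 1
    return (1L << exponent) << ashift;                           // 2^exponent * 2^ashift bytes
}

// Example: 16 disks in RAIDZ3 with ashift=12 => dataDisks = 13,
// recommendedVolBlockSize(13, 12) == 32768 bytes == 32k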

ggzengel commented 4 years ago

Can you please add something like StorDriver/LvcreateOptions, but for zfs create?

ghernadi commented 4 years ago

> Can you please add something like StorDriver/LvcreateOptions, but for zfs create?

Yes. StorDriver/ZfscreateOptions will be included in the next release.

However, even with this property you will have to manually calculate the desired volblocksize and set the mentioned property accordingly.

ggzengel commented 4 years ago

Thanks. This will help a lot.

> However, even with this property you will have to manually calculate the desired volblocksize and set the mentioned property accordingly.

I have no problem with the calculation. Can you put it in the documentation for others?

ggzengel commented 4 years ago

I did a test with a 32G VHD and got:

zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000
Error message: cannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size

Why do you add (33,561,640 − 32×2^20) = 7,208 KiB? Could you add 7,424k (mod 256k = 0) or 8,192k (mod 1024k = 0) instead?

If you add bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?

# rg lp zfs_12
┊ StorDriver/ZfscreateOptions ┊ -o volblocksize=16k ┊
ERROR REPORT 5F429B8D-F67D5-000005

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            1.8.0
Build ID:                           e56b6c2a80b6d000921a998e3ba4cd1102fbdd39
Build time:                         2020-08-17T13:02:52+00:00
Error time:                         2020-08-23 22:46:09
Node:                               px1.scr-wi.local

============================================================

Reported error:
===============

Description:
    Failed to create zfsvolume
Additional information:
    Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. 

    Standard out: 

    Error message: 
    cannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size

Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #69

Error message:                      Failed to create zfsvolume

Error context:
    An error occurred while processing resource 'Node: 'px1', Rsc: 'vm-107-disk-1''

Call backtrace:

    Method                                   Native Class:Line number
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:69
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:104
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:64
    genericExecutor                          N      com.linbit.linstor.layer.storage.utils.Commands:52
    create                                   N      com.linbit.linstor.layer.storage.zfs.utils.ZfsCommands:86
    createLvImpl                             N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:208
    createLvImpl                             N      com.linbit.linstor.layer.storage.zfs.ZfsProvider:61
    createVolumes                            N      com.linbit.linstor.layer.storage.AbsStorageProvider:387
    process                                  N      com.linbit.linstor.layer.storage.AbsStorageProvider:299
    process                                  N      com.linbit.linstor.layer.storage.StorageLayer:279
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:763
    processChild                             N      com.linbit.linstor.layer.drbd.DrbdLayer:448
    adjustDrbd                               N      com.linbit.linstor.layer.drbd.DrbdLayer:565
    process                                  N      com.linbit.linstor.layer.drbd.DrbdLayer:383
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:763
    processResourcesAndSnapshots             N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:309
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:145
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:258
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:896
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:618
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:535
    run                                      N      java.lang.Thread:834

END OF ERROR REPORT.
update VM 107: -scsi1 ZFS_DRBD_12:32
TASK ERROR: error during cfs-locked 'storage-ZFS_DRBD_12' operation: API Return-Code: 500. Message: Could not create resource definition vm-107-disk-1 from resource group zfs_12, because: [{"ret_code":20447233,"message":"Successfully set property key(s): StorDriver/ZfscreateOptions","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":19922945,"message":"Volume definition with number '0' successfully  created in resource definition 'vm-107-disk-1'.","obj_refs":{"RscGrp":"zfs_12","RscDfn":"vm-107-disk-1","VlmNr":"0"}},{"ret_code":20447233,"message":"New resource definition 'vm-107-disk-1' created.","details":"Resource definition 'vm-107-disk-1' UUID is: 39e74123-008a-4a15-85e8-a4ab894e94ed","obj_refs":{"RscGrp":"zfs_12","UUID":"39e74123-008a-4a15-85e8-a4ab894e94ed","RscDfn":"vm-107-disk-1"}},{"ret_code":20185089,"message":"Successfully set property key(s): StorPoolName","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":20185089,"message":"Successfully set property key(s): StorPoolName","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":21233665,"message":"Resource 'vm-107-disk-1' successfully autoplaced on 2 nodes","details":"Used nodes (storage pool name): 'px1 (zfs_12)', 'px2 (zfs_12)'","obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":-4611686018406153242,"message":"(Node: 'px2') Failed to create zfsvolume","details":"Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size\n\n","error_report_ids":["5F42A019-520BA-000000"],"obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}},{"ret_code":-4611686018406153242,"message":"(Node: 'px1') Failed to create zfsvolume","details":"Command 'zfs create -V 33561640KB -o volblocksize=16k zpool1/proxmox/drbd/vm-107-disk-1_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot create 'zpool1/proxmox/drbd/vm-107-disk-1_00000': volume size must be a multiple of volume block size\n\n","error_report_ids":["5F429B8D-F67D5-000000"],"obj_refs":{"RscDfn":"vm-107-disk-1","RscGrp":"zfs_12"}}]  at /usr/share/perl5/PVE/Storage/Custom/LINSTORPlugin.pm line 282.  
PVE::Storage::Custom::LINSTORPlugin::alloc_image("PVE::Storage::Custom::LINSTORPlugin", "ZFS_DRBD_12", HASH(0x55f256945fd0), 107, "raw", undef, 33554432) called at /usr/share/perl5/PVE/Storage.pm line 824    eval {...} called at /usr/share/perl5/PVE/Storage.pm line 824   PVE::Storage::__ANON__() called at /usr/share/perl5/PVE/Cluster.pm line 614     eval {...} called at /usr/share/perl5/PVE/Cluster.pm line 582   PVE::Cluster::__ANON__("storage-ZFS_DRBD_12", undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Cluster.pm line 659   PVE::Cluster::cfs_lock_storage("ZFS_DRBD_12", undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Storage/Plugin.pm line 461    PVE::Storage::Plugin::cluster_lock_storage("PVE::Storage::Custom::LINSTORPlugin", "ZFS_DRBD_12", 1, undef, CODE(0x55f24f98f9c0)) called at /usr/share/perl5/PVE/Storage.pm line 829     PVE::Storage::vdisk_alloc(HASH(0x55f256939fa0), "ZFS_DRBD_12", 107, "raw", undef, 33554432) called at /usr/share/perl5/PVE/API2/Qemu.pm line 188    PVE::API2::Qemu::__ANON__("scsi1", HASH(0x55f2568b9d48)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 461  PVE::AbstractConfig::foreach_volume_full("PVE::QemuConfig", HASH(0x55f256a8c238), undef, CODE(0x55f24f98f6d8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 470    PVE::AbstractConfig::foreach_volume("PVE::QemuConfig", HASH(0x55f256a8c238), CODE(0x55f24f98f6d8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 221     eval {...} called at /usr/share/perl5/PVE/API2/Qemu.pm line 221     PVE::API2::Qemu::__ANON__(PVE::RPCEnvironment=HASH(0x55f2568adba0), "root\@pam", HASH(0x55f2568a8878), "x86_64", HASH(0x55f256939fa0), 107, undef, HASH(0x55f256a8c238)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1269  PVE::API2::Qemu::__ANON__("UPID:px2:00003985:001F27D9:5F42EF9A:qmconfig:107:root\@pam:") called at /usr/share/perl5/PVE/RESTEnvironment.pm line 610     eval {...} called at /usr/share/perl5/PVE/RESTEnvironment.pm line 601   PVE::RESTEnvironment::fork_worker(PVE::RPCEnvironment=HASH(0x55f2568adba0), "qmconfig", 107, "root\@pam", CODE(0x55f256a915e8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1319   PVE::API2::Qemu::__ANON__() called at /usr/share/perl5/PVE/AbstractConfig.pm line 285   PVE::AbstractConfig::__ANON__() called at /usr/share/perl5/PVE/Tools.pm line 213    eval {...} called at /usr/share/perl5/PVE/Tools.pm line 213     PVE::Tools::lock_file_full("/var/lock/qemu-server/lock-107.conf", 10, 0, CODE(0x55f2568ae758)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 288    PVE::AbstractConfig::__ANON__("PVE::QemuConfig", 107, 10, 0, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 308    PVE::AbstractConfig::lock_config_full("PVE::QemuConfig", 107, 10, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/AbstractConfig.pm line 316   PVE::AbstractConfig::lock_config("PVE::QemuConfig", 107, CODE(0x55f256a8c3b8)) called at /usr/share/perl5/PVE/API2/Qemu.pm line 1348    PVE::API2::Qemu::__ANON__(HASH(0x55f2568c0d40)) called at /usr/share/perl5/PVE/RESTHandler.pm line 453  PVE::RESTHandler::handle("PVE::API2::Qemu", HASH(0x55f254894970), HASH(0x55f2568c0d40)) called at /usr/share/perl5/PVE/HTTPServer.pm line 177   eval {...} called at /usr/share/perl5/PVE/HTTPServer.pm line 140    PVE::HTTPServer::rest_handler(PVE::HTTPServer=HASH(0x55f2568adc60), "172.19.36.101", "POST", "/nodes/px2/qemu/107/config", HASH(0x55f2568c0f38), HASH(0x55f25690e190), "extjs") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 746   eval {...} called at 
/usr/share/perl5/PVE/APIServer/AnyEvent.pm line 720    PVE::APIServer::AnyEvent::handle_api2_request(PVE::HTTPServer=HASH(0x55f2568adc60), HASH(0x55f256a4e660), HASH(0x55f2568c0f38), "POST", "/api2/extjs/nodes/px2/qemu/107/config") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 974  eval {...} called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 966    PVE::APIServer::AnyEvent::handle_request(PVE::HTTPServer=HASH(0x55f2568adc60), HASH(0x55f256a4e660), HASH(0x55f2568c0f38), "POST", "/api2/extjs/nodes/px2/qemu/107/config") called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1373  PVE::APIServer::AnyEvent::__ANON__(AnyEvent::Handle=HASH(0x55f256945c70), "scsi1=ZFS_DRBD_12%3A32&digest=0801d8a753fb783a9d5ba47413ede61"...) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 1505   AnyEvent::Handle::__ANON__(AnyEvent::Handle=HASH(0x55f256945c70)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 1315   AnyEvent::Handle::_drain_rbuf(AnyEvent::Handle=HASH(0x55f256945c70)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Handle.pm line 2015    AnyEvent::Handle::__ANON__(EV::IO=SCALAR(0x55f256a8c508), 1) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Impl/EV.pm line 88     eval {...} called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent/Impl/EV.pm line 88   AnyEvent::CondVar::Base::_wait(AnyEvent::CondVar=HASH(0x55f255fd2ee0)) called at /usr/lib/x86_64-linux-gnu/perl5/5.28/AnyEvent.pm line 2026     AnyEvent::CondVar::Base::recv(AnyEvent::CondVar=HASH(0x55f255fd2ee0)) called at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1660    PVE::APIServer::AnyEvent::run(PVE::HTTPServer=HASH(0x55f2568adc60)) called at /usr/share/perl5/PVE/Service/pvedaemon.pm line 52     PVE::Service::pvedaemon::run(PVE::Service::pvedaemon=HASH(0x55f2568a8a28)) called at /usr/share/perl5/PVE/Daemon.pm line 171    eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 171    PVE::Daemon::__ANON__(PVE::Service::pvedaemon=HASH(0x55f2568a8a28)) called at /usr/share/perl5/PVE/Daemon.pm line 391   eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 380    PVE::Daemon::__ANON__(PVE::Service::pvedaemon=HASH(0x55f2568a8a28), undef) called at /usr/share/perl5/PVE/Daemon.pm line 552    eval {...} called at /usr/share/perl5/PVE/Daemon.pm line 550    PVE::Daemon::start(PVE::Service::pvedaemon=HASH(0x55f2568a8a28), undef) called at /usr/share/perl5/PVE/Daemon.pm line 661   PVE::Daemon::__ANON__(HASH(0x55f24f985fd0)) called at /usr/share/perl5/PVE/RESTHandler.pm line 453  PVE::RESTHandler::handle("PVE::Service::pvedaemon", HASH(0x55f2568a8d70), HASH(0x55f24f985fd0)) called at /usr/share/perl5/PVE/RESTHandler.pm line 865  eval {...} called at /usr/share/perl5/PVE/RESTHandler.pm line 848   PVE::RESTHandler::cli_handler("PVE::Service::pvedaemon", "pvedaemon start", "start", ARRAY(0x55f24fcba5d8), ARRAY(0x55f24f9a6050), undef, undef, undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 591   PVE::CLIHandler::__ANON__(ARRAY(0x55f24f9861f8), CODE(0x55f24fd03108), undef) called at /usr/share/perl5/PVE/CLIHandler.pm line 668     PVE::CLIHandler::run_cli_handler("PVE::Service::pvedaemon", "prepare", CODE(0x55f24fd03108)) called at /usr/bin/pvedaemon line 27
ghernadi commented 4 years ago

For some reason ZFS does not round the volsize by itself, so LINSTOR has to do it. To do so, I had to add a check for whether this new ZfscreateOptions property modifies the volblocksize, but by mistake I only added a check for -b, not for -o volblocksize=.
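
To illustrate the rounding, a minimal sketch (an assumption of how such rounding could look, not necessarily the actual LINSTOR implementation; the method name is made up):

// Round a requested volume size (in KiB) up to the next multiple of the
// volblocksize, since zfs create rejects sizes that are not a multiple of
// the block size.
static long roundUpToBlockSize(long sizeKib, long blockSizeKib) {
    long remainder = sizeKib % blockSizeKib;
    return remainder == 0 ? sizeKib : sizeKib + (blockSizeKib - remainder);
}

// Example: roundUpToBlockSize(33561640, 16) == 33561648 KiB,
// which is accepted with volblocksize=16k.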

This will be fixed in the next release. Until then, please use -b 16K instead. I'll reopen this ticket and leave it open until the fix is verified.

> If you add bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?

I am not sure what you mean by this?

ggzengel commented 4 years ago

> If you add bytes, does this mean I can't use these volumes as native Proxmox volumes during disaster recovery?

> I am not sure what you mean by this?

I meant: if you increase the size, you do this for LINSTOR/DRBD metadata. Where do you put the metadata? If you put the metadata at the front, I can't use the zvol as a native Proxmox zvol.

If something goes wrong with LINSTOR during an upgrade, or the database gets corrupted, or something else happens, I can still use the zvol natively with zfs rename and by patching the vm*.conf. It's like using RAID1 disks on plain SATA controllers, which works because the RAID controller writes its metadata at the end of the disks.

ghernadi commented 4 years ago

If DRBD is using internal metadata, DRBD writes it at the end of the device, as stated in the docs.

ggzengel commented 4 years ago

Thanks for the link. Another workaround is using external metadata, because Proxmox's volsize is always a multiple of 1GB (mod 1GB = 0). Does this work with ZFS, or do you always increase the volsize?

I normally use 2 Intel Optane drives as ZIL, with LVM underneath. So I could use them as a metadata store, too?

ghernadi commented 4 years ago

> Does this work with ZFS, or do you always increase the volsize?

That should work. In LINSTOR, currently only DRBD with internal metadata and the LUKS layer need additional space for metadata (although LUKS always requires 16MB, which should be fine with ordinary blocksizes :) )

> I normally use 2 Intel Optane drives as ZIL, with LVM underneath. So I could use them as a metadata store, too?

Yep, sounds like a good idea.

ggzengel commented 4 years ago

Now I did this with a workaround from #176 and LINBIT/linstor-client/issues/42:

sp c lvm px1 zfs_12_meta VG1
sp c lvm px2 zfs_12_meta VG1
linstor sp sp px1 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
linstor sp sp px2 zfs_12_meta StorDriver/LvcreateOptions "-m 1 VG1 /dev/nvme0n1 /dev/nvme1n1"
rg sp zfs_12 StorPoolNameDrbdMeta zfs_12_meta
rg sp zfs_12 DrbdMetaType external
linstor rg sp zfs_12 StorDriver/ZfscreateOptions "-o volblocksize=16k"

Here is the reward: I saved 50% of the space and doubled the speed, because ZFS had used only one half of the vdev (ZFS is strange about this).

# zfs list zpool1/proxmox/drbd/vm-107-disk-1_00000 zpool1/proxmox/drbd/vm-107-disk-2_00000 -o name,volblocksize,used,volsize,refreservation,usedbyrefreservation
NAME                                     VOLBLOCK   USED  VOLSIZE  REFRESERV  USEDREFRESERV
zpool1/proxmox/drbd/vm-107-disk-1_00000       16K  37.1G      32G      37.1G          37.1G
zpool1/proxmox/drbd/vm-107-disk-2_00000        8K  74.1G      32G      74.1G          74.1G
# pvs -o +lv_name | grep 107
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rmeta_0] 
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rimage_0]
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rmeta_0] 
  /dev/nvme0n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rimage_0]
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rmeta_1] 
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-1.meta_00000_rimage_1]
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rmeta_1] 
  /dev/nvme1n1 VG1 lvm2 a--  <372.61g <317.59g [vm-107-disk-2.meta_00000_rimage_1]

Could somebody from Proxmox (@fabian-gruenbichler?) put this into the Proxmox docs for other people?

Fabian-Gruenbichler commented 4 years ago

> Could somebody from Proxmox (@Fabian-Gruenbichler?) put this into the Proxmox docs for other people?

We have a (not-yet-updated and thus not-yet-merged) patch for our docs covering the general 'raidz + zvol => high space usage overhead with default settings' issue, which we will include in our reference documentation at some point. I don't think we'll add LINSTOR-specific hints to our documentation, as that integration and plugin are not developed by us.

rp- commented 2 years ago

The ZFS block size can be specified with the following property setter: linstor storage-pool set-property <node> <pool> StorDriver/ZfscreateOptions "-b 32k"