apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

Mounting NFS 4 shares of secondary storage fails on CS Hosts #5491

Closed Hudratronium closed 2 years ago

Hudratronium commented 3 years ago
ISSUE TYPE
COMPONENT NAME
Cloudstack-Agent
Secondary-Storage
NFS
CLOUDSTACK VERSION
4.15.2 (Others not tested)
CONFIGURATION

Advanced Network, separated storage network

OS / ENVIRONMENT
SUMMARY

Deploying a new instance fails with the error "Failed to create netfs mount" when mounting the ISO / template from secondary storage for installation of the instance.

STEPS TO REPRODUCE
CloudStack environment with NFSv4.1 / NFSv4 "only"
Deploy instance
EXPECTED RESULTS
- Host mounts the NFS share to provide the installation media to the newly created instance
ACTUAL RESULTS
2021-09-21 22:48:11,198 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:6a9a5a89) org.libvirt.LibvirtException: internal error: Child process (/bin/mount 172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 /mnt/4ca9f81b-7a76-326f-9f2d-e64b5a5bee99 -o nodev,nosuid,noexec) unexpected exit status 32: mount.nfs: Connection timed out

2021-09-21 22:48:11,199 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:6a9a5a89) Failed to create netfs mount: 172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216
org.libvirt.LibvirtException: internal error: Child process (/bin/mount 172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 /mnt/4ca9f81b-7a76-326f-9f2d-e64b5a5bee99 -o nodev,nosuid,noexec) unexpected exit status 32: mount.nfs: Connection timed out

        at org.libvirt.ErrorHandler.processError(Unknown Source)
        at org.libvirt.ErrorHandler.processError(Unknown Source)
        at org.libvirt.Connect.storagePoolCreateXML(Unknown Source)
        at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createNetfsStoragePool(LibvirtStorageAdaptor.java:255)
        at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:621)
        at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:329)
        at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.getStoragePoolByURI(KVMStoragePoolManager.java:284)
        at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.getPhysicalDiskFromNfsStore(LibvirtComputingResource.java:2686)
        at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.createVbd(LibvirtComputingResource.java:2512)
        at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStartCommandWrapper.execute(LibvirtStartCommandWrapper.java:74)
        at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtStartCommandWrapper.execute(LibvirtStartCommandWrapper.java:45)
        at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)
        at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1648)
        at com.cloud.agent.Agent.processRequest(Agent.java:661)
        at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1079)
        at com.cloud.utils.nio.Task.call(Task.java:83)
        at com.cloud.utils.nio.Task.call(Task.java:29)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-21 22:48:11,202 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:6a9a5a89) [Ljava.lang.StackTraceElement;@103a5f3d

Executing the mount manually with the -v option results in:

mount -v 172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 /test -o nodev,nosuid,noexec
mount.nfs: timeout set for Tue Sep 21 23:01:23 2021
mount.nfs: trying text-based options 'vers=4.2,addr=172.17.3.6,clientaddr=172.17.3.4'
mount.nfs: mount(2): Protocol not supported
mount.nfs: trying text-based options 'vers=4.1,addr=172.17.3.6,clientaddr=172.17.3.4'
mount.nfs: mount(2): No such file or directory
mount.nfs: trying text-based options 'addr=172.17.3.6'
mount.nfs: prog 100003, trying vers=3, prot=6

Manually adding the option to use NFSv4:

mount -v 172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 /test -o nodev,nosuid,noexec,vers=4
mount.nfs: timeout set for Tue Sep 21 23:02:50 2021
mount.nfs: trying text-based options 'vers=4,addr=172.17.3.6,clientaddr=172.17.3.4'

The result is a mounted share:

172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 on /test type nfs4 (rw,nosuid,nodev,noexec,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.17.3.4,local_lock=none,addr=172.17.3.6)
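The negotiated protocol version shows up in the `vers=` mount option. As a quick check, it can be pulled out of a line of `mount` output with a small helper (hypothetical, not part of CloudStack):

```shell
# nfs_vers: extract the negotiated NFS protocol version (the vers= option)
# from a single line of `mount` output.
nfs_vers() {
    printf '%s\n' "$1" | sed -n 's/.*[(,]vers=\([0-9.]*\).*/\1/p'
}

line='172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216 on /test type nfs4 (rw,nosuid,nodev,noexec,relatime,vers=4.0,rsize=131072,wsize=131072)'
nfs_vers "$line"   # prints 4.0
```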

Before joining the host to a CloudStack cluster (even with the cloudstack-agent packages installed), I tested the NFS server with the nfs-common package; the share was mountable with the "expected" results. After joining the CS cluster, the share was only accessible when explicitly giving the option vers=4 as described above. The SSVM is currently working fine (at least, at this point I was able to upload several images).

This is also reproducible for me with any other share (not CS-related) that I provide to the server.
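When a mount times out like this, it can help to walk through the protocol versions explicitly, newest first, to see which one the server actually accepts. A minimal sketch (server, export path, and mount point are placeholders) that only prints the commands to try:

```shell
# gen_mount_cmds SERVER EXPORT MOUNTPOINT
# Prints one mount command per NFS protocol version, newest first; run the
# printed commands one by one to see where negotiation breaks.
gen_mount_cmds() {
    for v in 4.2 4.1 4 3; do
        printf 'mount -v %s:%s %s -o nodev,nosuid,noexec,vers=%s\n' "$1" "$2" "$3" "$v"
    done
}

gen_mount_cmds 172.17.3.6 /volume3/secondary_storage /test
```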

rohityadavcloud commented 3 years ago

cc @nvazquez @sureshanaparti please triage

nvazquez commented 3 years ago

I've checked: the LibvirtStoragePoolDef class needs to be extended by setting the protocol field on the source element to the specified version, which is currently not honoured for KVM (https://libvirt.org/formatstorage.html#StoragePoolSource)
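For reference, libvirt's netfs pool XML accepts an NFS protocol version on the source element via `<protocol ver='...'/>` (in newer libvirt releases), which is what the agent would need to emit. A sketch of such a pool definition (using the host and paths from the log above; this is not what CloudStack currently generates):

```xml
<pool type='netfs'>
  <name>4ca9f81b-7a76-326f-9f2d-e64b5a5bee99</name>
  <source>
    <host name='172.17.3.6'/>
    <dir path='/volume3/secondary_storage/template/tmpl/6/216'/>
    <format type='nfs'/>
    <!-- Pins the NFS protocol version passed to mount; without it,
         libvirt lets mount.nfs negotiate (which fails here). -->
    <protocol ver='4'/>
  </source>
  <target>
    <path>/mnt/4ca9f81b-7a76-326f-9f2d-e64b5a5bee99</path>
  </target>
</pool>
```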

kricud commented 3 years ago

@Hudratronium How are you handling "no_root_squash" with NFSv4? What are the user and group on the NFSv4 mount ("ls -la")? If you force it globally via "secstorage.nfs.version", how are you handling /etc/idmapd.conf for NFSv4 on the secondary storage VM? Can you show "ls -la" on the mount from the secondary storage VM?

Is the outcome for permissions root:root or nobody:nobody?

Hudratronium commented 3 years ago

@kricud tbh I don't get your question regarding NFSv4 / no_root_squash. I would say this is more a problem of the NFS server / client in use than of the CloudStack agent. I provide the option in my /etc/exports for the shares.

user:group are root:root

Sadly I can't take a look into the setup, as I currently have some problems with libvirt preventing my system VMs from starting. But I got your point. I assume that while deploying the SSVM there are scripts checking the environment / hypervisor and setting the domain value in idmapd.conf according to its domain setting.

kricud commented 3 years ago

@Hudratronium That's a lot of assumptions about what you have. Remove the global setting "secstorage.nfs.version" (leave it blank). On the KVM host the newest version supported by the server (NFSv4.1) will then be used. Confirm that the KVM host and the NFS host have the same domain name; if not, fix the domain name in /etc/idmapd.conf on ACS/KVM so that your NFS mounts with root:root permissions. (NFSv4 works differently than NFSv3; more in the link.) https://man7.org/linux/man-pages/man5/nfsidmap.5.html
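Concretely, the id-mapping domain lives in /etc/idmapd.conf and must be identical on the KVM host and the NFS server. A fragment with a placeholder domain:

```ini
# /etc/idmapd.conf (same on KVM host and NFS server; example.com is a
# placeholder). With mismatched domains, NFSv4 maps unknown owners to the
# nobody user instead of root:root.
[General]
Domain = example.com

[Mapping]
Nobody-User = nobody
Nobody-Group = nobody
```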

For the SSVM: Infrastructure > Secondary Storage > click on the storage > Settings > click on the pencil/edit icon, type "3", press OK, then destroy the SSVM.

Everywhere except the SSVM you will have NFSv4, and on the SSVM you will have NFSv3, as you can't set the idmap domain persistently on it.

As a last resort, disable NFSv4 on the server, fix ACS, and then enable it again. NFSv3 will work like clockwork.

There was probably a reason why "secstorage.nfs.version" was introduced as a global setting; I can't come up with anything decent today.

Your issue is a bug, but why you would want to use it this way I can't answer (enforcement, compliance).

Hudratronium commented 3 years ago

@kricud Totally true. If you have further sources for getting into this topic (the working principles of the SSVM), I am more than willing to read into them. Thanks for your advice on setting everything up with NFSv3.

The reasoning isn't so much to use a specific NFS version across ACS-specific components (SSVM / agent / management server) as to reduce the number of ports that need to be opened in firewalls / NFS servers, as well as to use Kerberos authentication later on. --> compliance and regulations
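On the port-reduction point: NFSv4 multiplexes everything over a single well-known TCP port (2049), while NFSv3 additionally needs rpcbind (111) plus the mountd/statd/lockd services. An illustrative nftables fragment (not from the thread) for an NFSv4-only server:

```
table inet nfs {
    chain input {
        type filter hook input priority 0;
        # NFSv4-only: TCP 2049 is sufficient; NFSv3 would additionally
        # need rpcbind (111) and the mountd/statd/lockd ports.
        tcp dport 2049 accept
    }
}
```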

kricud commented 3 years ago

Regarding the SSVM: http://docs.cloudstack.apache.org/en/latest/adminguide/systemvm.html#secondary-storage-vm

According to what I know, there is no nice/persistent way to set the NFSv4 idmap domain for the SSVM. The best course of action now is to use NFSv3 on the SSVM. All the rest, if configured properly, will work with NFSv4 and you will get all the benefits. You don't need to change anything in ACS; the NFS clients on hosts / ACS components will negotiate the latest and greatest version supported by the NFS server.

Compliance can be achieved with an export policy on the NFS server (prohibiting NFSv3), but later :) once you have confirmed that it works as expected.

nvazquez commented 2 years ago

@Hudratronium Do you get the same error when not specifying a value (empty value) for the global setting secstorage.nfs.version? I realized that in my setup the global setting has an empty value and the secondary storage is mounted with version 4 by default:

root@s-21-VM:~# mount | grep nfs
192.168.1.12:/export/secondary on /mnt/SecStorage/dac777a0-4f45-3dca-9a0d-de4e5254d38e type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,soft,proto=tcp,timeo=133,retrans=2147483647,sec=sys,clientaddr=192.168.1.61,local_lock=none,addr=192.168.1.12)

Hudratronium commented 2 years ago

@nvazquez That is quite some time ago, but yes. I remember that I initially tested it without setting a value in the global configuration, without success. I have to admit that since then, after downgrading to NFSv3 on the storage server, I haven't looked into it further.

However, you are checking the mount from inside your SSVM, correct?

It seems I described it a bit confusingly: the host with the cloudstack-agent installed wasn't able to mount the share ("172.17.3.6:/volume3/secondary_storage/template/tmpl/6/216") from the storage, aka the NFS server. As far as I am aware, I had no problems with the SSVM itself mounting the share.

From what I read, I got the impression that 'secstorage.nfs.version' would also be used to configure the mount commands on the hosts. That's why I started trying the option.

nvazquez commented 2 years ago

@Hudratronium Sorry for the delay. You are right; I have now checked from a KVM host after attaching an ISO to a VM, and I also see it mounted as nfs4:

192.168.1.9:/export/secondary/template/tmpl/7/208 on /mnt/f593d3f8-2f00-3ae7-9bb5-16da4762d428 type nfs4 (rw,nosuid,nodev,noexec,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.9,local_lock=none,addr=192.168.1.9)

Will check further on other environments.

Hudratronium commented 2 years ago

@nvazquez Thanks for all your testing! Currently I only have an NFSv4.1 server available, which seems to work as expected. So it seems the problem is solved for me, either via the 4.16 agent or some updated packages from Ubuntu. Maybe for your testing, as I had some issues with this: when you deploy an NFSv4.1 server in your setup, does the mount command still work without specifying a specific version?

nvazquez commented 2 years ago

Great to know @Hudratronium. In my setup the mount command works without specifying a version; it picks up version 4.2 as the default and the command succeeds:

# mount -v 192.168.1.9:/export/secondary/template/tmpl/7/208 /test/ -o nodev,nosuid,noexec
mount.nfs: timeout set for Tue Jan 18 22:52:39 2022
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.1.9,clientaddr=192.168.1.10'

I also tried with the specific version 4.1, which succeeds as well:

# mount -v 192.168.1.9:/export/secondary/template/tmpl/7/208 /test/ -o nodev,nosuid,noexec,vers=4.1
mount.nfs: timeout set for Tue Jan 18 23:18:33 2022
mount.nfs: trying text-based options 'vers=4.1,addr=192.168.1.9,clientaddr=192.168.1.10'
# mount | grep 208
192.168.1.9:/export/secondary/template/tmpl/7/208 on /test type nfs4 (rw,nosuid,nodev,noexec,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=192.168.1.9)

One additional question: the original issue was reported on 4.15.2. Is your environment now running properly on the same CloudStack version, or have you upgraded it to 4.16.0?

Hudratronium commented 2 years ago

@nvazquez I upgraded to 4.16 several weeks ago, so I can't say whether my original problem is solved under 4.15.2 (sadly).

sureshanaparti commented 2 years ago

> @nvazquez I upgraded to 4.16 several weeks ago. So i can't say if my original problem is solved under 4.15.2 (sadly).

@Hudratronium So, you couldn't reproduce this issue in 4.16.0 (after upgrade)?

Hudratronium commented 2 years ago

@sureshanaparti Yes, I couldn't reproduce this after the update.

sureshanaparti commented 2 years ago

Thanks for the confirmation @Hudratronium.

nvazquez commented 2 years ago

Thanks @Hudratronium, closing the issue. Please reopen it in case it is hit again.