Closed ceojinhak closed 5 years ago
Please follow https://netapp-trident.readthedocs.io/en/master/kubernetes/troubleshooting.html. I would pay close attention to the etcd logs.
I've attached some detailed logs below. Can you give me some advice on how to resolve the issue?
[xadmop01@devrepo1 trident-installer]$ ./tridentctl logs -l all -n trident
trident log:
time="2018-09-10T05:47:10Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Mon Jul 30 21:46:22 UTC 2018" version=18.07.0
etcd log:
2018-09-10 05:46:18.434421 I | etcdmain: etcd Version: 3.2.19
2018-09-10 05:46:18.434502 I | etcdmain: Git SHA: 8a9b3d538
2018-09-10 05:46:18.434512 I | etcdmain: Go Version: go1.8.7
2018-09-10 05:46:18.434517 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-10 05:46:18.434521 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2018-09-10 05:46:18.436238 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-09-10 05:46:18.436389 I | embed: listening for peers on http://127.0.0.1:8002
2018-09-10 05:46:18.436439 I | embed: listening for client requests on 127.0.0.1:8001
2018-09-10 05:46:19.440214 W | etcdserver: another etcd process is using "/var/etcd/data/member/snap/db" and holds the file lock.
2018-09-10 05:46:19.440239 W | etcdserver: waiting for it to exit before starting...
2018-09-10 05:46:24.458489 C | mvcc/backend: cannot open database at /var/etcd/data/member/snap/db (no locks available)
panic: cannot open database at /var/etcd/data/member/snap/db (no locks available)
goroutine 75 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420169f20, 0xf8c075, 0x1f, 0xc42005ee68, 0x2, 0x2)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x2)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:131 +0x1a7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.New(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x0, 0x0)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:113 +0x48
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.newBackend(0xc42027c000, 0x8c35d9, 0x4325a8)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:36 +0x1b1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend.func1(0xc4201cae40, 0xc42027c000)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:56 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:57 +0xa4
[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident -d
DEBU Initialized logging. logLevel=debug
DEBU Running outside a pod, creating CLI-based client.
DEBU Initialized Kubernetes CLI client. cli=kubectl flavor=k8s namespace=openstack version=1.10.4
DEBU Validated installation environment. installationNamespace=trident kubernetesVersion=
DEBU Parsed requested volume size. quantity=2Gi
DEBU Dumping RBAC fields. ucpBearerToken= ucpHost= useKubernetesRBAC=true
DEBU Namespace exists. namespace=trident
DEBU PVC does not exist. pvc=trident
DEBU PV does not exist. pv=trident
INFO Starting storage driver. backend=/home/xadmop01/trident-installer/setup/backend.json
DEBU config: {"backendName":"ontapnas_xxx.xxx.xxx.xxx","dataLIF":"xxx.xxx.xxx.xxx","managementLIF":"xxx.xxx.xxx.xxx","password":"xxxxx","storageDriverName":"ontap-nas","svm":"portal","username":"admin","version":1}
DEBU Storage prefix is absent, will use default prefix.
DEBU Parsed commonConfig: {Version:1 StorageDriverName:ontap-nas BackendName:ontapnas192.168.10.200 Debug:false DebugTraceFlags:map[] DisableDelete:false StoragePrefixRaw:[] StoragePrefix:
We're aware of this issue, and if I'm not mistaken, I've already helped you or your account team determine etcd is the problem when a question was asked using our internal mailing list. My understanding is there is a support case open, so please be patient and wait for the process to go through. In the meantime, we'll update this issue once we have a solution. Thanks!
We have validated that the problem (cannot open database at /var/etcd/data/member/snap/db (no locks available)) isn't related to Trident or Kubernetes, as the etcd panic also happens when etcd is run as a binary with a manually mounted NFS share.
The problem seems to be with obtaining a lock on an NFS file and getting ENOLCK.
Any application that issues the flock system call may experience the same problem: https://linux.die.net/man/2/flock https://golang.org/pkg/syscall/#Flock
Our NFS experts are investigating this problem.
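To make the failure mode concrete, here is a minimal Python sketch of the same flock(2) call that etcd's storage backend issues on its database file. The path is a local stand-in for /var/etcd/data/member/snap/db, not the real etcd data directory; on a healthy filesystem the lock succeeds, while on an NFS mount with broken locking the kernel returns ENOLCK, which is exactly the "no locks available" text in the panic above.

```python
import errno
import fcntl
import os
import tempfile

# Local stand-in path (assumption for illustration only).
path = os.path.join(tempfile.gettempdir(), "snap-db-standin")
fd = os.open(path, os.O_RDWR | os.O_CREAT)
try:
    # Non-blocking exclusive lock, the same syscall etcd/bolt uses.
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired")
except OSError as e:
    if e.errno == errno.ENOLCK:
        # This is the NFS case: NLM/statd is not working.
        print("no locks available (ENOLCK)")
    else:
        raise
finally:
    os.close(fd)
```

Running the same snippet with `path` pointing at a file on the affected NFS mount should reproduce the ENOLCK branch.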
@ceojinhak To confirm NFS locking is the issue, you can follow these steps:
- Manually create an NFS share.
- Manually mount the NFS share to the host.
- Create a file on this share.
- Use flock (man 1 flock) to try to obtain a lock on this file: flock -n --verbose /mnt/nfs/myfile -c cat
If flock fails with an error, this confirms NFS locking is the issue.
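The manual steps above can also be scripted. This is a hypothetical helper (not part of Trident or tridentctl): it creates a file on an already-mounted share and attempts a non-blocking exclusive flock on it, returning False when the kernel reports "No locks available".

```python
import errno
import fcntl
import os

def check_locking(mountpoint):
    """Probe flock support on the given (already mounted) path."""
    path = os.path.join(mountpoint, "lock-probe")
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as e:
        if e.errno in (errno.ENOLCK, errno.EAGAIN):
            # ENOLCK is flock's "No locks available" (NLM/statd problem).
            return False
        raise
    finally:
        os.close(fd)
        os.unlink(path)

# e.g. check_locking("/mnt/nfs") returning False points at the NFS
# locking problem rather than Trident or Kubernetes.
```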
Okay, I will try it tomorrow and let you know the result.
I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance, you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.
Thanks @nlowe. As the first comment indicates, this is an issue with our ontap-nas driver and with a new install, so the etcd volume doesn't exist already. However, you're right that the majority of the etcd problems occur because users inadvertently share the same etcd volume between different instances of Trident. You can also run tridentctl install --volume-name to specify a different name for the etcd volume, but it's good practice to use different storage prefixes for different instances of Trident (currently you should only deploy one Trident instance per k8s cluster).
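For instance, a second cluster's setup/backend.json could differ only in these identifying fields. This is a sketch based on the config shown earlier in this thread; the backendName and storagePrefix values are made-up placeholders, and igroupName is relevant to the SAN drivers (e.g., ontap-san) rather than ontap-nas.

```json
{
    "version": 1,
    "storageDriverName": "ontap-nas",
    "managementLIF": "xxx.xxx.xxx.xxx",
    "dataLIF": "xxx.xxx.xxx.xxx",
    "svm": "portal",
    "username": "admin",
    "password": "xxxxx",
    "backendName": "ontapnas_cluster2",
    "storagePrefix": "trident_cluster2_"
}
```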
The customer tested flock and sent the result below.
flock: /mnt/test.txt: No locks available
Does that mean the issue is caused not by Trident and k8s but by the NFS protocol? Is that right?
That's exactly what it means. Something with NFS isn't configured properly in your customer's environment. For NFS locking to work, statd should be running on your client hosts:
sudo systemctl enable rpc-statd # Enable statd on boot
sudo systemctl start rpc-statd # Start statd for the current session
The customer verified that rpc-statd was active on all nodes. However, he succeeded in installing Trident after enabling NFSv4 on the FAS. He also set up another k8s cluster, separate from the problematic one, and confirmed that Trident installed fine there; nothing was changed on the storage side at that time. So, as you said, this issue is likely a problem with the NFS client (rpc-statd). The reason the installation succeeded once NFSv4 was enabled in the problematic environment appears to be that rpc-statd only handles lock monitoring for NFSv2/v3; NFSv4 handles locking in a different way. Finally, the customer and I decided to close this case to avoid wasting any more time.
Many thanks for your help so far.
Glad to hear you figured it out! I'm adding some notes for the future reference of anyone who may encounter this problem.
Source: https://www.netapp.com/us/media/tr-4067.pdf
File locking mechanisms were created to prevent a file from being accessed for write operations by more than one user or application at a time. NFS leverages file locking either using the NLM process in NFSv3 or by leasing and locking, which is built in to the NFSv4.x protocols. Not all applications leverage file locking, however; for example, the application "vi" does not lock files. Instead, it uses a file swap method to save changes to a file.
When an NFS client requests a lock, the client interacts with the clustered Data ONTAP system to save the lock state. Where the lock state is stored depends on the NFS version being used. In NFSv3, the lock state is stored at the data layer. In NFSv4.x, the lock states are stored in the NAS protocol stack.
Use file locking using the NLM protocol when possible with NFSv3. Use NFSv4.x (4.1 if possible) when appropriate to take advantage of stateful connections, integrated locking, and session functionality.
Source: http://people.redhat.com/steved/Netapp_NFS_BestPractice.pdf Section 5.3: Network Lock Manager
Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-nfs-client-config-options.html
Common NFS Mount Options: nolock — Disables file locking. This setting is occasionally required when connecting to older NFS servers.
Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html
NFSv4 has no interaction with portmapper, rpc.mountd, rpc.lockd, and rpc.statd, since protocol support has been incorporated into the v4 protocol. NFSv4 listens on the well-known TCP port (2049), which eliminates the need for the portmapper interaction. The mounting and locking protocols have been incorporated into the v4 protocol, which eliminates the need for interaction with rpc.mountd and rpc.lockd.
rpc.lockd — allows NFS clients to lock files on the server. If rpc.lockd is not started, file locking will fail. rpc.lockd implements the Network Lock Manager (NLM) protocol. This process corresponds to the nfslock service. This is not used with NFSv4.
rpc.statd — This process implements the Network Status Monitor (NSM) RPC protocol, which notifies NFS clients when an NFS server is restarted without being gracefully brought down. This process is started automatically by the nfslock service and does not require user configuration. This is not used with NFSv4.
Source: https://wiki.wireshark.org/Network_Lock_Manager
The purpose of the NLM protocol is to provide something similar to POSIX advisory file locking semantics to NFS version 2 and 3. The lock manager is typically implemented completely inside user space in a lock manager daemon; that daemon will receive messages from the NFS client when a lock is requested, and will send NLM requests to the NLM server on the NFS server machine, and will receive NLM requests from NFS clients of the machine on which it's running and will make local locking calls on behalf of those clients. You need to run this lock manager daemon on BOTH the client and the server for lock management to work. Lock manager peers rely on the NSM protocol to notify each other of service restarts/reboots so that locks can be resynchronized after a reboot.
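The advisory semantics described above can be observed on a local filesystem with a short sketch (the file path here is an assumed local one; this illustrates flock behavior itself, not the NLM wire protocol): a lock held through one file descriptor makes a non-blocking attempt through a second descriptor fail, yet does not block ordinary reads and writes.

```python
import errno
import fcntl
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "advisory-demo")
fd1 = os.open(path, os.O_RDWR | os.O_CREAT)
fcntl.flock(fd1, fcntl.LOCK_EX)            # first descriptor holds the lock

fd2 = os.open(path, os.O_RDWR)             # a separate open file description
os.write(fd2, b"still writable")           # advisory: plain I/O is not blocked
try:
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_lock = True
except OSError as e:
    # EWOULDBLOCK/EAGAIN: the lock is held elsewhere.
    second_lock = e.errno not in (errno.EAGAIN, errno.EACCES)
print("second lock acquired:", second_lock)
os.close(fd2)
os.close(fd1)
```

On Linux, flock locks belong to the open file description, so even two descriptors in the same process conflict, which is why the second attempt is refused while fd1 holds the lock.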
We should write this up in the troubleshooting section.
I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.
Hello,
I'm running into a similar problem right now. I have tried installing/uninstalling the latest Trident driver several times. We previously used the same NetApp SVM for tests with OpenShift and want to reuse it for Kubernetes/Rancher. Is it possible to use the same SVM for more than one cluster? And how or where can the attributes you mentioned be changed?
OK, I stumbled across a discussion of this issue on a NetApp Slack channel. The first installation used the standard volume name for etcd storage on the SVM. When a second installation tries to create/use that same volume, an error is raised. In this case a customized installation is necessary.
Trident 19.07 no longer uses the etcd volume, so this is no longer an issue.
I tried to install Trident v18.07 on a K8s cluster with FAS (ontap-nas). During the installation, I got the following error logs in debug mode.
[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident
INFO Trident pod started. namespace=trident pod=trident-797f547579-d572m
INFO Waiting for Trident REST interface.
ERRO Trident REST interface was not available after 180.00 seconds.
FATA Install failed; exit status 1; Error: Get http://127.0.0.1:8000/trident/v1/version: dial tcp 127.0.0.1:8000: connect: connection refused
command terminated with exit code 1; use 'tridentctl logs' to learn more. Resolve the issue; use 'tridentctl uninstall' to clean up; and try again.
There is no high latency between the k8s nodes and the FAS storage, and several re-installations didn't help. What could be the cause of the issue?