Closed ceojinhak closed 5 years ago
Please follow https://netapp-trident.readthedocs.io/en/master/kubernetes/troubleshooting.html. I would pay close attention to the etcd logs.
I've attached some detailed logs below. Can you give me some advice on how to resolve the issue?
[xadmop01@devrepo1 trident-installer]$ ./tridentctl logs -l all -n trident
trident log:
time="2018-09-10T05:47:10Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Mon Jul 30 21:46:22 UTC 2018" version=18.07.0
etcd log:
2018-09-10 05:46:18.434421 I | etcdmain: etcd Version: 3.2.19
2018-09-10 05:46:18.434502 I | etcdmain: Git SHA: 8a9b3d538
2018-09-10 05:46:18.434512 I | etcdmain: Go Version: go1.8.7
2018-09-10 05:46:18.434517 I | etcdmain: Go OS/Arch: linux/amd64
2018-09-10 05:46:18.434521 I | etcdmain: setting maximum number of CPUs to 32, total number of available CPUs is 32
2018-09-10 05:46:18.436238 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-09-10 05:46:18.436389 I | embed: listening for peers on http://127.0.0.1:8002
2018-09-10 05:46:18.436439 I | embed: listening for client requests on 127.0.0.1:8001
2018-09-10 05:46:19.440214 W | etcdserver: another etcd process is using "/var/etcd/data/member/snap/db" and holds the file lock.
2018-09-10 05:46:19.440239 W | etcdserver: waiting for it to exit before starting...
2018-09-10 05:46:24.458489 C | mvcc/backend: cannot open database at /var/etcd/data/member/snap/db (no locks available)
panic: cannot open database at /var/etcd/data/member/snap/db (no locks available)
goroutine 75 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420169f20, 0xf8c075, 0x1f, 0xc42005ee68, 0x2, 0x2)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.newBackend(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x2)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:131 +0x1a7
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.New(0xc420276aa0, 0x1d, 0x5f5e100, 0x2710, 0x280000000, 0x0, 0x0)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/backend.go:113 +0x48
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.newBackend(0xc42027c000, 0x8c35d9, 0x4325a8)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:36 +0x1b1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend.func1(0xc4201cae40, 0xc42027c000)
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:56 +0x2b
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.openBackend
	/tmp/etcd-release-3.2.19/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/backend.go:57 +0xa4
[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident -d
DEBU Initialized logging. logLevel=debug
DEBU Running outside a pod, creating CLI-based client.
DEBU Initialized Kubernetes CLI client. cli=kubectl flavor=k8s namespace=openstack version=1.10.4
DEBU Validated installation environment. installationNamespace=trident kubernetesVersion=
DEBU Parsed requested volume size. quantity=2Gi
DEBU Dumping RBAC fields. ucpBearerToken= ucpHost= useKubernetesRBAC=true
DEBU Namespace exists. namespace=trident
DEBU PVC does not exist. pvc=trident
DEBU PV does not exist. pv=trident
INFO Starting storage driver. backend=/home/xadmop01/trident-installer/setup/backend.json
DEBU config: {"backendName":"ontapnas_xxx.xxx.xxx.xxx","dataLIF":"xxx.xxx.xxx.xxx","managementLIF":"xxx.xxx.xxx.xxx","password":"xxxxx","storageDriverName":"ontap-nas","svm":"portal","username":"admin","version":1}
DEBU Storage prefix is absent, will use default prefix.
DEBU Parsed commonConfig: {Version:1 StorageDriverName:ontap-nas BackendName:ontapnas192.168.10.200 Debug:false DebugTraceFlags:map[] DisableDelete:false StoragePrefixRaw:[] StoragePrefix:
We're aware of this issue, and if I'm not mistaken, I've already helped you or your account team determine etcd is the problem when a question was asked using our internal mailing list. My understanding is there is a support case open, so please be patient and wait for the process to go through. In the meantime, we'll update this issue once we have a solution. Thanks!
We have validated that the problem (cannot open database at /var/etcd/data/member/snap/db (no locks available)) isn't related to Trident or Kubernetes, as the etcd panic also happens when etcd is run as a binary with a manually mounted NFS share.
The problem seems to be with obtaining a lock on an NFS file and getting ENOLCK.
Any application that issues the flock system call may experience the same problem: https://linux.die.net/man/2/flock https://golang.org/pkg/syscall/#Flock
Our NFS experts are investigating this problem.
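To make the failure mode concrete, here is a minimal Python sketch of the same flock(2) call that etcd's storage backend issues on its database file. The path is a local stand-in for /var/etcd/data/member/snap/db, not the real etcd data directory; on a healthy filesystem the lock succeeds, while on an NFS mount with broken locking the kernel returns ENOLCK, which is exactly the "no locks available" text in the panic above.

```python
import errno
import fcntl
import os
import tempfile

# Local stand-in path (assumption for illustration only).
path = os.path.join(tempfile.gettempdir(), "snap-db-standin")
fd = os.open(path, os.O_RDWR | os.O_CREAT)
try:
    # Non-blocking exclusive lock, the same syscall etcd/bolt uses.
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired")
except OSError as e:
    if e.errno == errno.ENOLCK:
        # This is the NFS case: NLM/statd is not working.
        print("no locks available (ENOLCK)")
    else:
        raise
finally:
    os.close(fd)
```

Running the same snippet with `path` pointing at a file on the affected NFS mount should reproduce the ENOLCK branch.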
@ceojinhak To confirm NFS locking is the issue, you can follow these steps:
- Manually create an NFS share.
- Manually mount the NFS share to the host.
- Create a file on this share.
- Use flock (man 1 flock) to try to obtain a lock on this file: flock -n --verbose /mnt/nfs/myfile -c cat
If flock fails with an error, this confirms NFS locking is the issue.
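The manual steps above can also be scripted. This is a hypothetical helper (not part of Trident or tridentctl): it creates a file on an already-mounted share and attempts a non-blocking exclusive flock on it, returning False when the kernel reports "No locks available".

```python
import errno
import fcntl
import os

def check_locking(mountpoint):
    """Probe flock support on the given (already mounted) path."""
    path = os.path.join(mountpoint, "lock-probe")
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as e:
        if e.errno in (errno.ENOLCK, errno.EAGAIN):
            # ENOLCK is flock's "No locks available" (NLM/statd problem).
            return False
        raise
    finally:
        os.close(fd)
        os.unlink(path)

# e.g. check_locking("/mnt/nfs") returning False points at the NFS
# locking problem rather than Trident or Kubernetes.
```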
Okay, I will try it tomorrow and let you know the result.
I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance, you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.
Thanks @nlowe. As the first comment indicates, this is an issue with our ontap-nas driver and with a new install, so the etcd volume doesn't exist already. However, you're right that the majority of the etcd problems occur because users inadvertently share the same etcd volume between different instances of Trident. You can also run tridentctl install --volume-name to specify a different name for the etcd volume, but it's good practice to use different storage prefixes for different instances of Trident (currently you should only deploy one Trident instance per k8s cluster).
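For instance, a second cluster's setup/backend.json could differ only in these identifying fields. This is a sketch based on the config shown earlier in this thread; the backendName and storagePrefix values are made-up placeholders, and igroupName is relevant to the SAN drivers (e.g., ontap-san) rather than ontap-nas.

```json
{
    "version": 1,
    "storageDriverName": "ontap-nas",
    "managementLIF": "xxx.xxx.xxx.xxx",
    "dataLIF": "xxx.xxx.xxx.xxx",
    "svm": "portal",
    "username": "admin",
    "password": "xxxxx",
    "backendName": "ontapnas_cluster2",
    "storagePrefix": "trident_cluster2_"
}
```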
The customer tested flock and sent the result below.
flock: /mnt/test.txt: No locks available
Does that mean the issue is caused not by Trident and k8s but by the NFS protocol? Is that right?
That's exactly what it means. Something with NFS isn't configured properly in your customer's environment. For NFS locking to work, statd should be running on your client hosts:
sudo systemctl enable rpc-statd # Enable statd on boot
sudo systemctl start rpc-statd # Start statd for the current session
The customer verified that rpc-statd was active on all nodes. However, he succeeded in installing Trident after enabling NFSv4 on the FAS. He also set up another k8s cluster, separate from the problematic one, and confirmed that Trident installed fine there; nothing was changed on the storage side at that time. So, as you said, this issue is likely a problem with the NFS client (rpc-statd). The reason the installation succeeded once NFSv4 was enabled in the problematic environment appears to be that rpc-statd only handles lock monitoring for NFSv2/v3; NFSv4 handles locking in a different way. Finally, the customer and I decided to close this case to avoid wasting any more time.
Many thanks for your help so far.
Glad to hear you figured it out! I'm adding some notes for the future reference of anyone who may encounter this problem.
Source: https://www.netapp.com/us/media/tr-4067.pdf
File locking mechanisms were created to prevent a file from being accessed for write operations by more than one user or application at a time. NFS leverages file locking either using the NLM process in NFSv3 or by leasing and locking, which is built in to the NFSv4.x protocols. Not all applications leverage file locking, however; for example, the application "vi" does not lock files. Instead, it uses a file swap method to save changes to a file.
When an NFS client requests a lock, the client interacts with the clustered Data ONTAP system to save the lock state. Where the lock state is stored depends on the NFS version being used. In NFSv3, the lock state is stored at the data layer. In NFSv4.x, the lock states are stored in the NAS protocol stack.
Use file locking using the NLM protocol when possible with NFSv3. Use NFSv4.x (4.1 if possible) when appropriate to take advantage of stateful connections, integrated locking, and session functionality.
Source: http://people.redhat.com/steved/Netapp_NFS_BestPractice.pdf Section 5.3: Network Lock Manager
Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-nfs-client-config-options.html
Common NFS Mount Options: nolock — Disables file locking. This setting is occasionally required when connecting to older NFS servers.
Source: https://www.centos.org/docs/5/html/Deployment_Guide-en-US/ch-nfs.html
NFSv4 has no interaction with portmapper, rpc.mountd, rpc.lockd, and rpc.statd, since protocol support has been incorporated into the v4 protocol. NFSv4 listens on the well-known TCP port (2049), which eliminates the need for the portmapper interaction. The mounting and locking protocols have been incorporated into the v4 protocol, which eliminates the need for interaction with rpc.mountd and rpc.lockd.
rpc.lockd — allows NFS clients to lock files on the server. If rpc.lockd is not started, file locking will fail. rpc.lockd implements the Network Lock Manager (NLM) protocol. This process corresponds to the nfslock service. This is not used with NFSv4.
rpc.statd — This process implements the Network Status Monitor (NSM) RPC protocol, which notifies NFS clients when an NFS server is restarted without being gracefully brought down. This process is started automatically by the nfslock service and does not require user configuration. This is not used with NFSv4.
Source: https://wiki.wireshark.org/Network_Lock_Manager
The purpose of the NLM protocol is to provide something similar to POSIX advisory file locking semantics to NFS version 2 and 3. The lock manager is typically implemented completely inside user space in a lock manager daemon; that daemon will receive messages from the NFS client when a lock is requested, and will send NLM requests to the NLM server on the NFS server machine, and will receive NLM requests from NFS clients of the machine on which it's running and will make local locking calls on behalf of those clients. You need to run this lock manager daemon on BOTH the client and the server for lock management to work. Lock manager peers rely on the NSM protocol to notify each other of service restarts/reboots so that locks can be resynchronized after a reboot.
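The advisory semantics described above can be observed on a local filesystem with a short sketch (the file path here is an assumed local one; this illustrates flock behavior itself, not the NLM wire protocol): a lock held through one file descriptor makes a non-blocking attempt through a second descriptor fail, yet does not block ordinary reads and writes.

```python
import errno
import fcntl
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "advisory-demo")
fd1 = os.open(path, os.O_RDWR | os.O_CREAT)
fcntl.flock(fd1, fcntl.LOCK_EX)            # first descriptor holds the lock

fd2 = os.open(path, os.O_RDWR)             # a separate open file description
os.write(fd2, b"still writable")           # advisory: plain I/O is not blocked
try:
    fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_lock = True
except OSError as e:
    # EWOULDBLOCK/EAGAIN: the lock is held elsewhere.
    second_lock = e.errno not in (errno.EAGAIN, errno.EACCES)
print("second lock acquired:", second_lock)
os.close(fd2)
os.close(fd1)
```

On Linux, flock locks belong to the open file description, so even two descriptors in the same process conflict, which is why the second attempt is refused while fd1 holds the lock.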
We should write this up in the troubleshooting section.
I just ran into this while trying to deploy Trident to a different k8s cluster. If you have multiple installations of Trident on the same NetApp instance you will probably want to change igroupName and storagePrefix in addition to backendName. Changing these allowed Trident to be deployed successfully.
Hello,
I'm running into a similar problem right now. I have tried installing/uninstalling the latest Trident driver several times. We previously used the same NetApp SVM for tests with OpenShift and want to reuse it for Kubernetes/Rancher. Is it possible to use the same SVM for more than one cluster? And how or where can the attributes you mentioned be changed?
OK, I stumbled across a discussion of this issue on a NetApp Slack channel. The first installation used the standard volume name for etcd storage on the SVM. When a second installation tries to create/use that same volume, an error is raised. In this case a customized installation is necessary.
Trident 19.07 no longer uses the etcd volume, so this is no longer an issue.
I tried to install Trident v18.07 on a K8s cluster with FAS (ontap-nas). During the installation, I got the following error logs in debug mode.
[xadmop01@devrepo1 trident-installer]$ ./tridentctl install -n trident
INFO Trident pod started. namespace=trident pod=trident-797f547579-d572m
INFO Waiting for Trident REST interface.
ERRO Trident REST interface was not available after 180.00 seconds.
FATA Install failed; exit status 1; Error: Get http://127.0.0.1:8000/trident/v1/version: dial tcp 127.0.0.1:8000: connect: connection refused
command terminated with exit code 1; use 'tridentctl logs' to learn more. Resolve the issue; use 'tridentctl uninstall' to clean up; and try again.
There is no high latency between the k8s nodes and the FAS storage, and several re-installations didn't help. What could be the cause of the issue?