NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Unable to add exact same backend that was successfully used for Trident install #105

Closed Routhinator closed 6 years ago

Routhinator commented 6 years ago

I just installed Trident on a fresh kube cluster. The initial install with the temporary backend went well, and I see its config volume trident_trident in the SVM.

However, continuing with the docs, I tried to add that backend to the config after the install using the exact same file, and it fails:

tridentctl -n netapp-trident logs:

time="2018-03-26T21:59:07Z" level=info msg="Running Trident storage orchestrator." binary=/usr/local/bin/trident_orchestrator build_time="Thu Jan 25 00:42:53 UTC 2018" version=18.01.0
time="2018-03-26T21:59:13Z" level=info msg="Kubernetes frontend determined the container orchestrator version." gitVersion=v1.8.4+coreos.0 version=1.8+
time="2018-03-26T21:59:13Z" level=info msg="Added frontend." name=kubernetes
time="2018-03-26T21:59:13Z" level=info msg="Starting REST interface on localhost:8000"
time="2018-03-26T21:59:13Z" level=info msg="Added frontend." name=REST
time="2018-03-26T21:59:13Z" level=info msg="Transforming persistent state." current_store_version=etcdv2 desired_store_version=etcdv3
time="2018-03-26T21:59:13Z" level=info msg="No key with prefix /trident to migrate."
time="2018-03-26T21:59:13Z" level=info msg="trident bootstrapped successfully."
time="2018-03-26T22:05:18Z" level=error msg="API invocation failed. Post https://192.168.10.14/servlets/netapp.servlets.admin.XMLrequest_filer: dial tcp 192.168.10.14:443: getsockopt: connection timed out"
time="2018-03-26T22:05:18Z" level=error msg="Problem initializing storage driver: 'ontap-nas' error: Error initializing ontap-nas driver. Could not determine Data ONTAP API version. Could not read ONTAPI version. Post https://192.168.10.14/servlets/netapp.servlets.admin.XMLrequest_filer: dial tcp 192.168.10.14:443: getsockopt: connection timed out" backend= handler=AddBackend
time="2018-03-26T22:05:18Z" level=info msg="API server REST call." duration=2m7.269380129s method=POST route=AddBackend uri=/trident/v1/backend

setup/backend.json:

{
    "version": 1,
    "storageDriverName": "ontap-nas",
    "managementLIF": "192.168.10.14",
    "dataLIF": "192.168.10.14",
    "svm": "dockervols",
    "username": "vsadmin",
    "password": "<sanitized>"
}
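The add itself was the standard command from the Trident docs, something along the lines of:

tridentctl -n netapp-trident create backend -f setup/backend.json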

I don't understand why it cannot connect to the same system it just used to set itself up. The user has full admin permissions, but the error seems to indicate a socket timeout. However, when I log in to the Kube host, I can telnet to port 443 on the SVM just fine.

adkerr commented 6 years ago

Reusing the same backend should work fine. We do it all the time internally, even our CI system does this.

Can you verify that your system responds to the ZAPI call properly?

On your Kubernetes host, create a file system-get-version.xml with the following contents

<?xml version="1.0" encoding="UTF-8"?>
<netapp xmlns="http://www.netapp.com/filer/admin" version="1.21">
 <system-get-version/>
</netapp>

Then use this curl command:

curl -k -X POST -d @./system-get-version.xml https://vsadmin:password@192.168.10.14/servlets/netapp.servlets.admin.XMLrequest_filer
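If you want to rule out pod networking specifically, the same check can be run from inside the Trident pod, assuming the trident-main image has curl available (if it doesn't, just test from the host first):

kubectl cp system-get-version.xml netapp-trident/<trident pod name>:/tmp/system-get-version.xml -c trident-main
kubectl exec <trident pod name> -n netapp-trident -c trident-main -- curl -k -X POST -d @/tmp/system-get-version.xml https://vsadmin:password@192.168.10.14/servlets/netapp.servlets.admin.XMLrequest_filer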

Routhinator commented 6 years ago

@adkerr Thanks for helping. The answer seems to be yes, it responds to the ZAPI call:

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE netapp SYSTEM 'file:/etc/netapp_gx.dtd'>
<netapp version='1.110' xmlns='http://www.netapp.com/filer/admin'>
<results status="passed"><build-timestamp>1474834249</build-timestamp><is-clustered>true</is-clustered><version>NetApp Release 9.1RC1: Sun Sep 25 20:10:49 UTC 2016</version><version-tuple><system-version-tuple><generation>9</generation><major>1</major><minor>0</minor></system-version-tuple></version-tuple></results></netapp>
clintonk commented 6 years ago

@Routhinator, what do you see if you run 'tridentctl logs'?

Routhinator commented 6 years ago

@clintonk That is in the original post. The first code block is the tridentctl logs output.

clintonk commented 6 years ago

@Routhinator OK, can you check connectivity to the storage from within the Trident pod?

kubectl exec <trident pod name> -n trident -c trident-main -- ping 192.168.10.14
Routhinator commented 6 years ago

@clintonk Works:

kubectl exec trident-9dd6bc758-9dwsp -n netapp-trident -c trident-main -- ping 192.168.10.14
PING 192.168.10.14 (192.168.10.14): 56 data bytes
64 bytes from 192.168.10.14: seq=0 ttl=254 time=2.734 ms
64 bytes from 192.168.10.14: seq=1 ttl=254 time=0.315 ms
64 bytes from 192.168.10.14: seq=2 ttl=254 time=0.299 ms
Routhinator commented 6 years ago

@clintonk Ran some more tests. I don't know what it means yet, but it seems the pod can only connect to the SVM over port 22, while 80 and 443 time out. However, 80 and 443 work from the host:

kubectl exec trident-9dd6bc758-9dwsp -n netapp-trident -c trident-main -- telnet 192.168.10.14 443
Connection closed by foreign host
command terminated with exit code 1

kubectl exec trident-9dd6bc758-9dwsp -n netapp-trident -c trident-main -- telnet 192.168.10.14 22
SSH-2.0-OpenSSH_6.6.1_hpn13v11 FreeBSD-20140420

kubectl exec trident-9dd6bc758-9dwsp -n netapp-trident -c trident-main -- telnet 192.168.10.14 80
Connection closed by foreign host
command terminated with exit code 1
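Next I might try running the same telnet from a throwaway busybox pod (busybox ships a telnet applet) to see whether this is specific to the Trident pod or a general pod-networking problem, e.g.:

kubectl run nettest --rm -it --restart=Never --image=busybox -- telnet 192.168.10.14 443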
Routhinator commented 6 years ago

For the record, this is a dedicated SAN/NAS network with no firewall, so there should be no issue there.

Routhinator commented 6 years ago

Does anyone have any ideas on this? Why would a pod be unable to reach certain ports like this? I would have thought it was a Kube routing issue, but port 22 works, so this is port-specific to this pod. It doesn't affect the host, and the installer pod was able to access the storage fine in order for me to get this far.

Is this a bug? Should I just remove the pod and try again?

Routhinator commented 6 years ago

I destroyed the pod and recreated it, and it was able to add the backend. Looks like there was some kind of bug with the original pod.
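For anyone who hits the same thing: recreating it was just a matter of deleting the pod so the deployment reschedules it, then retrying the add, roughly:

kubectl delete pod trident-9dd6bc758-9dwsp -n netapp-trident
kubectl get pods -n netapp-trident
tridentctl -n netapp-trident create backend -f setup/backend.json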

Routhinator commented 6 years ago

I'll let the maintainers decide whether this was on the Kube side or a possible setup bug, and close as appropriate.

adkerr commented 6 years ago

It certainly feels like a k8s issue, since the bootstrap pod, the host, and the subsequent pod were all able to communicate with the filer without issue.

@kangarlou, any reason not to close this?

kangarlou commented 6 years ago

I doubt the issue is related to Trident. I'm fine with closing the issue; we can reopen it if the problem happens again. @Routhinator, what network plugin are you using?
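For reference, the network plugin in use can usually be identified from the kube-system pods or from the CNI config on one of the nodes, e.g.:

kubectl get pods -n kube-system
ls /etc/cni/net.d/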