NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Unable to install v18.04 on OpenShift 3.9 #120

Closed kaparora closed 6 years ago

kaparora commented 6 years ago

We followed the installation steps described at https://netapp-trident.readthedocs.io/en/stable-v18.04/kubernetes/deploying.html#download-extract-the-installer

We started the installation on the master.

[root@se1-ocpma-e100 trident-installer]# cat setup/backend.json 
{ 
    "version": 1, 
    "storageDriverName": "ontap-nas", 
    "managementLIF": "4.168.16.25", 
    "username": "aaa", 
    "password": "xxx", 
    "defaults": { 
      "spaceReserve": "none", 
      "exportPolicy": "openshift" 
    } 
} 
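
Before running the installer we also did a quick sanity check that the LIF from the backend.json above is reachable from the master and that the SVM is exporting over NFS (showmount comes from nfs-utils):

# confirm the LIF answers from the master
ping -c 3 4.168.16.25

# list the NFS exports visible through the same LIF
showmount -e 4.168.16.25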

[root@se1-ocpma-e100 trident-installer]# ./tridentctl install -n netapp2 -d 
DEBU Initialized logging.                          logLevel=debug 

DEBU Initialized Kubernetes CLI client.            cli=oc flavor=openshift namespace=netapp2 version=1.9. 
DEBU Validated Trident installation environment.   installationNamespace=netapp2 kubernetesVersion=1.9.1+ 
DEBU Parsed requested volume size.                 quantity=2Gi 
DEBU Namespace exists.                             namespace=netapp2 
DEBU PVC does not exist.                           pvc=trident 
DEBU PV does not exist.                            pv=trident 
INFO Starting storage driver.                      backend=/root/trident-installer/setup/backend.json 
DEBU config: {"defaults":{"exportPolicy":"openshift","spaceReserve":"none"},"managementLIF":"4.168.16.25" 
DEBU Storage prefix is absent, will use default prefix. 
DEBU Parsed commonConfig: {Version:1 StorageDriverName:ontap-nas BackendName: Debug:false DebugTraceFlags 
DEBU Initializing storage driver.                  driver=ontap-nas 
DEBU Addresses found from ManagementLIF lookup.    addresses="[4.168.16.25]" hostname=4.168.16.25 
DEBU Using derived SVM.                            SVM=se1-svm-s01 
DEBU ONTAP API version.                            Ontapi=1.110 
WARN Could not determine controller serial numbers. API status: failed, Reason: Unable to find API: syste 
DEBU Configuration defaults                        Encryption=false ExportPolicy=openshift FileSystemTypene SplitOnClone=false StoragePrefix=trident_ UnixPermissions=---rwxrwxrwx 
DEBU Data LIFs                                     dataLIFs="[4.168.16.25]" 
DEBU Found NAS LIFs.                               dataLIFs="[4.168.16.25]" 
DEBU Configured EMS heartbeat.                     intervalHours=24 
DEBU Read storage pools assigned to SVM.           pools="[sdeb_nas_t001_data01 sdeb_nas_t002_data01]" sv 
DEBU Read aggregate attributes.                    aggregate=sdeb_nas_t001_data01 mediaType=hdd 
DEBU Read aggregate attributes.                    aggregate=sdeb_nas_t002_data01 mediaType=hdd 
DEBU Storage driver initialized.                   driver=ontap-nas 
INFO Storage driver loaded.                        driver=ontap-nas 
INFO Starting Trident installation.                namespace=netapp2 
DEBU Deleted Kubernetes object by YAML. 
DEBU Deleted cluster role binding. 
DEBU Deleted Kubernetes object by YAML. 
DEBU Deleted cluster role. 
DEBU Deleted Kubernetes object by YAML. 
DEBU Deleted service account. 
DEBU Removed Trident user from security context constraint. 
DEBU Created Kubernetes object by YAML. 
INFO Created service account. 
DEBU Created Kubernetes object by YAML. 
INFO Created cluster role. 
DEBU Created Kubernetes object by YAML. 
INFO Created cluster role binding. 
INFO Added Trident user to security context constraint. 
DEBU Created Kubernetes object by YAML. 
INFO Created PVC. 
DEBU Attempting volume create.                     size=2147483648 storagePool=sdeb_nas_t001_data01 volCo 
DEBU Created Kubernetes object by YAML. 
INFO Created PV.                                   pv=trident 
INFO Waiting for PVC to be bound.                  pvc=trident 
DEBU PVC not yet bound, waiting.                   increment=619.512855ms pvc=trident 
DEBU PVC not yet bound, waiting.                   increment=676.793322ms pvc=trident 
DEBU PVC not yet bound, waiting.                   increment=1.225961586s pvc=trident 
DEBU Logged EMS message.                           driver=ontap-nas 
DEBU PVC not yet bound, waiting.                   increment=1.328790335s pvc=trident 
DEBU Created Kubernetes object by YAML. 
INFO Created Trident deployment. 
INFO Waiting for Trident pod to start. 
DEBU Trident pod not yet running, waiting.         increment=619.624506ms 
DEBU Trident pod not yet running, waiting.         increment=870.617544ms 
DEBU Trident pod not yet running, waiting.         increment=844.84827ms 
INFO Trident pod started.                          namespace=netapp2 pod=trident-cdd5fc7b4-ls8h4 
INFO Waiting for Trident REST interface. 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=360.640418ms 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=877.614503ms 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=1.520820412s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=1.834092202s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=3.152914941s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=3.145476382s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=6.207780768s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=5.170037335s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=18.007844228s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=16.276606311s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=34.967432358s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
DEBU REST interface not yet up, waiting.           increment=42.703850717s 
DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl 
ERRO Trident REST interface was not available after 120.00 seconds. 
WARN An error occurred during installation, cleaning up. 
DEBU Deleted Kubernetes object by YAML. 
INFO Deleted cluster role binding. 
DEBU Deleted Kubernetes object by YAML. 
INFO Deleted cluster role. 
DEBU Deleted Kubernetes object by YAML. 
INFO Deleted service account. 
INFO Removed Trident user from security context constraint. 
DEBU Deleted Kubernetes object by name.            pvc=trident 
INFO Deleted PVC.                                  pvc=trident 
DEBU Deleted Kubernetes object by name.            pv=trident 
INFO Deleted PV.                                   pv=trident 
FATA Install failed; exit status 1; Error: could not get version. 500 Internal Server Error 
command terminated with exit code 1; use 'tridentctl logs' to learn more 
[root@se1-ocpma-e100 trident-installer]# 
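
As the installer output suggests, more detail can be pulled with tridentctl while the pod is still up; a minimal example, run from the installer directory with the same namespace as above:

./tridentctl logs -n netapp2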

While the container was running, we got the following output inside it:

[root@se1-ocpma-e100 ~]# oc rsh  trident-cdd5fc7b4-ls8h4 
Defaulting container name to trident-main. 
Use 'oc describe pod/trident-cdd5fc7b4-ls8h4 -n netapp2' to see all of the containers in this pod. 
/ # tridentctl -s 127.0.0.1:8000 version -o json 
Error: could not get version. 500 Internal Server Error 
/ # 
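
Since the 500 points at trident-main rather than the transport, we also pulled the logs of both containers in the pod (assuming the etcd sidecar container is named etcd, per the images in the events below):

oc logs trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main
oc logs trident-cdd5fc7b4-ls8h4 -n netapp2 -c etcd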

The events for the namespace are:

[root@se1-ocpma-e100 ~]# oc get ev 
LAST SEEN   FIRST SEEN   COUNT     NAME                                       KIND                    SOURCE                                MESSAGE 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6cb57af3b1   Pod                     default-scheduler                     Successfully assigned trident-cdd5fc7b4-ls8h4 to se1-ocpco-e142.sys.schwarz 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6cc51e36b5   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   MountVolume.SetUp succeeded for volume "trident-token-zx6zx" 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6cc6422647   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   MountVolume.SetUp succeeded for volume "trident" 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d154ff9ec   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Container image "netapp/trident:18.04.0" already present on machine 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d17e80af1   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Created container 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d1dfb02a7   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Started container 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d1e1580f3   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Container image "quay.io/coreos/etcd:v3.1.5" already present on machine 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d21c24ec8   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Created container 
1m          1m           1         trident-cdd5fc7b4-ls8h4.15284c6d27f171b5   Pod                     kubelet, se1-ocpco-e142.sys.schwarz   Started container 
1m          1m           1         trident-cdd5fc7b4.15284c6cb4e7e7c1         ReplicaSet              replicaset-controller                 Created pod: trident-cdd5fc7b4-ls8h4 
1m          1m           1         trident.15284c6a2984d474                   PersistentVolumeClaim   persistentvolume-controller           no persistent volumes available for this claim and no storage class is set 
1m          1m           1         trident.15284c6cb37d0c52                   Deployment              deployment-controller                 Scaled up replica set trident-cdd5fc7b4 to 1 
kaparora commented 6 years ago

Attached Trident logs: trident-logs-all.log

jkonline commented 6 years ago

Having the same issue. Was this ever resolved?

kaparora commented 6 years ago

Today we got Trident running with the iSCSI (ontap-san) driver. Everything works fine, from installation to provisioning to mounting and consuming storage.

We added NFS as a backend to Trident and used it for a MySQL deployment. Like etcd, MySQL doesn't work with the NFS backend. Here are the logs:

=> sourcing 20-validate-variables.sh ... 
=> sourcing 25-validate-replication-variables.sh ... 
=> sourcing 30-base-config.sh ... 
---> 08:41:11     Processing basic MySQL configuration files ... 
=> sourcing 60-replication-config.sh ... 
=> sourcing 70-s2i-config.sh ... 
---> 08:41:11     Processing additional arbitrary  MySQL configuration provided by s2i ... 
=> sourcing 40-paas.cnf ... 
=> sourcing 50-my-tuning.cnf ... 
---> 08:41:11     Initializing database ... 
---> 08:41:11     Running mysqld --initialize-insecure ... 
2018-05-18T08:41:11.628403Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details). 
2018-05-18T08:41:11.629989Z 0 [Warning] Duplicate ignore-db-dir directory name 'lost+found' found in the config file(s). Ignoring the duplicate. 
2018-05-18T08:41:11.630674Z 0 [ERROR] --initialize specified but the data directory has files in it. Aborting. 
2018-05-18T08:41:11.630700Z 0 [ERROR] Aborting 

NFS provisioning and mounting is fine.

I tried mounting the NFS volume on a host (worker node) and writing to it, and that works.

This may have something to do with OpenShift user permissions inside the pod. I have no clue. Any input is appreciated.
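
A simple way to test the permissions theory (pod name is a placeholder) is to compare the UID the pod runs as, which OpenShift assigns randomly by default, against the ownership of the mounted directory:

# UID/GID the MySQL container actually runs as
oc rsh <mysql-pod> id

# ownership and mode of the mounted data directory
oc rsh <mysql-pod> ls -ld /var/lib/mysql/data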

innergy commented 6 years ago

We're definitely not seeing this in general; our CI tests this same combination, with the same versions. In cases like these there is usually a configuration issue, either on the host or on the storage backend, that's getting in the way. Troubleshooting this over GitHub would likely require a great deal of back and forth, so my suggestion would be to open a support case so that we can work through it live.

kaparora commented 6 years ago

Thanks @innergy! A support case is already open.

jkonline commented 6 years ago

@kapilarora How did you resolve this error on the initial install:

DEBU Invoking tunneled command: oc exec trident-cdd5fc7b4-ls8h4 -n netapp2 -c trident-main -- tridentctl
DEBU REST interface not yet up, waiting.
jacobjohnanda commented 6 years ago

Having an issue with the Trident 18.04 install on OpenShift 3.7 as well.

acsulli commented 6 years ago

@kapilarora,

Any chance you can check the latency between your OpenShift nodes and the data LIF(s)? Just encountered a situation where extreme latency (> 200ms) was causing etcd to (apparently) falsely believe there were locks. Changing to a storage device which is dramatically closer fixed things.

I have no idea at what point the latency might become an issue for etcd, but it would be worth knowing if this could be an issue for you. All of the CI testing is with systems which are a couple ms apart at most, so it's not something we've encountered before.
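
Something as simple as a ping from each node to the data LIF gives a first approximation (address taken from the install log above):

# run on every OpenShift node; our CI systems see round trips of a couple ms at most
ping -c 10 4.168.16.25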

Andrew

kaparora commented 6 years ago

Trident is able to serve both backends, NFS and iSCSI. PostgreSQL runs fine with NFS; we are still having issues with MySQL. This is a configuration issue, but I don't think we can solve it at the Trident level, hence I am closing this issue for now. I am also not able to recreate it in my lab. The customer support case has also been closed.

kaparora commented 6 years ago

Today, after some troubleshooting, we figured out that the default OpenShift template has mountPath /var/lib/mysql/data. We changed it to /var/lib/mysql after looking at this issue: https://github.com/docker-library/mysql/issues/69

And MySQL is now running in the OpenShift cluster with ONTAP NFS.
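
For reference, the change just points the container's volumeMount one level up. One way to apply it, assuming a DeploymentConfig named mysql with a single volume mount (adjust the resource name and indexes to your template):

oc patch dc/mysql -n <project> --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/volumeMounts/0/mountPath", "value": "/var/lib/mysql"}]'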

japplewhite commented 6 years ago

I'm seeing this too with OpenShift Origin 3.9 and iSCSI, with an ONTAP simulator that was working on earlier deployments.

japplewhite commented 6 years ago

@kapilarora I have an env to reproduce it

rushins commented 6 years ago

I hit the same error today with OpenShift 3.9 using NFS (ontap-cdot 9.1 release): FATA Install failed; PVC trident was not bound after 120000000000 seconds

Any ideas?

japplewhite commented 6 years ago

@rushins I had success with the newest Trident beta release on Origin 3.9. I was using iSCSI, though, so YMMV. What I have found is that it works best on the first go: if you have an existing install that failed, you must clean up on the FAS by deleting the volume and LUN (for iSCSI) before trying again, as in the sketch below.
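
For the ONTAP-side cleanup, something along these lines from the cluster shell (names are placeholders, and this assumes nothing else lives in the leftover volume; volume destroy also removes any LUN inside it):

volume offline -vserver <svm> -volume <leftover_trident_volume>
volume destroy -vserver <svm> -volume <leftover_trident_volume>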

acsulli commented 6 years ago

@rushins, @japplewhite Make sure that something else doesn't have a pending PVC when creating Trident. In the original 18.04 there was a bug where a missing piece of metadata allowed the trident PV to be bound by another PVC. This is/was particularly an issue with OpenShift Enterprise, which deploys the Ansible Service Broker (a.k.a. ASB) by default.

If, after starting the Trident install, you run oc get pvc --all-namespaces and see another PVC bound to the trident PV, that is a good indicator.
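
For example:

# any PVC outside the Trident namespace showing "trident" in its VOLUME column is the culprit
oc get pvc --all-namespaces | grep trident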

This was fixed in 18.07 beta 1.

Andrew

rushins commented 6 years ago

Thanks, Andrew. Yes, you are right: 18.04 seems to have a bug with PVC binding. I followed your suggestion to use 18.07 beta 1 and it worked without any major issue in OpenShift Container Platform as a storage class; I was able to create a PV and bind it to a PVC.

Thanks.

rushins commented 6 years ago

Hi John, I tried with iSCSI and it didn't work; it was the 18.04 bug Andrew described. 18.07 beta 1 worked for all NAS and SAN traffic (NFS, iSCSI).

Anyway, thanks for your suggestion.