InfuseAI / primehub

open-source MLOps platform
https://docs.primehub.io
Apache License 2.0
390 stars 39 forks

[Bug] Install CE script freezes #334

Closed sstiene closed 3 years ago

sstiene commented 3 years ago

What happened:

I ran into problems with the PrimeHub CE install script: it freezes during installation.

What you expected to happen:

The install script runs to completion without problems.

How to reproduce it (as minimally and precisely as possible):

sudo ufw disable
sudo iptables --policy INPUT ACCEPT
sudo iptables --policy FORWARD ACCEPT

curl -O https://storage.googleapis.com/primehub-release/bin/primehub-install
chmod +x primehub-install
./primehub-install create singlenode
./primehub-install create primehub --primehub-version v3.5.2 --primehub-ce --helm-timeout 20m --enable-https

Anything else we need to know?:

Output:

[Search] Folder primehub-v3.5.2
[Not Found] Folder primehub-v3.5.2
[Search] tarball primehub-v3.5.2.tar.gz
[Not Found] tarball primehub-v3.5.2.tar.gz
[Search] primehub helm chart with version: v3.5.2
[Preflight Check]
[Preflight Check] Pass
[Prepare] PrimeHub require values
Please enter PRIMEHUB_DOMAIN: XXXX (I took the right domain here, this can't be the problem...)
Please enter KC_PASSWORD: XXXX
Please enter PH_PASSWORD: XXXX
[Init] primehub config
[Create] /home/mhoellmann/.primehub/config/microk8s/.env
[Verify] Domain Name: https://primehub.ni.dfki.de/healthz

[Check] Cert Manager
[Install] Cert Manager
"jetstack" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "primehub" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "infuseai" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: cert-manager
LAST DEPLOYED: Tue May 25 09:35:24 2021
NAMESPACE: cert-manager
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
cert-manager has been deployed successfully!

In order to begin issuing certificates, you will need to set up a ClusterIssuer or Issuer resource (for example, by creating a 'letsencrypt-staging' issuer).

More information on the different types of issuers and how to configure them can be found in our documentation:

https://cert-manager.io/docs/configuration/

For information on how to configure cert-manager to automatically provision Certificates for Ingress resources, take a look at the ingress-shim documentation:

https://cert-manager.io/docs/usage/ingress/
Waiting for deployment "cert-manager-webhook" rollout to finish: 0 of 1 updated replicas are available...
deployment "cert-manager-webhook" successfully rolled out
No resources found in default namespace.
[Apply] Cluster Issuer: letsencrypt-prod
clusterissuer.cert-manager.io/letsencrypt-prod created
[Install] PrimeHub
[Check] primehub.yaml
[Generate] primehub.yaml for CE
[Install] PrimeHub
"infuseai" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "primehub" chart repository
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "jetstack" chart repository
...Successfully got an update from the "infuseai" chart repository
Update Complete. ⎈Happy Helming!⎈
Release "primehub" does not exist. Installing it now.
coalesce.go:196: warning: cannot overwrite table with non table for extraEnv (map[])

Output of watch kubectl get pod -n hub:

Every 2.0s: kubectl get pod -n hub    primehub: Tue May 25 10:01:45 2021

NAME                                               READY   STATUS                       RESTARTS   AGE
csi-controller-rclone-0                            3/3     Running                      0          26m
csi-nodeplugin-rclone-qh862                        2/2     Running                      0          26m
hub-64cc75f4b5-vxcmn                               0/1     CreateContainerConfigError   0          26m
keycloak-0                                         0/1     Init:0/2                     0          26m
keycloak-postgres-0                                0/1     Pending                      0          26m
metacontroller-0                                   1/1     Running                      0          26m
primehub-admission-655bdb9f75-8pcnm                1/1     Running                      0          26m
primehub-bootstrap-j55kx                           1/1     Running                      0          26m
primehub-console-d94676478-px2t9                   0/1     CreateContainerConfigError   0          26m
primehub-controller-b7d878c54-q6trx                2/2     Running                      0          26m
primehub-fluentd-q66p6                             1/1     Running                      0          26m
primehub-graphql-657b7478ff-5djrl                  0/1     CreateContainerConfigError   0          26m
primehub-metacontroller-webhook-7597f68459-lx9pc   1/1     Running                      0          26m
primehub-minio-0                                   1/1     Running                      0          26m
primehub-shared-space-tusd-5c9cd99d98-psx9h        1/1     Running                      0          26m
primehub-watcher-5b4c84f5b9-fdg4g                  0/1     CreateContainerConfigError   0          26m
proxy-6bb567848c-sttzb                             1/1     Running                      0          26m

kubectl logs -n hub $(kubectl get pod -n hub | grep primehub-bootstrap | cut -d' ' -f1) -f

repeats

[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
[xxx] http://keycloak-http.hub/auth
...

The firewall was disabled, so this can't be the problem.

Environment:

gives only

NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.11", GitCommit:"ea5f00d93211b7c80247bf607cfa422ad6fb5347", GitTreeState:"clean", BuildDate:"2020-08-13T15:20:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.17", GitCommit:"f3abc15296f3a3f54e4ee42e830c61047b13895f", GitTreeState:"clean", BuildDate:"2021-01-13T13:13:00Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Self-hosted virtual machine

NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Linux primehub 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

kentwelcome commented 3 years ago

Hi sstiene,

Based on the logs you provided, it seems that keycloak-postgres-0 cannot be scheduled.

...
keycloak-postgres-0 0/1 Pending 0 26m
...

My guess is that keycloak-postgres-0 cannot mount its PVC. Could you please also run the following commands?

kubectl get pvc -n hub
kubectl get storageclass

Thanks.

sstiene commented 3 years ago

kubectl get pvc -n hub

NAME                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
data-keycloak-postgres-0   Bound    pvc-8b6f35ef-1a32-4222-ba44-04ba0c58fcbd   8Gi        RWO            microk8s-hostpath   2m10s
export-primehub-minio-0    Bound    pvc-f9dce678-1e75-4b85-aa66-3b8e24260acd   10Gi       RWO            microk8s-hostpath   2m10s
hub-db-dir                 Bound    pvc-980004d2-4438-43cc-b9b8-bcc4b9ac97ab   1Gi        RWO            microk8s-hostpath   2m11s
primehub-store             Bound    primehub-store                             64Gi       RWX            rclone              2m11s

kubectl get storageclass

NAME                          PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
microk8s-hostpath (default)   microk8s.io/hostpath           Delete          Immediate           false                  6m39s
rclone                        kubernetes.io/no-provisioner   Delete          Immediate           false                  3m34s

kentwelcome commented 3 years ago

It seems the PVC for keycloak-postgres was provisioned successfully, so the pending state is not caused by a failure to mount the PVC.

Another possible reason for a pending pod is insufficient resources (CPU or memory). May I ask how many vCPUs and how much memory you configured for this virtual machine? Alternatively, you can run the following command to show why keycloak-postgres-0 is pending:

kubectl describe pod -n hub keycloak-postgres-0

In general, PrimeHub needs at least 4 vCPUs and 8G of memory to run. Thanks.
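
As a rough sketch of that diagnosis step (not an official PrimeHub tool; not_ready_pods is a hypothetical helper name), the pod listing can be filtered down to the pods that are not running. The Events section of kubectl describe then usually states the scheduling reason, for example insufficient CPU or memory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pod` output on stdin and print
# the name and status of every pod that is neither Running nor Completed.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod -n hub | not_ready_pods
#   kubectl describe pod -n hub keycloak-postgres-0 | sed -n '/^Events:/,$p'
```

The awk filter skips the header row and relies on STATUS being the third column, which holds for the default kubectl get pod output shown in this thread.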

sstiene commented 3 years ago

Yes, that might be the case. I only set up a very small test environment with 2 CPUs. I will extend it and come back to you. Thanks. Would it make sense to add a resource check to the install script that warns if CPU, RAM, etc. are insufficient?
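
A minimal sketch of what such a check could look like, assuming the 4 vCPU / 8G minimum mentioned in this thread. The function name, variable names, and warning text are hypothetical and not part of primehub-install; the memory threshold is set slightly below 8 GiB as an assumption, because MemTotal in /proc/meminfo excludes memory reserved by the kernel.

```shell
#!/usr/bin/env bash
# Sketch of a preflight resource check (hypothetical, not from primehub-install).
MIN_CPUS=4
MIN_MEM_KB=7864320   # 7.5 GiB in kB; assumed slack below the 8G guidance

# check_resources <cpu-count> <mem-total-kB>
# Prints a warning per shortfall and returns 1 if any minimum is not met.
check_resources() {
  local cpus=$1 mem_kb=$2 rc=0
  if [ "$cpus" -lt "$MIN_CPUS" ]; then
    echo "[Warning] found $cpus CPU(s), at least $MIN_CPUS recommended" >&2
    rc=1
  fi
  if [ "$mem_kb" -lt "$MIN_MEM_KB" ]; then
    echo "[Warning] found $((mem_kb / 1024)) MiB memory, at least $((MIN_MEM_KB / 1024)) MiB recommended" >&2
    rc=1
  fi
  return "$rc"
}

# On a Linux host the inputs could come from nproc and /proc/meminfo:
#   check_resources "$(nproc)" "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
```

Taking the values as arguments keeps the check easy to try out with made-up numbers; on my 2-CPU test VM it would have printed the CPU warning before the install started.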

kentwelcome commented 3 years ago

That's a great idea! We will add your feedback to our milestone. Thank you very much for the suggestion.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.