PacktPublishing / Machine-Learning-on-Kubernetes

Machine Learning on Kubernetes, published by Packt
MIT License

Chapter05 manifests/kfdef/ml-platform.yaml airflow problems #10

Open Cvija2609 opened 2 years ago

Cvija2609 commented 2 years ago

Platform: minikube version: v1.24.0

tl;dr: Airflow won't start; logs of everything are listed below.

I'm trying to recreate everything and I'm stuck at this part. I've been waiting for some time for everything defined in ml-platform.yaml to come up, and app-aflow-airflow-web has been in CrashLoopBackOff state for an hour now.

I've tried killing and recreating it, but nothing has worked.

Here is the list of pods created during execution of this command:

kubectl apply -f manifests/kfdef/ml-platform.yaml -n ml-workshop 
NAME                                          READY   STATUS             RESTARTS        AGE
app-aflow-airflow-scheduler-f7fc5d4cb-dndwb   2/2     Running            2 (6m30s ago)   14m
app-aflow-airflow-web-54659fb97d-n6lms        1/2     CrashLoopBackOff   4 (29s ago)     2m34s
app-aflow-airflow-web-7c566d79d-4v2wv         1/2     CrashLoopBackOff   4 (19s ago)     2m33s
app-aflow-airflow-worker-0                    1/2     Running            0               2m17s
app-aflow-postgresql-0                        1/1     Running            0               14m
app-aflow-redis-master-0                      1/1     Running            0               14m
grafana-5dc6cf89d-vs8xd                       1/1     Running            0               14m
jupyterhub-7848ccd4b7-jkvpr                   1/1     Running            0               14m
jupyterhub-db-0                               1/1     Running            0               14m
minio-ml-workshop--1-m2bh4                    0/1     Completed          2               14m
minio-ml-workshop-6b84fdc7c4-7nsql            1/1     Running            0               14m
mlflow-d65ccb65d-8wpm6                        2/2     Running            0               14m
mlflow-db-0                                   1/1     Running            0               14m
seldon-controller-manager-7f67f4985b-bs5sq    1/1     Running            0               14m
spark-operator-69cfd96bf4-7h94n               1/1     Running            0               14m

I've changed the address to the minikube IP as mentioned.
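
For reference, this is roughly what that substitution looks like on my side (the placeholder string below is purely illustrative; the actual value to replace is whatever the book names in ml-platform.yaml):

MINIKUBE_IP=$(minikube ip)
# hypothetical placeholder; replace with the actual address used in ml-platform.yaml
sed -i "s/REPLACE_WITH_MINIKUBE_IP/${MINIKUBE_IP}/g" manifests/kfdef/ml-platform.yaml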

Logs from the failing pod app-aflow-airflow-web-7c566d79d-4v2wv, container airflow-web:

airflow 14:25:27.02
airflow 14:25:27.02 Welcome to the Bitnami airflow container
airflow 14:25:27.02 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-airflow
airflow 14:25:27.02 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-airflow/issues
airflow 14:25:27.02
airflow 14:25:27.02 INFO  ==> Enabling non-root system user with nss_wrapper
airflow 14:25:27.03 INFO  ==> ** Starting Airflow setup **
airflow 14:25:27.05 INFO  ==> Initializing Airflow ...
airflow 14:25:27.06 INFO  ==> No injected configuration file found. Creating default config file
airflow 14:25:27.77 INFO  ==> Configuring Airflow webserver authentication
airflow 14:25:27.78 INFO  ==> Configuring Airflow database
airflow 14:25:27.81 INFO  ==> Configuring Celery Executor
airflow 14:25:27.83 INFO  ==> Waiting for PostgreSQL to be available at app-aflow-postgresql:5432...
Stream closed EOF for ml-workshop/app-aflow-airflow-web-7c566d79d-4v2wv (airflow-web)
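
Since the last log line is the web container waiting for PostgreSQL, a quick way to sanity-check that the database service is reachable is a throwaway pod like this (just a sketch using pg_isready from a stock postgres image, not something from the book):

$ kubectl run pg-check -n ml-workshop --rm -i --restart=Never \
    --image=postgres:13 -- pg_isready -h app-aflow-postgresql -p 5432

It should report "app-aflow-postgresql:5432 - accepting connections" if the service resolves and the database answers.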

Describing the pod also does not reveal much to me:

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  14m                   default-scheduler  Successfully assigned ml-workshop/app-aflow-airflow-web-7c566d79d-4v2wv to minikube
  Normal   Pulling    14m                   kubelet            Pulling image "registry.access.redhat.com/rhscl/postgresql-96-rhel7:latest"
  Normal   Pulled     14m                   kubelet            Successfully pulled image "registry.access.redhat.com/rhscl/postgresql-96-rhel7:latest" in 945.837211ms
  Normal   Created    14m                   kubelet            Created container waifordatabase
  Normal   Started    14m                   kubelet            Started container waifordatabase
  Normal   Pulling    14m                   kubelet            Pulling image "k8s.gcr.io/git-sync/git-sync:v3.2.2"
  Normal   Pulled     14m                   kubelet            Successfully pulled image "k8s.gcr.io/git-sync/git-sync:v3.2.2" in 2.879638022s
  Normal   Created    14m                   kubelet            Created container git-sync
  Normal   Started    14m                   kubelet            Started container git-sync
  Normal   Pulled     14m                   kubelet            Successfully pulled image "quay.io/ml-on-k8s/airflow:2.1.7.web.keycloak" in 1.810590705s
  Normal   Pulled     14m                   kubelet            Successfully pulled image "quay.io/ml-on-k8s/airflow:2.1.7.web.keycloak" in 1.999765805s
  Normal   Pulled     13m                   kubelet            Successfully pulled image "quay.io/ml-on-k8s/airflow:2.1.7.web.keycloak" in 2.210168418s
  Normal   Created    13m (x3 over 14m)     kubelet            Created container airflow-web
  Normal   Started    13m (x3 over 14m)     kubelet            Started container airflow-web
  Normal   Pulling    13m (x4 over 14m)     kubelet            Pulling image "quay.io/ml-on-k8s/airflow:2.1.7.web.keycloak"
  Warning  BackOff    4m14s (x46 over 13m)  kubelet            Back-off restarting failed container

ReplicaSet events:

Normal  SuccessfulCreate  15m   replicaset-controller  Created pod: app-aflow-airflow-web-7c566d79d-4v2wv 

kubectl logs:

$ kubectl logs -n ml-workshop app-aflow-airflow-web-7c566d79d-4v2wv
Defaulted container "git-sync" out of: git-sync, airflow-web, waifordatabase (init)
INFO: detected pid 1, running init handler
I1018 14:19:00.618669      12 main.go:430]  "level"=0 "msg"="starting up"  "args"=["/git-sync"] "pid"=12
I1018 14:19:00.618718      12 main.go:694]  "level"=0 "msg"="cloning repo"  "origin"="https://github.com/airflow-dags/dags/" "path"="/tmp/git"
I1018 14:19:14.308794      12 main.go:586]  "level"=0 "msg"="syncing git"  "hash"="8f22697a507c40bb42d4c674edd6b5c49ea0ecbb" "rev"="HEAD"
I1018 14:19:17.552166      12 main.go:607]  "level"=0 "msg"="adding worktree"  "branch"="origin/main" "path"="/tmp/git/rev-8f22697a507c40bb42d4c674edd6b5c49ea0ecbb"
I1018 14:19:17.556761      12 main.go:630]  "level"=0 "msg"="reset worktree to hash"  "hash"="8f22697a507c40bb42d4c674edd6b5c49ea0ecbb" "path"="/tmp/git/rev-8f22697a507c40bb42d4c674edd6b5c49ea0ecbb"
I1018 14:19:17.556781      12 main.go:635]  "level"=0 "msg"="updating submodules" 

previous logs:

$ kubectl logs -n ml-workshop app-aflow-airflow-web-7c566d79d-4v2wv --previous
Defaulted container "git-sync" out of: git-sync, airflow-web, waifordatabase (init)
Error from server (BadRequest): previous terminated container "git-sync" in pod "app-aflow-airflow-web-7c566d79d-4v2wv" not found
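
For completeness, kubectl logs defaults to the first container (git-sync) here; the crashing container can be targeted explicitly with -c:

$ kubectl logs -n ml-workshop app-aflow-airflow-web-7c566d79d-4v2wv -c airflow-web
$ kubectl logs -n ml-workshop app-aflow-airflow-web-7c566d79d-4v2wv -c airflow-web --previous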

The Service for PostgreSQL exists and the waitfordatabase init container completed successfully.

When I deleted this with:

kubectl delete -f manifests/kfdef/ml-platform.yaml -n ml-workshop 

and reapplied it with the same command as above, the airflow2-proxy secret was missing. I added it from manifests/airflow2/base/service-accounts.yaml and the same error appeared.
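
In case it helps someone else, re-creating the missing secret from that file looks roughly like this (assuming the secret is defined in that manifest, as described above):

$ kubectl apply -f manifests/airflow2/base/service-accounts.yaml -n ml-workshop
$ kubectl get secret airflow2-proxy -n ml-workshop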

webmakaka commented 2 years ago

Hi,

I am not sure whether it will help, but you can try running the example from my working repo: https://github.com/webmakaka/Machine-Learning-on-Kubernetes

A working setup should look like https://github.com/PacktPublishing/Machine-Learning-on-Kubernetes/issues/6#issuecomment-1221355813

Cvija2609 commented 2 years ago

Thank you @webmakaka - I tried it, still the same problem :(

How long did it take for you to initialize Airflow? I'll leave it running and see if that may be the problem.

webmakaka commented 2 years ago

Less than 17 minutes.

I sent you an email (mar***@gm.com) with my step-by-step instructions on how to run the environment for this book.

vishal-git commented 1 year ago

I am having problems with the same step. Everything until that step is working fine.

I am using --driver=docker and minikube version v1.28.0 on WSL2 (Ubuntu).

$ kubectl create -f manifests/kfdef/ml-platform.yaml -n ml-workshop

kfdef.kfdef.apps.kubeflow.org/opendatahub-ml-workshop created

This works fine.

But then none of the pods are being created (see below). I went through these steps multiple times (started all over again), but to no avail:

$ kubectl get pods -n ml-workshop
No resources found in ml-workshop namespace.

$ kubectl get all -n ml-workshop
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/jupyterhub      ClusterIP   10.97.249.33    <none>        8080/TCP,8081/TCP   40m
service/jupyterhub-db   ClusterIP   10.107.85.252   <none>        5432/TCP            40m
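
Since the KfDef object was created but nothing appears in ml-workshop, the next place to look is probably the opendatahub-operator that reconciles it (I'm assuming the operator runs in the operators namespace):

$ kubectl get kfdef -n ml-workshop
$ kubectl logs -n operators deployment/opendatahub-operator --tail=100
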
webmakaka commented 1 year ago

@vishal-git Try repeating from scratch. You can use my repo and the instructions in it, if needed.

Cvija2609 commented 1 year ago

I've tried everything multiple times and now I'm constantly getting the same error as @vishal-git.

I've dug deeper and found that opendatahub-operator is throwing this:

...
configmap/jupyterhub-default-groups-config serverside-applied
configmap/spark-cluster-template serverside-applied
configmap/parameters serverside-applied
configmap/odh-jupyterhub-sizes serverside-applied
configmap/jupyter-singleuser-profiles serverside-applied
configmap/jupyterhub-cfg serverside-applied
persistentvolumeclaim/jupyterhub-db serverside-applied
serviceaccount/jupyterhub-hub serverside-applied
clusterrole.rbac.authorization.k8s.io/jupyterhub-cluster serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/jupyterhub-cluster serverside-applied
role.rbac.authorization.k8s.io/jupyterhub serverside-applied
service/jupyterhub-db serverside-applied
service/jupyterhub serverside-applied
ingress.networking.k8s.io/jupyterhub serverside-applied
route.route.openshift.io/jupyterhub serverside-applied
time="2023-01-02T13:00:29Z" level=warning msg="Encountered error applying application jupyterhub:  (kubeflow.error): Code 500 with message: Apply.Run : [failed to create typed patch object: .metadata.label: field not declared in schema, failed to create typed patch object: .roleRef.namespace: field not declared in schema, failed to create typed patch object: errors:\n  .spec.selector.deploymentconfig: field not declared in schema\n  .spec.strategy.recreateParams: field not declared in schema\n  .spec.triggers: field not declared in schema, failed to create typed patch object: errors:\n  .spec.selector.deploymentconfig: field not declared in schema\n  .spec.strategy: field not declared in schema]"
time="2023-01-02T13:00:29Z" level=warning msg="Will retry in 6 seconds."
mobaqii commented 1 year ago

I have tried multiple times and I have the same issue as @vishal-git. minikube version: v1.25.2

The output of k logs -f opendatahub-operator-869cdfdf6f-drvf2 -n operators shows the error:

time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." clusterrolebinding.rbac.authorization.k8s.io/jupyterhub-cluster serverside-applied time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-hub.ml-workshop." time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-hub.ml-workshop." role.rbac.authorization.k8s.io/jupyterhub serverside-applied time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." service/jupyterhub-db serverside-applied time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." time="2023-01-04T18:47:29Z" level=info msg="Watch a change for Kubeflow resource: jupyterhub-db.ml-workshop." service/jupyterhub serverside-applied ingress.networking.k8s.io/jupyterhub serverside-applied route.route.openshift.io/jupyterhub serverside-applied time="2023-01-04T18:47:30Z" level=warning msg="Encountered error applying application jupyterhub: (kubeflow.error): Code 500 with message: Apply.Run : [failed to create typed patch object: .metadata.label: field not declared in schema, failed to create typed patch object: .roleRef.namespace: field not declared in schema, failed to create typed patch object: errors:\n .spec.selector.app: field not declared in schema\n .spec.strategy.recreateParams: field not declared in schema\n .spec.triggers: field not declared in schema, failed to create typed patch object: errors:\n .spec.selector.app: field not declared in schema\n .spec.strategy: field not declared in schema]" time="2023-01-04T18:47:30Z" level=warning msg="Will retry in 4 seconds." configmap/jupyterhub-default-groups-config serverside-applied configmap/spark-cluster-template serverside-applied configmap/parameters serverside-applied configmap/odh-jupyterhub-sizes serverside-applied configmap/jupyter-singleuser-profiles serverside-applied configmap/jupyterhub-cfg serverside-applied

Any ideas?

webmakaka commented 1 year ago

I think you should use the recommended Kubernetes version in minikube.

mobaqii commented 1 year ago

I already did. I started from scratch and still have the same issue. minikube version: v1.24.0 and Kubernetes version 1.22.4, the same as mentioned in the book.
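
For reference, starting minikube pinned to that Kubernetes version looks roughly like this (driver and resource sizes here are just an example, not prescribed by the book):

$ minikube start --driver=docker \
    --kubernetes-version=v1.22.4 \
    --cpus=8 --memory=30g --disk-size=60g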

webmakaka commented 1 year ago

Can you try running the examples from my repo, following the instructions?

https://github.com/webmakaka/Machine-Learning-on-Kubernetes/tree/master/docs/01-environment

And then

https://github.com/webmakaka/Machine-Learning-on-Kubernetes/blob/master/docs/05-data-engineering.md

(Use Google Translate if needed to translate from Russian.)

If something doesn't work, I'll check it on my environment next week.

Cvija2609 commented 1 year ago

@webmakaka could you please try running this whole setup again? If you have the resources available, of course.

I've tried multiple times from scratch. I've even run an EC2 instance on AWS (t3.2xlarge) and tried with it, but with no success.

The minikube version and Kubernetes version are the same as in the book.

I've checked the logs in the operators namespace again and opendatahub-operator throws the same errors as before.

To sum up, I've tried multiple times and keep getting the same result. Something is not working as intended and I don't know what.

Airflow is not the problem anymore; I can't even get to that point to check.

$ minikube profile list
|----------|-----------|---------|--------------|------|---------|---------|-------|
| Profile  | VM Driver | Runtime |      IP      | Port | Version | Status  | Nodes |
|----------|-----------|---------|--------------|------|---------|---------|-------|
| minikube | podman    | docker  | 192.168.49.2 | 8443 | v1.22.4 | Running |     1 |
|----------|-----------|---------|--------------|------|---------|---------|-------|

$ minikube config view
- cpus: 8
- disk-size: 60GB
- memory: 30GB

$ minikube version
minikube version: v1.24.0
commit: 76b94fb3c4e8ac5062daf70d60cf03ddcc0a741b
webmakaka commented 1 year ago

I checked. Same error as yours. If I find a solution, I'll write up how to fix it.

webmakaka commented 1 year ago

I updated the configs in my repo.

The current situation is:

Screenshot from 2023-01-20 03-39-31

Screenshot from 2023-01-20 03-39-47

softjobs commented 1 year ago

I now have the same issue reported by vishal-git above... Almost at the point of ditching this book and moving on to some mature, proven content from O'Reilly -- edition 2 of the wildly praised book by Dr. Lakshmanan.

Almost total waste of crucial time on this untested book. Sorry folks, better luck next time!

webmakaka commented 1 year ago

Everything worked a year ago.

webmakaka commented 1 year ago

I updated my configs and now all pods run. There were problems with the newer pod image versions from the author's registry.
When I reverted to the original ones, the platform started running without errors (at least it works as far as page 105).
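
If anyone wants to compare the image versions their pods actually pulled against the ones in the repo manifests, a one-liner like this lists them (just a convenience sketch, not part of the original fix):

$ kubectl get pods -n ml-workshop \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'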

pic1