GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

Flink Operator installation is failing #467

Open sumchak1 opened 2 years ago

sumchak1 commented 2 years ago

Hi team, installing the Flink operator on IBM Cloud is failing: the controller-manager pod goes into CrashLoopBackOff. Details below:

$ k get all -n flink-operator-system
NAME                                                     READY   STATUS             RESTARTS   AGE
pod/cert-job-wtvwr                                       0/1     Completed          0          11m
pod/flink-operator-controller-manager-848b69b444-86bf2   1/2     CrashLoopBackOff   5          4m29s

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/flink-operator-controller-manager-metrics-service   ClusterIP   172.21.152.91    <none>        8443/TCP   26m
service/flink-operator-webhook-service                      ClusterIP   172.21.210.155   <none>        443/TCP    26m

NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/flink-operator-controller-manager   0/1     1            0           26m

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/flink-operator-controller-manager-848b69b444   1         1         0       26m

NAME                 COMPLETIONS   DURATION   AGE
job.batch/cert-job   1/1           3s         11m
$ k logs flink-operator-controller-manager-848b69b444-t5k2n -n flink-operator-system
Error from server (NotFound): pods "flink-operator-controller-manager-848b69b444-t5k2n" not found
[sumit@sumit flink]$ k logs flink-operator-controller-manager-848b69b444-86bf2 -n flink-operator-system --all-containers
I0729 12:43:04.599359       1 main.go:209] Generating self signed cert as no cert is provided
I0729 12:43:04.820946       1 main.go:242] Listening securely on 0.0.0.0:8443
W0729 12:49:19.067631       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021-07-29T12:49:20.074Z    INFO    controller-runtime.metrics  metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2021-07-29T12:49:20.074Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "flinkoperator.k8s.io/v1beta1, Kind=FlinkCluster", "path": "/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z    INFO    controller-runtime.webhook  registering webhook {"path": "/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "flinkoperator.k8s.io/v1beta1, Kind=FlinkCluster", "path": "/validate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z    INFO    controller-runtime.webhook  registering webhook {"path": "/validate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z    INFO    setup   Starting manager
2021-07-29T12:49:20.106Z    INFO    controller-runtime.manager  starting metrics server {"path": "/metrics"}
2021-07-29T12:49:20.106Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:20.106Z    INFO    controller-runtime.webhook.webhooks starting webhook server
2021-07-29T12:49:20.127Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2021-07-29T12:49:20.127Z    INFO    controller-runtime.webhook  serving webhook server  {"host": "", "port": 443}
2021-07-29T12:49:20.135Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2021-07-29T12:49:20.307Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:21.208Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:21.309Z    INFO    controller-runtime.controller   Starting EventSource    {"controller": "flinkcluster", "source": "kind source: /, Kind="}

$ k describe pod flink-operator-controller-manager-848b69b444-86bf2 -n flink-operator-system
Name:               flink-operator-controller-manager-848b69b444-86bf2
Namespace:          flink-operator-system
Priority:           0
PriorityClassName:  <none>
Node:               10.148.145.115/10.148.145.115
Start Time:         Thu, 29 Jul 2021 18:13:02 +0530
Labels:             app=flink-operator
                    control-plane=controller-manager
                    pod-template-hash=848b69b444
Annotations:        cni.projectcalico.org/podIP: 172.30.149.218/32
                    cni.projectcalico.org/podIPs: 172.30.149.218/32
                    kubernetes.io/psp: ibm-privileged-psp
Status:             Running
IP:                 172.30.149.218
Controlled By:      ReplicaSet/flink-operator-controller-manager-848b69b444
Containers:
  kube-rbac-proxy:
    Container ID:  containerd://85ff7f6cf568dc376a0b248be8022e81bd3da48c83ea4461f050694c4a22acec
    Image:         gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
    Image ID:      gcr.io/kubebuilder/kube-rbac-proxy@sha256:297896d96b827bbcb1abd696da1b2d81cab88359ac34cce0e8281f266b4e08de
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    State:          Running
      Started:      Thu, 29 Jul 2021 18:13:04 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5c4cn (ro)
  flink-operator:
    Container ID:  containerd://2880f9d0de68228476738c8ce01d6f82c36149709e60eb33b014b1aaad19a073
    Image:         gcr.io/flink-operator/flink-operator:latest
    Image ID:      gcr.io/flink-operator/flink-operator@sha256:af78aef1e6ca3e082f5d03b53db09fe0d31e21424ac87c9f0204b3739001d3cc
    Port:          443/TCP
    Host Port:     0/TCP
    Command:
      /flink-operator
    Args:
      --metrics-addr=127.0.0.1:8080
      --watch-namespace=
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 29 Jul 2021 18:16:32 +0530
      Finished:     Thu, 29 Jul 2021 18:16:36 +0530
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     100m
      memory:  30Mi
    Requests:
      cpu:        100m
      memory:     20Mi
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5c4cn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
  default-token-5c4cn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5c4cn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:
  Type     Reason     Age                    From                     Message
  ----     ------     ----                   ----                     -------
  Normal   Scheduled  4m8s                   default-scheduler        Successfully assigned flink-operator-system/flink-operator-controller-manager-848b69b444-86bf2 to 10.148.145.115
  Normal   Pulled     4m6s                   kubelet, 10.148.145.115  Container image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0" already present on machine
  Normal   Created    4m6s                   kubelet, 10.148.145.115  Created container kube-rbac-proxy
  Normal   Started    4m6s                   kubelet, 10.148.145.115  Started container kube-rbac-proxy
  Normal   Pulled     4m6s                   kubelet, 10.148.145.115  Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 233.095305ms
  Normal   Pulled     4m                     kubelet, 10.148.145.115  Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 326.358312ms
  Normal   Pulled     3m42s                  kubelet, 10.148.145.115  Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 241.20523ms
  Normal   Created    3m9s (x4 over 4m6s)    kubelet, 10.148.145.115  Created container flink-operator
  Normal   Started    3m9s (x4 over 4m5s)    kubelet, 10.148.145.115  Started container flink-operator
  Normal   Pulling    3m9s (x4 over 4m6s)    kubelet, 10.148.145.115  Pulling image "gcr.io/flink-operator/flink-operator:latest"
  Normal   Pulled     3m9s                   kubelet, 10.148.145.115  Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 230.947675ms
  Warning  BackOff    2m40s (x6 over 3m53s)  kubelet, 10.148.145.115  Back-off restarting failed container

fallen-up commented 2 years ago

@sumchak1 look at the flink-operator container in your describe output:

 State:          Waiting
   Reason:       CrashLoopBackOff
 Last State:     Terminated
   Reason:       OOMKilled

The container is being OOMKilled (exit code 137) against its 30Mi memory limit. Try raising the memory limits and requests: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/helm-chart/flink-operator/templates/flink-operator.yaml#L346-L351
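
If you would rather not edit the manifest and redeploy, something like the following might work as a quick check. It is only a sketch: the deployment and container names are taken from the output above, but the 128Mi/64Mi values are guesses rather than project recommendations, and `build_patch` is a helper name made up for this example. It builds a strategic-merge patch that touches only the memory settings of the `flink-operator` container.

```shell
# Assumed values: 128Mi/64Mi are illustrative guesses, not project recommendations.
MEM_LIMIT="${MEM_LIMIT:-128Mi}"
MEM_REQUEST="${MEM_REQUEST:-64Mi}"

# build_patch (hypothetical helper): emit a strategic-merge patch that changes
# only the memory limit/request of the container named "flink-operator".
build_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"flink-operator","resources":{"limits":{"memory":"%s"},"requests":{"memory":"%s"}}}]}}}}' "$1" "$2"
}

PATCH="$(build_patch "$MEM_LIMIT" "$MEM_REQUEST")"
echo "$PATCH"

# To apply it (needs cluster access), then wait for the new pod:
# kubectl -n flink-operator-system patch deployment flink-operator-controller-manager --patch "$PATCH"
# kubectl -n flink-operator-system rollout status deployment flink-operator-controller-manager
```

If the pod still shows OOMKilled under Last State after the rollout, the limit is probably still too low for your cluster. Also note that a later reinstall from the original manifest would put the 30Mi limit back, so the manifest linked above is the place for a permanent change.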