GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.
Apache License 2.0

FlinkOperator crashes when deploying a new Job in a FlinkCluster #408

Open morelina opened 3 years ago

morelina commented 3 years ago

I am trying to update a running Job in my Flink Job Cluster. I am using a commit from a PR that is still open and contains some fixes: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/401/commits/72e89b2684ada58ec9a4987d49f902841a86607b

FlinkOperator triggers the Savepoint and it is successfully created. However, flink-operator crashes immediately after.

These are the events in FlinkCluster:

    Normal SavepointCreated 17m FlinkOperator Successfully savepoint created
    Normal SavepointTriggered 12m FlinkOperator Triggered savepoint for update: triggerID 590c343c5e3934e4996e5904b719cf17.
    Normal SavepointCreated 12m FlinkOperator Successfully savepoint created
    Normal SavepointTriggered 7m1s FlinkOperator Triggered savepoint for update: triggerID 51660f2c77254db025d37e23c0fa57e7.
    Normal SavepointCreated 6m56s FlinkOperator Successfully savepoint created
    Normal SavepointTriggered 107s FlinkOperator Triggered savepoint for update: triggerID ffc1ff58f0ee872a278fab5b

And these are the logs from the crash:

controllers.FlinkCluster ---------- 4. Take actions ---------- {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster ConfigMap already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "JobManager"}
controllers.FlinkCluster JobManager service already exists, no action {"cluster": "namespace-a/cluster-a"}
controllers.FlinkCluster Statefulset already exists, no action {"cluster": "namespace-a/cluster-a", "component": "TaskManager"}
controllers.FlinkCluster Job is about to be restarted to update {"cluster": "namespace-a/cluster-a"}
E0208 17:15:08.662342 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1422fc0, 0x2241f50)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:48 +0x82
panic(0x1422fc0, 0x2241f50)
    /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
    /workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
    /workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
    /workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
    /workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).worker(0xc0000c89c0)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x12eee3b]

goroutine 362 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/runtime/runtime.go:55 +0x105
panic(0x1422fc0, 0x2241f50)
    /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/googlecloudplatform/flink-operator/controllers.(ClusterReconciler).reconcileJob(0xc001cad4a0, 0x15c2f00, 0x0, 0x0, 0x0)
    /workspace/controllers/flinkcluster_reconciler.go:511 +0x5fb
github.com/googlecloudplatform/flink-operator/controllers.(ClusterReconciler).reconcile(0xc001cad4a0, 0xc001cad4a0, 0x25, 0x0, 0x0)
    /workspace/controllers/flinkcluster_reconciler.go:111 +0x223
github.com/googlecloudplatform/flink-operator/controllers.(FlinkClusterHandler).reconcile(0xc000d55b28, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0xc0013e6800, 0x36e54d7482f9f143, 0x13b94c0, 0xc0013e6770)
    /workspace/controllers/flinkcluster_controller.go:220 +0xb91
github.com/googlecloudplatform/flink-operator/controllers.(FlinkClusterReconciler).Reconcile(0xc0007165a0, 0xc0012a7350, 0x10, 0xc00127db20, 0x1d, 0x0, 0xc0007a47273aff35, 0xc000724360, 0xc000724128)
    /workspace/controllers/flinkcluster_controller.go:82 +0x249
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler(0xc0000c89c0, 0x1475780, 0xc0013e6760, 0x0)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256 +0x161
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem(0xc0000c89c0, 0xc00051ee00)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232 +0xae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).worker(0xc0000c89c0)
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00007a950)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00007a950, 0x17b1ee0, 0xc0004362d0, 0x1668101, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00007a950, 0x3b9aca00, 0x0, 0x1, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00007a950, 0x3b9aca00, 0xc000114360)
    /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func1
    /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:193 +0x305

elanv commented 3 years ago

It seems to be caused by the newly added field. A workaround would be to set the value of spec.job.takeSavepointOnUpgrade to true.
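
For readers unfamiliar with this failure mode, here is a minimal Go sketch of how an unset optional field can produce a nil pointer dereference like the one in the trace, and how a nil check avoids it. It assumes the new field is an optional pointer (`*bool`), which would fit both the panic and the workaround. The `JobSpec` struct and the `shouldTakeSavepoint` helper are hypothetical stand-ins, not the operator's actual code, and the default value passed in is likewise an assumption.

```go
package main

import "fmt"

// JobSpec is a hypothetical, trimmed-down stand-in for the job spec.
// Assumption: the newly added field is an optional pointer, so it is nil
// whenever the user leaves it out of the manifest.
type JobSpec struct {
	TakeSavepointOnUpgrade *bool
}

// shouldTakeSavepoint is a hypothetical helper that reads the optional flag
// without dereferencing a nil pointer. When the spec or the field is unset,
// it returns the caller-supplied default instead of panicking.
func shouldTakeSavepoint(spec *JobSpec, def bool) bool {
	if spec == nil || spec.TakeSavepointOnUpgrade == nil {
		return def
	}
	return *spec.TakeSavepointOnUpgrade
}

func main() {
	// Field left unset: writing *unset.TakeSavepointOnUpgrade directly
	// would panic with "invalid memory address or nil pointer dereference".
	unset := &JobSpec{}

	// Field set explicitly, which is what the workaround amounts to.
	enabled := true
	set := &JobSpec{TakeSavepointOnUpgrade: &enabled}

	fmt.Println("unset spec:", shouldTakeSavepoint(unset, true))
	fmt.Println("explicit spec:", shouldTakeSavepoint(set, false))
}
```

Setting the field explicitly, as suggested above, avoids the nil case from the user side; a guard like this one would be the corresponding fix on the operator side.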