Failed to update job status for new job submission: status.components.job.id: required value

jaredstehler commented 3 years ago

I'm seeing jobs fail to start from the operator, with the following in the operator logs (below). This seems related to PR #379.

2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        ---------- 4. Take actions ----------    {"cluster": "doolittle-dev/flink-tailpipe-ingester"}
2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        ConfigMap already exists, no action    {"cluster": "doolittle-dev/flink-tailpipe-ingester"}
2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        Statefulset already exists, no action    {"cluster": "doolittle-dev/flink-tailpipe-ingester", "component": "JobManager"}
2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        JobManager service already exists, no action    {"cluster": "doolittle-dev/flink-tailpipe-ingester"}
2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        Statefulset already exists, no action    {"cluster": "doolittle-dev/flink-tailpipe-ingester", "component": "TaskManager"}
2020-12-16T21:18:24.724Z        INFO    controllers.FlinkCluster        Updating job status to create new job submitter {"cluster": "doolittle-dev/flink-tailpipe-ingester"}
2020-12-16T21:18:24.744Z        ERROR   controllers.FlinkCluster        Failed to update job status for new job submission      {"cluster": "doolittle-dev/flink-tailpipe-ingester", "error": "FlinkCluster.flinkoperator.k8s.io \"flink-tailpipe-ingester\" is invalid: [status.components.job.id: Required value, status.components.job.name: Required value]", "error": "FlinkCluster.flinkoperator.k8s.io \"flink-tailpipe-ingester\" is invalid: [status.components.job.id: Required value, status.components.job.name: Required value]"}
github.com/go-logr/zapr.(*zapLogger).Error
        /root/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).updateStatusForNewJob
        /workspace/controllers/flinkcluster_reconciler.go:955
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcileJob
        /workspace/controllers/flinkcluster_reconciler.go:448
github.com/googlecloudplatform/flink-operator/controllers.(*ClusterReconciler).reconcile
        /workspace/controllers/flinkcluster_reconciler.go:111
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterHandler).reconcile
        /workspace/controllers/flinkcluster_controller.go:220
github.com/googlecloudplatform/flink-operator/controllers.(*FlinkClusterReconciler).Reconcile
        /workspace/controllers/flinkcluster_controller.go:82
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
        /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
        /root/go/pkg/mod/k8s.io/apimachinery@v0.18.3/pkg/util/wait/wait.go:90

shashken commented 3 years ago

Same here. @elanv @functicons, any idea why this is happening?

elanv commented 3 years ago

status.components.job.id was changed from required to optional. Updating the CRD will probably fix this; please check whether the field has changed in your installed CRD.

(https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/379/files#diff-02504b69dc5bab964bf51ad074ccf4b44a22ffb1e3592c93397d26460963e34aL5163-L5170)

jaredstehler commented 3 years ago

But this isn't a user-specified field, so this would be a bug in the operator?

The CRD diffs actually show these two fields switching from required to optional?

 type JobStatus struct {
        // The name of the Kubernetes job resource.
-       Name string `json:"name"`
+       Name string `json:"name,omitempty"`

        // The ID of the Flink job.
-       ID string `json:"id"`
+       ID string `json:"id,omitempty"`
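To illustrate why this tag change matters, here is a minimal sketch of how `omitempty` changes what gets serialized when the job ID and name are still empty. The type names `jobStatusBefore` and `jobStatusAfter` and the helper `mustJSON` are made up for this example and are not the operator's code. The assumption is that an operator built with the new tags omits the empty fields from the status it sends, which a CRD schema that still lists them as required would then reject.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobStatusBefore mirrors the tags before the change (hypothetical type name):
// empty strings are still written to the JSON output.
type jobStatusBefore struct {
	Name string `json:"name"`
	ID   string `json:"id"`
}

// jobStatusAfter mirrors the tags after the change (hypothetical type name):
// empty strings are dropped from the JSON output.
type jobStatusAfter struct {
	Name string `json:"name,omitempty"`
	ID   string `json:"id,omitempty"`
}

// mustJSON marshals v and panics on error; fine for a short demo.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(jobStatusBefore{})) // {"name":"","id":""}
	fmt.Println(mustJSON(jobStatusAfter{}))  // {}
}
```

With the old tags the keys are always present, so a schema that requires them is satisfied even by empty values; with the new tags the keys disappear until the job has a name and ID.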


elanv commented 3 years ago

You can check whether the CRD has been updated like this:

$ kubectl get crd flinkclusters.flinkoperator.k8s.io -o jsonpath='{.spec.versions[?(@.name=="v1beta1")].schema.openAPIV3Schema.properties.status.properties.components.properties.job}'

{"properties":{"fromSavepoint":{"type":"string"},"id":{"type":"string"},"lastSavepointTime":{"type":"string"},"lastSavepointTriggerID":{"type":"string"},"name":{"type":"string"},"restartCount":{"format":"int32","type":"integer"},"savepointGeneration":{"format":"int32","type":"integer"},"savepointLocation":{"type":"string"},"state":{"type":"string"}},"required":["state"],"type":"object"}

If it has been updated, the required list should look like this: "required":["state"]

If the CRD has not been updated yet, you can update it by running: make install
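The check above can also be scripted. The following is a rough sketch, not part of the operator: it parses the JSON fragment printed by the kubectl command and reports whether "id" or "name" is still listed as required. The function name `staleRequired` and the type `jobSchema` are hypothetical, and the sample input in `main` is a shortened version of the output shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobSchema holds the only part of the schema fragment this check needs.
type jobSchema struct {
	Required []string `json:"required"`
}

// staleRequired returns which of "id" and "name" the schema fragment
// still lists under "required". An empty result means the CRD is updated.
func staleRequired(schemaJSON string) ([]string, error) {
	var s jobSchema
	if err := json.Unmarshal([]byte(schemaJSON), &s); err != nil {
		return nil, err
	}
	var stale []string
	for _, field := range s.Required {
		if field == "id" || field == "name" {
			stale = append(stale, field)
		}
	}
	return stale, nil
}

func main() {
	// Shortened copy of the jsonpath output for an updated CRD.
	updated := `{"properties":{"id":{"type":"string"},"name":{"type":"string"},"state":{"type":"string"}},"required":["state"],"type":"object"}`

	stale, err := staleRequired(updated)
	if err != nil {
		panic(err)
	}
	if len(stale) == 0 {
		fmt.Println("CRD is up to date")
	} else {
		fmt.Println("CRD still requires:", stale)
	}
}
```

In practice the input would be the kubectl output piped in or read from a file; it is hard-coded here to keep the sketch self-contained.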

shashken commented 3 years ago

I think I found the problem: the CRD in the Helm chart was not updated.
Can you take a look there and update the CRD, please? @elanv

elanv commented 3 years ago

@shashken I missed it, thanks. I opened PR #386.

GoogleCloudPlatform / flink-on-k8s-operator

Failed to update job status for new job submission: status.components.job.id: required value #385