cockroachdb / cockroach-operator

k8s operator for CRDB
Apache License 2.0

Fix bug with Jobs #529

Open chrislovecnm opened 3 years ago

chrislovecnm commented 3 years ago

Occasionally we hit a bug while looping to find a job. Here is the error I am getting:

    logger.go:130: 2021-05-27T17:00:56.697Z WARN    job pod is ready    {"action": "Crdb Version Validator"}
    logger.go:130: 2021-05-27T17:00:56.782Z WARN    completed version checker   {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "calVersion": "v20.2.8", "containerImage": "cockroachdb/cockroach:v20.2.8"}
    logger.go:130: 2021-05-27T17:00:56.782Z INFO    request was interrupted {"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z INFO    reconciling CockroachDB cluster {"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z INFO    Running action with index: 0 and  name: Decommission    {"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z WARN    check decommission oportunities {"action": "decommission", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z INFO    replicas decommisioning {"action": "decommission", "CrdbCluster": "crdb-test-pxntwh/crdb", "status.CurrentReplicas": 3, "expected": 3}
    logger.go:130: 2021-05-27T17:00:56.782Z INFO    Running action with index: 1 and  name: VersionCheckerAction    {"CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z WARN    starting to check the crdb version of the container provided    {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.782Z WARN    User set image.name, using that field instead of cockroachDBVersion {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb"}
    logger.go:130: 2021-05-27T17:00:56.794Z ERROR   failed to reconcile job only err    {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "error": "Job.batch \"crdb-vcheck-27035580\" is invalid: spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"app.kubernetes.io/component\":\"database\", \"app.kubernetes.io/instance\":\"crdb\", \"app.kubernetes.io/name\":\"cockroachdb\", \"controller-uid\":\"a0255182-4d4a-4c98-af46-1cf5eee46a3e\", \"job-name\":\"crdb-vcheck-27035580\"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume(nil), InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"crdb\", Image:\"cockroachdb/cockroach:v20.2.9\", Command:[]string{\"/bin/bash\"}, Args:[]string{\"-c\", \"/cockroach/cockroach.sh version | grep 'Build Tag:'| awk '{print $3}'; sleep 150\"}, WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList(nil), Requests:core.ResourceList(nil)}, VolumeMounts:[]core.VolumeMount(nil), VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, 
    EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc0158f7b20), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"cockroach-database-sa\", AutomountServiceAccountToken:(*bool)(0xc0158f7b28), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc01b036500), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil)}}: field is immutable"}
    logger.go:130: 2021-05-27T17:00:56.794Z WARN    version checker {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "job": "crdb-vcheck-27035580"}
    logger.go:130: 2021-05-27T17:00:56.799Z WARN    job pod is ready    {"action": "Crdb Version Validator"}
    logger.go:130: 2021-05-27T17:00:56.883Z WARN    completed version checker   {"action": "Crdb Version Validator", "CrdbCluster": "crdb-test-pxntwh/crdb", "calVersion": "v20.2.8", "containerImage": "cockroachdb/cockroach:v20.2.8"}

The operator recovers, but this error will look confusing to an end user.

chrislovecnm commented 3 years ago

@alinadonisa @keith-mcclellan PTAL

alinadonisa commented 3 years ago

@chrislovecnm the job already has Image:"cockroachdb/cockroach:v20.2.9" and you are reconciling for version "v20.2.8". What scenario are you running? If you run things in parallel, or reconcile more than once within the same minute, the same timestamp is generated and the job name will be identical across different runs.

chrislovecnm commented 3 years ago

It happens occasionally while running our e2e tests.

chrislovecnm commented 3 years ago

@davidwding can we close this?