Terminal states are TERMINATED, SKIPPED and INTERNAL_ERROR. The other states are PENDING, RUNNING and TERMINATING. But in certain circumstances I see that a Run (which I can see in Databricks through the web UI) still shows State: <blank> and RunId: <blank>.
Now, the issue:
This condition works most of the time, but sometimes it doesn't. When the issue happens, the logs show the following error:
2019-11-22T10:03:39.781Z INFO controllers.Run Refreshing run test-run
2019-11-22T10:03:49.781Z ERROR controller-runtime.controller Reconciler error {"controller": "run", "request": "kubeflow/test-run", "error": "error when refreshing run: Get https://westeurope.azuredatabricks.net/api/2.0/jobs/runs/get-output?run_id=18: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
So I guess there is a bug in the operator: when it gets that exception, it does not set life_cycle_state to a valid value, and the step that creates the run does not update the K8s status correctly in an error state.
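To illustrate the behaviour I would expect, here is a minimal sketch, not the operator's actual code. `RunStatus`, `refreshStatus` and `errTimeout` are hypothetical names made up for this example. The idea is that a failed refresh keeps the last known run ID and state and records the error, rather than leaving the status blank.

```go
package main

import (
	"errors"
	"fmt"
)

// RunStatus is a hypothetical, simplified stand-in for the status
// stored on the Run resource. It is not the operator's real type.
type RunStatus struct {
	RunID          int64
	LifeCycleState string
	Message        string
}

// errTimeout imitates the client timeout seen in the log above.
var errTimeout = errors.New("net/http: request canceled (Client.Timeout exceeded while awaiting headers)")

// refreshStatus calls fetch and returns the status that should be persisted.
// On failure it keeps the last known RunID and LifeCycleState, records the
// error in Message, and returns the error so the caller can requeue.
func refreshStatus(last RunStatus, fetch func() (RunStatus, error)) (RunStatus, error) {
	fresh, err := fetch()
	if err != nil {
		last.Message = fmt.Sprintf("error when refreshing run: %v", err)
		return last, err
	}
	return fresh, nil
}
```

With something like this, a timeout on run 18 would leave the previously stored state in place and surface the error message in the status, while still returning the error so the reconcile is retried.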
For reference, the lifecycle states for a run are documented here: https://docs.databricks.com/dev-tools/api/latest/jobs.html#runlifecyclestate
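As a rough sketch of how the states listed earlier could be told apart from the blank value I am seeing, the helper below classifies a state string. `classify` and the two maps are hypothetical names for illustration only; the state names are the six mentioned in this issue.

```go
package main

// terminalStates holds the states this issue names as terminal.
var terminalStates = map[string]bool{
	"TERMINATED":     true,
	"SKIPPED":        true,
	"INTERNAL_ERROR": true,
}

// activeStates holds the remaining, non-terminal states.
var activeStates = map[string]bool{
	"PENDING":     true,
	"RUNNING":     true,
	"TERMINATING": true,
}

// classify returns "terminal", "active", or "unknown".
// A blank state, as in the failing case, is reported as "unknown".
func classify(state string) string {
	switch {
	case terminalStates[state]:
		return "terminal"
	case activeStates[state]:
		return "active"
	default:
		return "unknown"
	}
}
```

An "unknown" result for a Run that exists in Databricks would point at the status never having been written, which matches the symptom described above.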