Azure / azure-databricks-operator

Kubernetes Operator for Databricks
MIT License

Crash when submitting djob and run simultaneously with nil pointer dereference #136

Closed joshagudo closed 4 years ago

joshagudo commented 4 years ago

If you submit a Djob and a Run that references it at the same time, the operator crashes with a nil pointer dereference. Submitting both together is our use case, and unfortunately we cannot avoid it.

The desired behaviour would be similar to the handling of secretscopes when the underlying secret does not exist: rather than crashing, the operator should report an error in the logs and retry on the next reconcile cycle.

To reproduce, apply the following Djob together with a Run that references it:

apiVersion: databricks.microsoft.com/v1alpha1
kind: Djob
metadata:
  name: device-pessl
  namespace: dx
spec:
  new_cluster:
    spark_version: 5.3.x-scala2.11
    spark_conf:
      spark.databricks.delta.preview.enabled: "true"
    node_type_id: Standard_DS3_v2
    spark_env_vars:
      PYSPARK_PYTHON: '/databricks/python3/bin/python3'
    num_workers: 1
  notebook_task:
    notebook_path: "/Shared/notebooks/stream_builder-2.24.0"
  max_retries: 3

---
apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: device-pessl-run
  namespace: dx
spec:
  job_name: device-pessl
  notebook_params:
    job_name: device-pessl

run: kubectl apply -f job_and_run.yaml

output:

2019-12-17T10:03:27.791+1100    INFO    controllers.Djob    Starting reconcile loop for dx/device-pessl
2019-12-17T10:03:27.791+1100    INFO    controllers.Djob    Submit for dx/device-pessl
2019-12-17T10:03:27.791+1100    INFO    controllers.Djob    Submitting job device-pessl
2019-12-17T10:03:27.821+1100    DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "run", "request": "dx/device-pessl-run"}
2019-12-17T10:03:27.822+1100    INFO    controllers.Run Submitting run device-pessl-run
2019-12-17T10:03:27.821+1100    DEBUG   controller-runtime.manager.events   Normal  {"object": {"kind":"Run","namespace":"dx","name":"device-pessl-run","uid":"d94ff46e-6f39-4759-a6ed-3a18525fbdeb","apiVersion":"databricks.microsoft.com/v1alpha1","resourceVersion":"56723"}, "reason": "Added", "message": "Object finalizer is added"}
E1217 10:03:27.822531   47051 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 357 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x201b080, 0x3062de0)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:48 +0x82
panic(0x201b080, 0x3062de0)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).submit(0xc000290240, 0xc000278000, 0x1, 0x21ce575)
    /Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller_databricks.go:71 +0x530
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).Reconcile(0xc000290240, 0xc0000cacfa, 0x2, 0xc0000cace0, 0x10, 0x307c760, 0x0, 0xc000054540, 0xc000105de8)
    /Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller.go:80 +0x272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001980c0, 0x20687a0, 0xc0001ceea0, 0x2068700)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256 +0x146
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001980c0, 0xc0005b2100)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001980c0)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005f5330)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005f5330, 0x3b9aca00, 0x0, 0x1, 0xc0000ae300)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005f5330, 0x3b9aca00, 0xc0000ae300)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:193 +0x326
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e93220]

goroutine 357 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/runtime/runtime.go:55 +0x105
panic(0x201b080, 0x3062de0)
    /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).submit(0xc000290240, 0xc000278000, 0x1, 0x21ce575)
    /Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller_databricks.go:71 +0x530
github.com/microsoft/azure-databricks-operator/controllers.(*RunReconciler).Reconcile(0xc000290240, 0xc0000cacfa, 0x2, 0xc0000cace0, 0x10, 0x307c760, 0x0, 0xc000054540, 0xc000105de8)
    /Users/d886442/projects/data-exchange/azure-databricks-operator/controllers/run_controller.go:80 +0x272
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001980c0, 0x20687a0, 0xc0001ceea0, 0x2068700)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256 +0x146
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001980c0, 0xc0005b2100)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0001980c0)
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005f5330)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005f5330, 0x3b9aca00, 0x0, 0x1, 0xc0000ae300)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0005f5330, 0x3b9aca00, 0xc0000ae300)
    /Users/d886442/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
    /Users/d886442/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:193

joshagudo commented 4 years ago

@Azadehkhojandi I'd be happy to work on this if that's OK. Please let me know.