Run CRD reports incorrect state information when Run fails in DataBricks

storey247 commented 4 years ago

Create a job & run using sample databricks_v1alpha1_djob.yaml
The job will not work as the jar file specified in the spec is invalid.
Try and create a Run object for this job using the databricks_v1alpha1_run_job.yaml
The run will fail because the jar file is invalid, but the Operator throws an exception and never updates its state.

Log shows the following error:

2019-11-12T14:36:56.612Z        ERROR   controller-runtime.controller   Reconciler error        {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: Run result unavailable: job failed with error message\n Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88

This error then proceeds to keep recurring in the logs

Expected behaviour:

No error reported
Correct status reported

storey247 commented 4 years ago

Example of the JSON being returned:

{
    "metadata": {
        "job_id": 8,
        "run_id": 4,
        "number_in_job": 2,
        "original_attempt_run_id": 4,
        "state": {
            "life_cycle_state": "INTERNAL_ERROR",
            "result_state": "FAILED",
            "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
        },
        "task": {
            "spark_jar_task": {
                "jar_uri": "",
                "main_class_name": "com.databricks.ComputeModels",
                "run_as_repl": true
            }
        },
        "cluster_spec": {
            "new_cluster": {
                "spark_version": "5.3.x-scala2.11",
                "node_type_id": "Standard_D3_v2",
                "enable_elastic_disk": true,
                "num_workers": 2
            },
            "libraries": [
                {
                    "jar": "dbfs:/my-jar.jar"
                },
                {
                    "maven": {
                        "coordinates": "org.jsoup:jsoup:1.7.2"
                    }
                }
            ]
        },
        "cluster_instance": {
            "cluster_id": "1115-062056-gofer677",
            "spark_context_id": "5848551514166302240"
        },
        "start_time": 1573798854850,
        "setup_duration": 283000,
        "execution_duration": 0,
        "cleanup_duration": 17000,
        "trigger": "RETRY",
        "creator_user_name": "alejacma@microsoft.com",
        "run_name": "test-job3-1573739737.9141724",
        "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/8/run/2",
        "run_type": "JOB_RUN"
    },
    "error": "Run result unavailable: job failed with error message\n Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
}

storey247 commented 4 years ago

initial thoughts are that this line of code may be causing the issue: https://github.com/microsoft/azure-databricks-operator/blob/master/controllers/run_controller_databricks.go#L107

transformers-collector commented 4 years ago

This issue is also happening when the run finishes successfully in Databricks. Example of the JSON being returned:

{
  "metadata": {
    "job_id": 17,
    "run_id": 13,
    "number_in_job": 1,
    "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
    },
    "task": {
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "org.apache.spark.examples.SparkPi",
        "parameters": [
          "10"
        ],
        "run_as_repl": true
      }
    },
    "cluster_spec": {
      "new_cluster": {
        "spark_version": "5.3.x-scala2.11",
        "node_type_id": "Standard_D3_v2",
        "enable_elastic_disk": true,
        "num_workers": 2
      },
      "libraries": [
        {
          "jar": "dbfs:/docs/sparkpi.jar"
        }
      ]
    },
    "cluster_instance": {
      "cluster_id": "1118-111134-haler704",
      "spark_context_id": "8830085876111149690"
    },
    "start_time": 1574075492789,
    "setup_duration": 330000,
    "execution_duration": 45000,
    "cleanup_duration": 9000,
    "creator_user_name": "alejacma@microsoft.com",
    "run_name": "test-run2",
    "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/17/run/1",
    "run_type": "SUBMIT_RUN"
  },
  "error": "Please refer to the logs for this cluster in the clusters page."
}

Note that even if the run TERMINATED with SUCCESS, and we can see the output of the jar file without issues, error value is still set.

stuartleeks commented 4 years ago

As @storey247 mentioned, this line is looking at whether there is an error status returned from the run status and turning that into an error in the reconcile loop.

I'm assuming that the behaviour here should be to not return an error from the reconcile loop but to update the run status to include the error that is in the api response so that users can run kubectl describe run myrun to see the run error?

stuartleeks commented 4 years ago

Picking this up against the current master code (commit 8a8145)

Deploy run from config/samples/databricks_v1alpha1_run_direct.yaml:

$ kubectl  apply  -f config/samples/databricks_v1alpha1_run_direct.yaml 
$ kubectl get run
NAME         AGE   RUNID   STATE
run-sample   4s    13      PENDING

After a while the job can be seen to be in the error state in the Databricks API.

export JOB_ID=13 # taken from kubectl get run output above
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" $DATABRICKS_HOST/api/2.0/jobs/runs/get?run_id=$JOB_ID

Which gives output such as the following (Note the life_cycle_state: INTERNAL_ERROR)

{
  "job_id": 22,
  "run_id": 13,
  "number_in_job": 1,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
    "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
  },
  // rest omitted for brevity
}

At this point the run status is still shown as PENDING (i.e. not reflecting the INTERNAL_ERROR):

$ kubectl get run -w
NAME         AGE   RUNID   STATE
run-sample   8m    13      PENDING

And the operator logs have this error repeating:

2019-12-02T15:42:38.066Z    ERROR   controller-runtime.controller   Reconciler error    {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: No output is available until the task begins."}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88

Azure / azure-databricks-operator

Run CRD reports incorrect state information when Run fails in DataBricks #105