Azure / azure-databricks-operator

Kubernetes Operator for Databricks
MIT License
113 stars 48 forks source link

Run CRD reports incorrect state information when Run fails in DataBricks #105

Closed storey247 closed 4 years ago

storey247 commented 4 years ago

Expected behaviour:

storey247 commented 4 years ago

Example of the JSON being returned:

{
    "metadata": {
        "job_id": 8,
        "run_id": 4,
        "number_in_job": 2,
        "original_attempt_run_id": 4,
        "state": {
            "life_cycle_state": "INTERNAL_ERROR",
            "result_state": "FAILED",
            "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
        },
        "task": {
            "spark_jar_task": {
                "jar_uri": "",
                "main_class_name": "com.databricks.ComputeModels",
                "run_as_repl": true
            }
        },
        "cluster_spec": {
            "new_cluster": {
                "spark_version": "5.3.x-scala2.11",
                "node_type_id": "Standard_D3_v2",
                "enable_elastic_disk": true,
                "num_workers": 2
            },
            "libraries": [
                {
                    "jar": "dbfs:/my-jar.jar"
                },
                {
                    "maven": {
                        "coordinates": "org.jsoup:jsoup:1.7.2"
                    }
                }
            ]
        },
        "cluster_instance": {
            "cluster_id": "1115-062056-gofer677",
            "spark_context_id": "5848551514166302240"
        },
        "start_time": 1573798854850,
        "setup_duration": 283000,
        "execution_duration": 0,
        "cleanup_duration": 17000,
        "trigger": "RETRY",
        "creator_user_name": "alejacma@microsoft.com",
        "run_name": "test-job3-1573739737.9141724",
        "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/8/run/2",
        "run_type": "JOB_RUN"
    },
    "error": "Run result unavailable: job failed with error message\n Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
}
storey247 commented 4 years ago

initial thoughts are that this line of code may be causing the issue: https://github.com/microsoft/azure-databricks-operator/blob/master/controllers/run_controller_databricks.go#L107

transformers-collector commented 4 years ago

This issue is also happening when the run finishes successfully in Databricks. Example of the JSON being returned:

{
  "metadata": {
    "job_id": 17,
    "run_id": 13,
    "number_in_job": 1,
    "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
    },
    "task": {
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "org.apache.spark.examples.SparkPi",
        "parameters": [
          "10"
        ],
        "run_as_repl": true
      }
    },
    "cluster_spec": {
      "new_cluster": {
        "spark_version": "5.3.x-scala2.11",
        "node_type_id": "Standard_D3_v2",
        "enable_elastic_disk": true,
        "num_workers": 2
      },
      "libraries": [
        {
          "jar": "dbfs:/docs/sparkpi.jar"
        }
      ]
    },
    "cluster_instance": {
      "cluster_id": "1118-111134-haler704",
      "spark_context_id": "8830085876111149690"
    },
    "start_time": 1574075492789,
    "setup_duration": 330000,
    "execution_duration": 45000,
    "cleanup_duration": 9000,
    "creator_user_name": "alejacma@microsoft.com",
    "run_name": "test-run2",
    "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/17/run/1",
    "run_type": "SUBMIT_RUN"
  },
  "error": "Please refer to the logs for this cluster in the clusters page."
}

Note that even if the run TERMINATED with SUCCESS, and we can see the output of the jar file without issues, error value is still set.

stuartleeks commented 4 years ago

As @storey247 mentioned, this line is looking at whether there is an error status returned from the run status and turning that into an error in the reconcile loop.

I'm assuming that the behaviour here should be to not return an error from the reconcile loop but to update the run status to include the error that is in the api response so that users can run kubectl describe run myrun to see the run error?

stuartleeks commented 4 years ago

Picking this up against the current master code (commit 8a8145)

Deploy run from config/samples/databricks_v1alpha1_run_direct.yaml:

$ kubectl  apply  -f config/samples/databricks_v1alpha1_run_direct.yaml 
$ kubectl get run
NAME         AGE   RUNID   STATE
run-sample   4s    13      PENDING

After a while the job can be seen to be in the error state in the Databricks API.

export JOB_ID=13 # taken from kubectl get run output above
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" $DATABRICKS_HOST/api/2.0/jobs/runs/get?run_id=$JOB_ID

Which gives output such as the following (Note the life_cycle_state: INTERNAL_ERROR)

{
  "job_id": 22,
  "run_id": 13,
  "number_in_job": 1,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
    "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
  },
  // rest omitted for brevity
}

At this point the run status is still shown as PENDING (i.e. not reflecting the INTERNAL_ERROR):

$ kubectl get run -w
NAME         AGE   RUNID   STATE
run-sample   8m    13      PENDING

And the operator logs have this error repeating:

2019-12-02T15:42:38.066Z    ERROR   controller-runtime.controller   Reconciler error    {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: No output is available until the task begins."}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
    /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88