Closed — storey247 closed this issue 4 years ago
Example of the JSON being returned:
```json
{
  "metadata": {
    "job_id": 8,
    "run_id": 4,
    "number_in_job": 2,
    "original_attempt_run_id": 4,
    "state": {
      "life_cycle_state": "INTERNAL_ERROR",
      "result_state": "FAILED",
      "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
    },
    "task": {
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "com.databricks.ComputeModels",
        "run_as_repl": true
      }
    },
    "cluster_spec": {
      "new_cluster": {
        "spark_version": "5.3.x-scala2.11",
        "node_type_id": "Standard_D3_v2",
        "enable_elastic_disk": true,
        "num_workers": 2
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        },
        {
          "maven": {
            "coordinates": "org.jsoup:jsoup:1.7.2"
          }
        }
      ]
    },
    "cluster_instance": {
      "cluster_id": "1115-062056-gofer677",
      "spark_context_id": "5848551514166302240"
    },
    "start_time": 1573798854850,
    "setup_duration": 283000,
    "execution_duration": 0,
    "cleanup_duration": 17000,
    "trigger": "RETRY",
    "creator_user_name": "alejacma@microsoft.com",
    "run_name": "test-job3-1573739737.9141724",
    "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/8/run/2",
    "run_type": "JOB_RUN"
  },
  "error": "Run result unavailable: job failed with error message\n Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
}
```
Initial thoughts are that this line of code may be causing the issue: https://github.com/microsoft/azure-databricks-operator/blob/master/controllers/run_controller_databricks.go#L107
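For context, the referenced line appears to turn any non-empty `error` field in the runs/get response into a hard reconcile failure. A minimal sketch of that pattern, using illustrative type and function names rather than the operator's actual ones:

```go
package main

import (
	"errors"
	"fmt"
)

// RunOutput models the relevant fields of the Databricks
// jobs/runs/get response (illustrative stand-in, not the operator's type).
type RunOutput struct {
	LifeCycleState string // e.g. "TERMINATED", "INTERNAL_ERROR"
	ResultState    string // e.g. "SUCCESS", "FAILED"
	Error          string // set even for successful runs, per the JSON above
}

// refreshRun mimics the suspected behaviour: any non-empty Error field
// is surfaced as a reconcile error, even when the run itself succeeded.
func refreshRun(out RunOutput) error {
	if out.Error != "" {
		return errors.New("error when refreshing run: " + out.Error)
	}
	return nil
}

func main() {
	ok := RunOutput{
		LifeCycleState: "TERMINATED",
		ResultState:    "SUCCESS",
		Error:          "Please refer to the logs for this cluster in the clusters page.",
	}
	// A successful run still yields an error, matching the symptom above.
	fmt.Println(refreshRun(ok))
}
```

This would explain why both the failed and the successful runs shown here end up as Reconciler errors.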
This issue is also happening when the run finishes successfully in Databricks. Example of the JSON being returned:
```json
{
  "metadata": {
    "job_id": 17,
    "run_id": 13,
    "number_in_job": 1,
    "state": {
      "life_cycle_state": "TERMINATED",
      "result_state": "SUCCESS",
      "state_message": ""
    },
    "task": {
      "spark_jar_task": {
        "jar_uri": "",
        "main_class_name": "org.apache.spark.examples.SparkPi",
        "parameters": [
          "10"
        ],
        "run_as_repl": true
      }
    },
    "cluster_spec": {
      "new_cluster": {
        "spark_version": "5.3.x-scala2.11",
        "node_type_id": "Standard_D3_v2",
        "enable_elastic_disk": true,
        "num_workers": 2
      },
      "libraries": [
        {
          "jar": "dbfs:/docs/sparkpi.jar"
        }
      ]
    },
    "cluster_instance": {
      "cluster_id": "1118-111134-haler704",
      "spark_context_id": "8830085876111149690"
    },
    "start_time": 1574075492789,
    "setup_duration": 330000,
    "execution_duration": 45000,
    "cleanup_duration": 9000,
    "creator_user_name": "alejacma@microsoft.com",
    "run_name": "test-run2",
    "run_page_url": "https://westeurope.azuredatabricks.net/?o=4914192492341668#job/17/run/1",
    "run_type": "SUBMIT_RUN"
  },
  "error": "Please refer to the logs for this cluster in the clusters page."
}
```
Note that even though the run `TERMINATED` with `SUCCESS`, and we can see the output of the jar file without issues, the `error` value is still set.
As @storey247 mentioned, this line looks at whether an error is returned in the run status from the API and turns it into an error in the reconcile loop.
I'm assuming the behaviour here should be to not return an error from the reconcile loop, but instead to update the run status to include the error from the API response, so that users can run `kubectl describe run myrun` to see the run error?
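A sketch of that suggested behaviour, again with illustrative types rather than the operator's actual ones: the reconciler copies the API error onto the Run's status (where `kubectl describe` would surface it) and returns nil, so a failed Databricks run settles into a terminal state instead of being requeued as a reconcile failure forever:

```go
package main

import "fmt"

// RunStatus is an illustrative stand-in for the Run CRD's status subresource.
type RunStatus struct {
	State string // e.g. "PENDING", "TERMINATED", "INTERNAL_ERROR"
	Error string // would be surfaced via `kubectl describe run myrun`
}

// APIResponse models the fields of interest from jobs/runs/get.
type APIResponse struct {
	LifeCycleState string
	ResultState    string
	Error          string
}

// syncStatus records the API state on the status instead of returning a
// reconcile error: a failed run is a terminal state of the resource, not
// a transient failure that the controller should retry.
func syncStatus(resp APIResponse) (RunStatus, error) {
	status := RunStatus{State: resp.LifeCycleState, Error: resp.Error}
	return status, nil
}

func main() {
	s, err := syncStatus(APIResponse{
		LifeCycleState: "INTERNAL_ERROR",
		ResultState:    "FAILED",
		Error:          "Library installation failed for library jar: \"dbfs:/my-jar.jar\"",
	})
	fmt.Println(s.State, err) // the error lives in the status; reconcile returns nil
}
```

With this shape, the Reconciler error spam in the logs below would disappear and `kubectl get run` would show the real state.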
Picking this up against the current `master` code (commit 8a8145).
Deploy a run from `config/samples/databricks_v1alpha1_run_direct.yaml`:

```shell
$ kubectl apply -f config/samples/databricks_v1alpha1_run_direct.yaml
$ kubectl get run
NAME         AGE   RUNID   STATE
run-sample   4s    13      PENDING
```
After a while, the run can be seen to be in an error state via the Databricks API:
```shell
export JOB_ID=13  # taken from the RUNID column of the kubectl get run output above
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" "$DATABRICKS_HOST/api/2.0/jobs/runs/get?run_id=$JOB_ID"
```
which gives output such as the following (note the `life_cycle_state` of `INTERNAL_ERROR`):
```json
{
  "job_id": 22,
  "run_id": 13,
  "number_in_job": 1,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
    "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
  }
  // rest omitted for brevity
}
```
At this point the run status is still shown as `PENDING` (i.e. not reflecting the `INTERNAL_ERROR`):

```shell
$ kubectl get run -w
NAME         AGE   RUNID   STATE
run-sample   8m    13      PENDING
```
And the operator logs have this error repeating:

```
2019-12-02T15:42:38.066Z  ERROR  controller-runtime.controller  Reconciler error  {"controller": "run", "request": "default/run-sample", "error": "error when refreshing run: No output is available until the task begins."}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.0-beta.4/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88
```
1. Create a job and run using the sample `databricks_v1alpha1_djob.yaml`. The job will not work, as the jar file specified in the spec is invalid.
2. Try to create a Run object for this job using `databricks_v1alpha1_run_job.yaml`. The run will fail because the jar file is invalid, but the operator throws an exception and never updates its state.
The log shows the following error, which then keeps recurring:
Expected behaviour: