Azure / azure-databricks-operator

Kubernetes Operator for Databricks
MIT License

Sl/105 run refresh #118

Closed stuartleeks closed 4 years ago

stuartleeks commented 4 years ago

Fixes #105

To address the issue with not being able to reconcile runs (#105), this PR removes the check that propagated an error reported as part of the run state from the API as a reconciler error. With this PR, the reconciler now completes successfully and the run status is reported in the status of the run CRD instance (and is available via kubectl describe run myrun).
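As a rough illustration of the behaviour described above (not the operator's actual code), the sketch below separates "the run failed" from "the reconciler failed". The names RunState, RunStatus, refreshStatus and errAPIUnavailable are hypothetical and exist only for this example; the assumption is that only a failure to fetch the state should surface as a reconciler error.

```go
package main

import (
	"errors"
	"fmt"
)

// RunState holds the fields of the "state" object reported by the runs API.
type RunState struct {
	LifeCycleState string
	ResultState    string
	StateMessage   string
}

// RunStatus is a hypothetical stand-in for the status block of the run CRD.
type RunStatus struct {
	State   string
	Message string
}

// errAPIUnavailable is a hypothetical error used to simulate a failed API call.
var errAPIUnavailable = errors.New("connection refused")

// refreshStatus copies the state reported by the API into the CRD status.
// A run that ended in an error state is NOT treated as a reconciler error;
// only a failure to fetch the state (fetchErr) is returned as an error.
func refreshStatus(state *RunState, fetchErr error) (RunStatus, error) {
	if fetchErr != nil {
		return RunStatus{}, fmt.Errorf("fetching run state: %w", fetchErr)
	}
	return RunStatus{State: state.LifeCycleState, Message: state.StateMessage}, nil
}

func main() {
	failed := &RunState{
		LifeCycleState: "INTERNAL_ERROR",
		ResultState:    "FAILED",
		StateMessage:   "Library installation failed",
	}

	// The run failed, but the reconcile itself succeeds and records the state.
	status, err := refreshStatus(failed, nil)
	fmt.Println(status.State, err)

	// The API call failed, so this is a genuine reconciler error.
	_, err = refreshStatus(nil, errAPIUnavailable)
	fmt.Println(err)
}
```

With this split, a failed run shows up in the CRD status while the reconcile loop itself keeps reporting success.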

Azadehkhojandi commented 4 years ago

Can you please cherry-pick your changes related to #105 and send PR separately?

stuartleeks commented 4 years ago

/azp run

azure-pipelines[bot] commented 4 years ago

Azure Pipelines successfully started running 1 pipeline(s).

stuartleeks commented 4 years ago

I've rebased this on master and re-tested following the steps outlined in the PR:

$ kubectl get run
NAME         AGE     RUNID   STATE
run-sample   2m23s   15      PENDING

After a while, the job is in the error state:

export JOB_ID=15 # run ID (RUNID column) taken from the kubectl get run output above
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" "$DATABRICKS_HOST/api/2.0/jobs/runs/get?run_id=$JOB_ID"
{
  "job_id": 24,
  "run_id": 15,
  "number_in_job": 1,
  "state": {
    "life_cycle_state": "INTERNAL_ERROR",
    "result_state": "FAILED",
    "state_message": "Library installation failed for library jar: \"dbfs:/my-jar.jar\"\n. Error messages:\njava.lang.Throwable: java.io.FileNotFoundException: dbfs:/my-jar.jar"
  },
// rest omitted for brevity
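For anyone scripting this check instead of reading the curl output by eye, here is a minimal sketch of decoding the fields shown above in Go. The type and function names (runState, runResponse, parseRun) are made up for this example, and only the fields visible in the response above are mapped; the rest of the payload is ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runState maps the "state" object from the response shown above.
type runState struct {
	LifeCycleState string `json:"life_cycle_state"`
	ResultState    string `json:"result_state"`
	StateMessage   string `json:"state_message"`
}

// runResponse maps only the top-level fields visible in the response above.
type runResponse struct {
	JobID int64    `json:"job_id"`
	RunID int64    `json:"run_id"`
	State runState `json:"state"`
}

// parseRun decodes a runs/get response body; unknown fields are ignored.
func parseRun(body []byte) (runResponse, error) {
	var r runResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// Trimmed copy of the response above, used as sample input.
	body := []byte(`{
	  "job_id": 24,
	  "run_id": 15,
	  "state": {
	    "life_cycle_state": "INTERNAL_ERROR",
	    "result_state": "FAILED"
	  }
	}`)

	run, err := parseRun(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("run %d of job %d: %s (%s)\n",
		run.RunID, run.JobID, run.State.LifeCycleState, run.State.ResultState)
}
```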

Once the operator next refreshes the run, the INTERNAL_ERROR state is reflected in the run status:

$ kubectl get run
NAME         AGE   RUNID   STATE
run-sample   10m   15      INTERNAL_ERROR

Observing the operator logs shows that the status is still being reconciled every 30 seconds.