databricks / databricks-sdk-py

Databricks SDK for Python (Beta)
https://databricks-sdk-py.readthedocs.io/
Apache License 2.0

[ISSUE] jobs.get_run does not populate cluster_instance attribute of Run #370

Open PadenZach opened 1 year ago

PadenZach commented 1 year ago

Description

We're using the Python SDK to get information about our runs; however, cluster_instance is not being populated as expected. It is reported as None on the Run object, even though the cluster exists, is running, and its cluster_id can be fetched in other ways.

Reproduction

from databricks.sdk import WorkspaceClient
wc = WorkspaceClient()
run = wc.jobs.get_run(<my_job_run_id>)
run.cluster_instance # BUG: This will evaluate to None.
run.tasks[0].cluster_instance.cluster_id # this will return the correct id.

Expected behavior

A running job with an active cluster should always have a cluster_instance object.

Debug Logs

DEBUG:urllib3.connectionpool:Resetting dropped connection: OUR_URL.cloud.databricks.com
DEBUG:urllib3.connectionpool:https://OUR_URL.cloud.databricks.com:443 "GET /api/2.1/jobs/runs/get?run_id=SNIP HTTP/1.1" 200 None
DEBUG:databricks.sdk:GET /api/2.1/jobs/runs/get?run_id=SNIP
< 200 OK
< {
<   "cleanup_duration": 0,
<   "creator_user_name": "our username",
<   "end_time": 1695861714549,
<   "execution_duration": 188000,
<   "format": "MULTI_TASK",
<   "job_id": SNIP,
<   "number_in_job": SNIP,
<   "run_id": SNIP,
<   "run_name": "run_name",
<   "run_page_url": "https://OUR_URL.cloud.databricks.com/?o=7377027982187242#job/1018422366429564/run/(Our id)... (2 more bytes)",
<   "run_type": "SUBMIT_RUN",
<   "setup_duration": 343000,
<   "start_time": 1695861182715,
<   "state": {
<     "life_cycle_state": "TERMINATED",
<     "result_state": "SUCCESS",
<     "state_message": "",
<     "user_cancelled_or_timedout": false
<   },
<   "tasks": [
<     {
<       "attempt_number": 0,
<       "cleanup_duration": 0,
<       "cluster_instance": {
<         "cluster_id": "our_cluster_id",
<         "spark_context_id": "3576515613131550864"
<       },
<       "end_time": 1695861714429,
<       "execution_duration": 188000,
<       "libraries": [
<         {
<           "pypi": {
<             "package": "boto3-stubs[essential, glue, emr, dynamodb, ec2, ssm]"
<           }
<         },
<         "... (19 additional elements)"
<       ],
<       "new_cluster": {
<         "aws_attributes": {
<           "availability": "SPOT_WITH_FALLBACK",
<           "first_on_demand": 1,
<           "instance_profile_arn": "arn:aws:iam::SNIP",
<           "zone_id": "us-east-1f"
<         },
<         "cluster_log_conf": {
<           "dbfs": {
<             "destination": "dbfs:/SNIP"
<           }
<         },
<         "custom_tags": {
<           "team": "dataeng"
<         },
<         "data_security_mode": "SINGLE_USER",
<         "enable_elastic_disk": true,
<         "init_scripts": [
<           {
<             "s3": {
<               "destination": "SNIP (5 more bytes)",
<               "endpoint": "https://s3.amazonaws.com",
<               "region": "us-east-1"
<             }
<           }
<         ],
<         "node_type_id": "m6id.16xlarge",
<         "num_workers": 3,
<         "spark_conf": {
<           "spark.databricks.delta.allowArbitraryProperties.enabled": "true",
<           "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
<           "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
<           "spark.sql.shuffle.partitions": "600"
<         },
<         "spark_version": "13.3.x-photon-scala2.12"
<       },
<       "run_id": SNIP,
<       "run_if": "ALL_SUCCESS",
<       "run_page_url": "https://dbc-SNIP.cloud.databricks.com/?o=SNIP#job/SNIP/run/SNIP... (2 more bytes)",
<       "setup_duration": 343000,
<       "spark_python_task": {
<         "parameters": [
<           "SNIP",
<           "... (2 additional elements)"
<         ],
<         "python_file": "SNIP"
<       },
<       "start_time": 1695861182724,
<       "state": {
<         "life_cycle_state": "TERMINATED",
<         "result_state": "SUCCESS",
<         "state_message": "",
<         "user_cancelled_or_timedout": false
<       },
<       "task_key": "dagster-task"
<     }
<   ]
< }


mgyucht commented 1 year ago

Hi @PadenZach, thanks for reporting this. I suspect this is because the SDK uses the 2.1 multi-task Jobs API, so it may be by design. Where are you getting the expectation that this field should be populated?

PadenZach commented 1 year ago

Databricks API docs suggest that this should be set:

cluster_id string

The canonical identifier for the cluster used by a run. This field is always available for runs on existing clusters. For runs on new clusters, it becomes available once the cluster is created. This value can be used to view logs by browsing to /#setting/sparkui/$cluster_id/driver-logs. The logs continue to be available after the run completes.

The response won’t include this field if the identifier is not available yet.

If the cluster is up and running, and its cluster id can be viewed in the UI, I don't see why it wouldn't be available yet via the API/SDK. If it's not specified here, when should it actually be populated?

Our use case for it is to:

  1. Launch a job run via the sdk (specifically we use jobs.submit)
  2. Get the cluster id from the job run returned by (1)
  3. Modify the cluster perms, poll cluster events, etc. (see the sketch below)
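
For context, a rough sketch of that workflow as we do it today, working around the missing top-level field by reading the per-task cluster_instance. All paths, node types, and the polling interval below are illustrative placeholders, and the attributes exposed on the Wait object returned by jobs.submit may differ slightly between SDK versions.

import time

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

wc = WorkspaceClient()

# 1. Launch a one-off run on a new cluster (values here are placeholders).
waiter = wc.jobs.submit(
    run_name="example-run",
    tasks=[
        jobs.SubmitTask(
            task_key="main",
            spark_python_task=jobs.SparkPythonTask(python_file="dbfs:/path/to/script.py"),
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",
                node_type_id="m6id.xlarge",
                num_workers=1,
            ),
        )
    ],
)

# 2. Get the cluster id from the run. Run.cluster_instance is None for 2.1
#    multi-task runs, so read it from the task instead. For new clusters it
#    only appears once the cluster has been created, hence the polling loop.
run = wc.jobs.get_run(waiter.response.run_id)
while run.tasks[0].cluster_instance is None:
    time.sleep(10)
    run = wc.jobs.get_run(waiter.response.run_id)
cluster_id = run.tasks[0].cluster_instance.cluster_id

# 3. Use the cluster id, e.g. to poll cluster events.
for event in wc.clusters.events(cluster_id=cluster_id):
    print(event.type, event.timestamp)
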
mgyucht commented 1 year ago

It should definitely be available from the SDK. The issue is that different tasks can run on different clusters, so for multi-task jobs that use multiple clusters, the cluster_instance field doesn't have a sane default. I'll raise this with the jobs team to revisit the documentation to clarify.
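
In the meantime, the per-task field is the reliable place to read it from. A minimal sketch (assuming wc is an authenticated WorkspaceClient and run_id is a placeholder for an existing run id):

from databricks.sdk import WorkspaceClient

wc = WorkspaceClient()
run_id = 123456789  # placeholder: id of an existing run

# Run.cluster_instance is not populated for Jobs API 2.1 (multi-task) runs,
# so collect the cluster used by each task instead.
run = wc.jobs.get_run(run_id)
clusters_by_task = {
    task.task_key: task.cluster_instance.cluster_id
    for task in (run.tasks or [])
    if task.cluster_instance is not None
}
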

PadenZach commented 1 year ago

Ahh, I see. I think another thing that made this confusing is that all of our jobs only ever have a single task, so we assumed the field would still be populated in our case.

It seems that none of the jobs submitted via the 2.1 submit run with new clusters end up being expressed as a "multi-task job".