PadenZach opened this issue 1 year ago (status: open)
Hi @PadenZach, thanks for reporting this. I suspect this is because the SDK uses the Jobs 2.1 multi-task job API, so it may be by design. Where are you getting the expectation that that field should be populated?
Databricks API docs suggest that this should be set:
`cluster_id` (string): The canonical identifier for the cluster used by a run. This field is always available for runs on existing clusters. For runs on new clusters, it becomes available once the cluster is created. This value can be used to view logs by browsing to /#setting/sparkui/$cluster_id/driver-logs. The logs continue to be available after the run completes. The response won't include this field if the identifier is not available yet.
If the cluster is up and running, and a cluster ID is visible for it in the UI, I don't see why it wouldn't be available yet via the API/SDK. If it's not populated here, when should it actually be populated?
Our use case for it is to:
It should definitely be available from the SDK. The issue is that different tasks can run on different clusters, so for multi-task jobs that use multiple clusters there is no sane default for a run-level cluster_instance field. I'll raise this with the jobs team and ask them to clarify the documentation.
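Given that explanation, the cluster ID can be read from each task of the run rather than from the run itself. A minimal sketch of that idea, operating on the parsed Jobs 2.1 `runs/get` response modeled here as plain dicts so it runs without a workspace (field names follow the API docs quoted above; the IDs are invented):

```python
# Collect cluster IDs per task from a Jobs 2.1 runs/get response.
# In the 2.1 multi-task API, cluster_instance lives on each task,
# not on the top-level run, because tasks may use different clusters.

def task_cluster_ids(run: dict) -> dict:
    """Map task_key -> cluster_id for tasks whose cluster already exists."""
    ids = {}
    for task in run.get("tasks", []):
        instance = task.get("cluster_instance")
        if instance and "cluster_id" in instance:
            ids[task["task_key"]] = instance["cluster_id"]
    return ids

# Illustrative response for a single-task run on a new cluster:
run = {
    "run_id": 12345,
    "tasks": [
        {
            "task_key": "main",
            "cluster_instance": {
                "cluster_id": "0000-000000-example1",
                "spark_context_id": "1111111111111111111",
            },
        }
    ],
}

print(task_cluster_ids(run))  # {'main': '0000-000000-example1'}
```

The same traversal applies to the SDK's typed objects, which expose these fields as attributes instead of dict keys.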
Ahhh, I see. Another thing that made this confusing is that all of our jobs only ever had a single task, so we assumed this wouldn't apply to us.
It seems that no jobs submitted via the 2.1 submit-run endpoint with new clusters end up being expressed as a "multi-task job".
Description
We're using the Python SDK to get information about our runs; however, cluster_instance is not being populated as expected. It is reported as None on the run object, even though the cluster exists and is running, and its cluster_id can be fetched in other ways.
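To make the symptom concrete, here is a sketch using stand-ins for the SDK's dataclasses so it runs anywhere (the real types live in the databricks-sdk package; field names mirror the Jobs 2.1 API, and the values are invented):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-ins for the SDK's Run / RunTask / ClusterInstance dataclasses,
# so the symptom can be shown without a live workspace.

@dataclass
class ClusterInstance:
    cluster_id: str

@dataclass
class RunTask:
    task_key: str
    cluster_instance: Optional[ClusterInstance] = None

@dataclass
class Run:
    run_id: int
    cluster_instance: Optional[ClusterInstance] = None  # stays None on 2.1 runs
    tasks: List[RunTask] = field(default_factory=list)

run = Run(
    run_id=12345,
    tasks=[RunTask("main", ClusterInstance("0000-000000-example1"))],
)

print(run.cluster_instance)                      # None  <- the reported symptom
print(run.tasks[0].cluster_instance.cluster_id)  # the ID is on the task
```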
Expected behavior
A running job with an active cluster should always have a cluster_instance object.