dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Dagster-Databricks Pyspark Step Launcher: Permissions Not Getting Set #25056

Open matt-weingarten opened 2 weeks ago

matt-weingarten commented 2 weeks ago

What's the issue?

We're using the DatabricksPysparkStepLauncher to submit job runs for our assets. We want our users to be able to see the running status and output (logs, metrics, etc.) of those jobs, but the submit_run call in the launcher does not appear to apply the permission map that can be passed to the launcher itself.

Ideally, this should be addressed so that users other than the machine user/Dagster runner can see those job outputs.

What did you expect to happen?

Ideally, you'd be able to pass the permissions map through to the Databricks task launch so that the users you've identified can view the job status, output, etc.

How to reproduce?

Run the DatabricksPysparkStepLauncher and view the resulting job in Databricks as a non-admin user.
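For context, here is a minimal sketch of the kind of launcher configuration involved. The `permissions` block is what we would expect to be forwarded to the Databricks run submission; the workspace host, token, paths, and group names below are all hypothetical placeholders, and the exact config schema may differ between Dagster versions.

```python
# Hypothetical config for DatabricksPysparkStepLauncher (placeholder values).
# The "permissions" block is what we expect to be applied to the submitted
# run so that non-admin users can see job status and output.
launcher_config = {
    "run_config": {
        "run_name": "my_asset_step",
        "cluster": {"existing": "example-cluster-id"},  # hypothetical cluster id
    },
    "permissions": {
        # Grant a (hypothetical) analyst group view access to the job run
        # and attach access to the cluster.
        "job_permissions": {"CAN_VIEW": [{"group_name": "data-analysts"}]},
        "cluster_permissions": {"CAN_ATTACH_TO": [{"group_name": "data-analysts"}]},
    },
    "databricks_host": "https://example.cloud.databricks.com",
    "databricks_token": "dapiXXXX",  # placeholder, never commit real tokens
    "local_pipeline_package_path": ".",
    "staging_prefix": "/dbfs/tmp/dagster-staging",
}
```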

Dagster version

1.5.13

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization. By submitting this issue, you agree to follow Dagster's Code of Conduct.

Gleb-appgrowth commented 1 week ago

I see that this is a new issue. Interestingly, our team is facing a similar issue, and we've spent two days trying to figure out the cause.

We're on Dagster 1.8.7, using DatabricksPySparkStepLauncher to launch a Spark job on a Databricks cluster. In 1.8.7 the DatabricksPySparkStepLauncher class does accept a `permissions` keyword argument, so the option exists in newer versions, but it does not appear to work properly.

The problem: we try to create a new cluster for each job with specific permissions, but it consistently fails with the following error: `dagster._check.functions.CheckError: Failure condition: Databricks run {databricks_run_id} has null cluster_instance`.

Looking at the source in `dagster_databricks/databricks_pyspark_step_launcher`, the error is raised by this line: `check.failed("Databricks run {databricks_run_id} has null cluster_instance")`.

_Problem: for some reason, run_info.cluster_instance is never set where it should be; it is always None. This check is only triggered when the permissions argument is set. If that argument is an empty dict, the check is skipped and the job runs successfully._

Moreover, in `dagster_databricks/databricks.py` there is a comment on line 575 noting that `Run.cluster_instance` can be None. However, judging by my logs, `run.tasks[0].cluster_instance` (which acts as the fallback) is also None.

In short, newer Dagster versions expose a dedicated `permissions` argument, but it does not appear to work properly.

Please let me know whether my case should be reported as a separate GitHub issue.
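The failure mode described above can be sketched as follows. The dataclasses are simplified stand-ins for the Databricks run objects, and `resolve_cluster_id` is an illustrative reconstruction of the lookup-with-fallback behavior, not the actual dagster-databricks code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified stand-ins for the Databricks run objects discussed above.
@dataclass
class Task:
    cluster_instance: Optional[str] = None

@dataclass
class Run:
    cluster_instance: Optional[str] = None  # can legitimately be None
    tasks: List[Task] = field(default_factory=list)

def resolve_cluster_id(run: Run, databricks_run_id: int) -> str:
    """Mirror the described lookup: top-level cluster_instance first,
    then the first task's cluster_instance, otherwise fail."""
    if run.cluster_instance is not None:
        return run.cluster_instance
    if run.tasks and run.tasks[0].cluster_instance is not None:
        return run.tasks[0].cluster_instance
    # This is where the reported CheckError fires: both values are None.
    raise RuntimeError(
        f"Databricks run {databricks_run_id} has null cluster_instance"
    )
```

Per the logs above, both the top-level field and the task-level fallback come back None when permissions are set, so the final branch is always reached.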

matt-weingarten commented 1 week ago


This is the crux of our issue as well. We worked around it on our end by overriding the function where that check happens, but the PR I've linked will (in theory) solve it.
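One possible shape for such a workaround is sketched below: instead of failing on the first null read, poll the run until `cluster_instance` is populated before applying permissions. The function and parameter names are hypothetical, and `get_run` stands in for a caller-supplied wrapper around the Databricks runs/get API; this is not the actual dagster-databricks override.

```python
import time

def wait_for_cluster_instance(get_run, databricks_run_id, timeout_s=300, poll_s=5):
    """Poll a Databricks run until its cluster_instance is populated.

    `get_run` is a caller-supplied callable (e.g. wrapping the Databricks
    runs/get API) returning an object with `cluster_instance` and `tasks`
    attributes. Raises TimeoutError if nothing appears within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = get_run(databricks_run_id)
        # Same fallback order as the failing check: run-level first, then task-level.
        cluster = run.cluster_instance or (
            run.tasks[0].cluster_instance if getattr(run, "tasks", None) else None
        )
        if cluster is not None:
            return cluster
        time.sleep(poll_s)
    raise TimeoutError(
        f"Databricks run {databricks_run_id} never reported a cluster_instance"
    )
```

The design point is simply that `cluster_instance` is populated asynchronously by Databricks shortly after run submission, so a single immediate read is racy; retrying bounded by a timeout avoids the hard failure without masking genuinely broken runs.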