databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Python model doesn't have cancel method #684

Closed gaoshihang closed 3 weeks ago

gaoshihang commented 1 month ago

Describe the bug

We found that when a dbt run is canceled from the terminal, Python models have no cancel method, so the job cluster on Databricks keeps running.

We think this could waste resources and cost, and could cause data errors in some cases.

Steps To Reproduce

1. Include a SQL model and a Python model in the project.
2. Execute them together.
3. Use Control + C to exit the dbt execution.
4. The SQL model running on the SQL warehouse is canceled successfully, but the Python model on the job cluster keeps running.
5. The dbt execution hangs until the Python model finishes.

Expected behavior

All models should be canceled, whether SQL or Python.

Screenshots and log output

(screenshot attached)

System information

(screenshot attached)

The operating system you're using:

The output of python --version:


benc-db commented 1 month ago

Thanks for the report; I had a suspicion this might be the case. I guess we need to figure out where dbt catches the cancel and put a call to the Jobs API there.
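For reference, the "call to the Jobs API" would be something like this (untested sketch, not adapter code; it assumes we already know the run_id of the Python model's job run):

```python
# Sketch only: cancel a Databricks job run via the Jobs 2.1 REST API.
# Assumes host, token, and the run_id of the Python model's run are known.
import requests


def cancel_python_model_run(host: str, token: str, run_id: int) -> None:
    response = requests.post(
        f"https://{host}/api/2.1/jobs/runs/cancel",
        headers={"Authorization": f"Bearer {token}"},
        json={"run_id": run_id},
        timeout=30,
    )
    response.raise_for_status()
```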

gaoshihang commented 1 month ago

Hi @benc-db, thanks for the reply! Do you have any idea where dbt catches it? Maybe I can try to do something about this.

benc-db commented 1 month ago

I don't see a place offhand, so I will reach out to the dbt team.

gaoshihang commented 1 month ago

Thank you very much! Please let me know too!

gaoshihang commented 1 month ago

Hi @benc-db, I think dbt executes the cancel action here: https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/runnable.py#L354

Could we add a list that records all the Python model jobs in DatabricksAdapter? https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/impl.py#L145

And then use the Databricks API to kill them at https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/runnable.py#L354

benc-db commented 1 month ago

I think if we override cancel_open() to call super and then call the Jobs API, that should do it. Right now we inherit cancel_open from SQLConnectionManager.
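Roughly what I have in mind (sketch only; `_python_run_ids` is a made-up registry of submitted run_ids, and `cancel_python_model_run` is the Jobs API helper sketched further up, neither of which exists in the adapter today):

```python
# Sketch only: extend cancel_open() so it also cancels Python model job runs.
from dbt.adapters.spark.connections import SparkConnectionManager


class DatabricksConnectionManager(SparkConnectionManager):
    _python_run_ids: set = set()  # hypothetical: run_ids of in-flight Python model jobs

    def cancel_open(self):
        # Cancel open SQL cursors exactly as SQLConnectionManager does today.
        cancelled = super().cancel_open()
        # Then ask the Jobs API to cancel any still-running Python model runs.
        creds = self.profile.credentials  # assumes host/token are available here
        for run_id in list(self._python_run_ids):
            cancel_python_model_run(creds.host, creds.token, run_id)
        return cancelled
```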

gaoshihang commented 1 month ago

Yeah, I see what you mean. Right now DatabricksConnectionManager inherits from SparkConnectionManager, which inherits from SQLConnectionManager, so it uses the cancel_open() defined there.

But we still need a place to record all the run IDs for Python models. Where do you think it's appropriate to do this?

benc-db commented 1 month ago

Hmm, I don't see a direct path to hook these up :/ @jtcohen6 for visibility.

gaoshihang commented 1 month ago

I think the DatabricksConnectionManager class might be an appropriate place, since it maintains all the SQL connections.

When we submit a Python model (https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/python_submissions.py#L148) we can get the run_id, and then save all these run_ids in DatabricksConnectionManager.

Then we override cancel_open() to call super, traverse the run_id list, and call the Jobs API to kill them all.
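Something like this for the recording side (just a sketch; the register/forget methods are made-up names, not existing APIs):

```python
# Sketch only: a class-level registry that python_submissions.py could write
# into and cancel_open() could read from. Method names are made up.
from dbt.adapters.spark.connections import SparkConnectionManager


class DatabricksConnectionManager(SparkConnectionManager):
    _python_run_ids: set = set()

    @classmethod
    def register_python_run_id(cls, run_id: int) -> None:
        cls._python_run_ids.add(run_id)

    @classmethod
    def forget_python_run_id(cls, run_id: int) -> None:
        cls._python_run_ids.discard(run_id)


# In python_submissions.py, right after the Jobs API "runs/submit" call returns:
#     run_id = response["run_id"]
#     DatabricksConnectionManager.register_python_run_id(run_id)
#     try:
#         ...  # poll until the run reaches a terminal state
#     finally:
#         DatabricksConnectionManager.forget_python_run_id(run_id)
```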

benc-db commented 1 month ago

This would work, but I'm not sure how you are going to propagate the run_ids back to DatabricksConnectionManager. If you see a path I don't, by all means submit a PR :)

gaoshihang commented 1 month ago

Thanks! I'll try tomorrow. If that can't be done, could we write the run_ids to a file under the target directory? Do you think that could work?

benc-db commented 1 month ago

I believe that could work. My primary concern with that approach is ensuring the process handles concurrency appropriately.
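For example, even a simple append to a shared run_id file would need some locking so concurrent model threads don't interleave writes (illustrative only; the file name is made up):

```python
# Sketch only: thread-safe append of run_ids to a file under target/.
# Separate dbt invocations sharing the same target/ would still need
# cross-process locking, which this does not handle.
import threading
from pathlib import Path

_run_id_file_lock = threading.Lock()


def append_run_id(target_path: str, run_id: int) -> None:
    run_id_file = Path(target_path) / "python_model_run_ids.txt"  # made-up name
    with _run_id_file_lock:
        with run_id_file.open("a") as f:
            f.write(f"{run_id}\n")
```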

gaoshihang commented 1 month ago

Thanks, I'll try some code first.

gaoshihang commented 1 month ago

Hi @benc-db, is there a way we can get the target directory variable in python_submissions.py?

jtcohen6 commented 1 month ago

I agree that storing run/job IDs on DatabricksConnectionManager feels like the right way to go here. (We're investigating something similar for cancelling BigQuery jobs).

I don't think we should use files in target/ as a way to manage that state. It feels like we need a more straightforward way for python_submission_helpers to (optionally) return state back to the Adapter or ConnectionManager. I don't have an immediate instinct here, but I'm happy to open a dbt-adapters issue for further discussion.

gaoshihang commented 1 month ago

Hi @benc-db @jtcohen6, yes, I agree that we could store run IDs in DatabricksConnectionManager, but I can't find a path to integrate DatabricksConnectionManager with python_submissions.py.

I have another idea and tried some code: https://github.com/databricks/dbt-databricks/pull/690

The idea is to store each run_id as a file in Databricks's shared workspace (since we already use the workspace to store the Python model notebook). When the job finishes, we delete the file from the workspace; when the job is canceled, we delete the file too. This way, we can avoid concurrency problems between different jobs.
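Roughly like this (sketched with the databricks-sdk purely for illustration; the actual PR may differ, and the workspace path and function names here are made up):

```python
# Sketch only: one marker file per run_id in a shared workspace folder.
# Paths and names are illustrative, not the PR's actual code.
import io

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

RUNNING_DIR = "/Shared/dbt_python_models/running"  # made-up location


def record_run_id(w: WorkspaceClient, run_id: int) -> None:
    # Written right after the job run is submitted.
    w.workspace.mkdirs(RUNNING_DIR)
    w.workspace.upload(
        f"{RUNNING_DIR}/{run_id}",
        io.BytesIO(str(run_id).encode()),
        format=ImportFormat.AUTO,
        overwrite=True,
    )


def clear_run_id(w: WorkspaceClient, run_id: int) -> None:
    # Called when the run finishes or is canceled.
    w.workspace.delete(f"{RUNNING_DIR}/{run_id}")


def cancel_tracked_runs(w: WorkspaceClient) -> None:
    # On Ctrl+C: cancel every run that still has a marker file.
    for obj in w.workspace.list(RUNNING_DIR):
        run_id = int(obj.path.rsplit("/", 1)[-1])
        w.jobs.cancel_run(run_id=run_id)
        w.workspace.delete(obj.path)
```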

gaoshihang commented 1 month ago

Could you please give me some advice here? If this approach can work, I'll get the code ready. Thanks!