Closed gaoshihang closed 3 weeks ago
Thanks for the report, I had a suspicion this might be the case. Guess we need to figure out where dbt catches the cancel and put a call to the job api there.
Hi @benc-db, thanks for the reply! Do you have any idea where dbt catches it? Maybe I can try to do something on this.
Don't see a place offhand, so will reach out to dbt team.
Thank you very much! please let me know too!
Hi @benc-db ,
I think dbt executes the cancel action in https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/runnable.py#L354
Can we add a list to record all the Python model jobs in DatabricksAdapter: https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/impl.py#L145
And use the Databricks API to kill them in https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/runnable.py#L354?
I think if we override cancel_open() to call super, and then call the jobs api, that should do it. Right now we inherit cancel_open from the SQLConnectionManager.
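In rough shape, the override would look something like this (a sketch only — the real base class comes from dbt and is stubbed here so the pattern is runnable in isolation; all names are assumptions, not the actual dbt-databricks implementation):

```python
# Sketch: SQLConnectionManager is stubbed; in dbt it cancels open SQL cursors.
class SQLConnectionManager:
    def cancel_open(self):
        print("cancelling open SQL connections")
        return []


class DatabricksConnectionManager(SQLConnectionManager):
    def cancel_open(self):
        names = super().cancel_open()   # keep the existing SQL cancellation
        self._cancel_python_job_runs()  # then also cancel Python model runs
        return names

    def _cancel_python_job_runs(self):
        # Placeholder: would POST to /api/2.1/jobs/runs/cancel for each
        # tracked run_id, however those run_ids end up being recorded.
        print("cancelling Databricks job runs")
```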
Yeah, I see what you mean. Right now DatabricksConnectionManager inherits from SparkConnectionManager, which in turn inherits from SQLConnectionManager, so it uses the cancel_open() defined there.
But we still need a place to record all the run IDs for Python models. Where do you think is appropriate to do this?
Hmm, I don't see a direct path to hook these up :/. @jtcohen6 for viz.
I think the DatabricksConnectionManager class may be an appropriate place, since it maintains all the SQL connections.
When we submit a Python model in https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/python_submissions.py#L148 we can get the run_id, then save all these run_ids in DatabricksConnectionManager.
Then we override cancel_open() to call super, traverse the run_id list, and call the Jobs API to kill them all.
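A rough sketch of the tracking half of that plan (class and method names are assumptions, not actual dbt-databricks code; the documented Databricks endpoint for cancelling a run is POST /api/2.1/jobs/runs/cancel):

```python
# Hypothetical sketch: a thread-safe registry of Databricks job run_ids for
# in-flight Python models, with a best-effort cancel-all for cancel_open().
import json
import threading
import urllib.request


class PythonRunTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self._run_ids = set()

    def insert(self, run_id):
        # Called when a Python model submission returns its run_id.
        with self._lock:
            self._run_ids.add(run_id)

    def remove(self, run_id):
        # Called when the run finishes normally.
        with self._lock:
            self._run_ids.discard(run_id)

    def cancel_all(self, host, token):
        # Drain the set under the lock, then cancel each run via the Jobs API.
        with self._lock:
            run_ids, self._run_ids = self._run_ids, set()
        for run_id in run_ids:
            req = urllib.request.Request(
                f"https://{host}/api/2.1/jobs/runs/cancel",
                data=json.dumps({"run_id": run_id}).encode(),
                headers={
                    "Authorization": f"Bearer {token}",
                    "Content-Type": "application/json",
                },
            )
            urllib.request.urlopen(req)
```

The overridden cancel_open() would then call super().cancel_open() followed by tracker.cancel_all(...); the lock keeps concurrent model threads from corrupting the set.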
This would work, but I'm not sure how you're going to propagate the run_ids back to DatabricksConnectionManager. If you see a path I don't, by all means submit a PR :)
Thanks! I'll try tomorrow. If that can't work, could we write the run_ids to a file under the target directory? Do you think that could be a viable approach?
I believe that could work. My primary concern with that approach is ensuring the process handles concurrency appropriately.
Thanks, I'll try some code first.
Hi @benc-db, is there a way to get the target directory variable in python_submission.py?
I agree that storing run/job IDs on DatabricksConnectionManager feels like the right way to go here. (We're investigating something similar for cancelling BigQuery jobs.)
I don't think we should use files in target/ as a way to manage that state. It feels like we need a more straightforward way for python_submission_helpers to (optionally) return state back to the Adapter or ConnectionManager. I don't have an immediate instinct here, but I'm happy to open a dbt-adapters issue for further discussion.
Hi @benc-db @jtcohen6, yes, I agree that we could store run IDs in DatabricksConnectionManager, but I can't find a path to integrate DatabricksConnectionManager with python_submission.py.
I have another idea and tried some code: https://github.com/databricks/dbt-databricks/pull/690. The idea is: we store each run_id as a file in Databricks's shared workspace (we already use the workspace to store the Python model notebooks). When the job finishes, we delete the file from the workspace; when the job is canceled, we delete the file too. This way we avoid concurrency problems between different jobs.
Could you please give me some advice here? If this approach can work, I'll get the code ready. Thanks!
Describe the bug
We found that when a Python model is canceled from the terminal, there is no cancel method to stop the job cluster on Databricks.
We think this can cause wasted resources/cost, and data errors in some cases.
Steps To Reproduce
1. Include a SQL model and a Python model in the project.
2. Execute them together.
3. Use Control + C to exit the dbt execution.
4. You can see the SQL model on the SQL warehouse has been canceled successfully, but the Python model on the job cluster is still running.
5. The dbt execution will be stuck until the Python model finishes.
Expected behavior
All models should be canceled, whether SQL or Python.
Screenshots and log output
System information
The operating system you're using:
The output of python --version:
Additional context
Add any other context about the problem here.