JenspederM / kedro-databricks

A Databricks Plugin for Kedro
MIT License

Some jobs have too many tasks #32

Open astrojuanlu opened 3 months ago

astrojuanlu commented 3 months ago

https://linen-slack.kedro.org/t/22732083/announcing-kedro-databricks-https-github-com-jenspederm-kedr#0dffcc72-1ac6-4c4f-9bb0-47a654643cd0

> I have been trying out your plugin today. It sucks that Databricks limits a single job to a max of 100 tasks. I have more than 100 nodes in my default pipeline T.T which is why it is crashing...

A possible solution would be to group tasks, as kedro-airflow has done since https://github.com/kedro-org/kedro-plugins/pull/241 (cc @ankatiyar, @sbrugman).
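As a rough illustration of the grouping idea (not the kedro-airflow implementation, and not something this plugin does today), nodes could be chunked along a topological order so that the number of groups never exceeds the limit. Everything here is hypothetical: `group_nodes`, `MAX_TASKS_PER_JOB`, and the plain dict of node dependencies are stand-ins for whatever the plugin would actually read from the Kedro pipeline.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from math import ceil

# Assumption: the 100-tasks-per-job limit reported in this issue.
MAX_TASKS_PER_JOB = 100


def group_nodes(dependencies, max_tasks=MAX_TASKS_PER_JOB):
    """Chunk nodes into at most `max_tasks` groups.

    `dependencies` maps a node name to the set of node names it depends on.
    Returns (groups, group_deps): group name -> ordered node names, and
    group name -> set of upstream group names.
    """
    # A topological order guarantees every node comes after its upstream nodes.
    order = list(TopologicalSorter(dependencies).static_order())
    if not order:
        return {}, {}

    # Smallest chunk size that keeps the number of groups within the limit.
    size = ceil(len(order) / max_tasks)

    groups = {}
    node_to_group = {}
    for index, node in enumerate(order):
        name = f"group_{index // size}"
        groups.setdefault(name, []).append(node)
        node_to_group[node] = name

    # Consecutive chunks of a topological order can only depend on earlier
    # chunks, so the group-level graph stays acyclic.
    group_deps = {name: set() for name in groups}
    for node, upstream in dependencies.items():
        for parent in upstream:
            src, dst = node_to_group[parent], node_to_group[node]
            if src != dst:
                group_deps[dst].add(src)
    return groups, group_deps
```

Each group would then become one Databricks task that runs its nodes in order. The trade-off is coarser retries and less parallelism inside a group, so a smarter grouping (for example by namespace or by shared in-memory datasets) may be preferable where it fits under the limit.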

There may be other possible solutions too.

I'm wondering, does this "100 tasks per job" limitation depend on the Databricks config/cluster? Or is it universal?

datajoely commented 3 months ago

> I'm wondering, does this "100 tasks per job" limitation depend on the Databricks config/cluster? Or is it universal?

My hunch is 1 task is 1 container - the same reason you shouldn't need 28 containers to run spaceflights

JenspederM commented 3 months ago

> > I'm wondering, does this "100 tasks per job" limitation depend on the Databricks config/cluster? Or is it universal?
>
> My hunch is 1 task is 1 container - the same reason you shouldn't need 28 containers to run spaceflights

I think you're right. As far as I can tell, it's universal, but I haven't found any mention of it outside of this community thread from 2021: https://community.databricks.com/t5/data-engineering/how-many-jobs-can-i-create-in-my-databricks-workspace/td-p/18111