JenspederM / kedro-databricks

A Databricks Plugin for Kedro

Unlink databricks target and kedro environment #38

Closed npfp closed 3 months ago

npfp commented 3 months ago

Currently a kedro environment is bound to a Databricks Asset Bundle target. This doesn't allow managing the two independently.

Our use case: we have several independent pipelines (e.g. first_pipeline, second_pipeline) in a single repo, each with its own kedro environment, and we deploy each of them to both a dev and a prod Databricks target.

What we'd like to achieve is to be able to run

kedro databricks deploy --target dev --env first_pipeline

But currently, since the env parameter is used for both the target and the kedro environment, this would require us to define first_pipeline_dev, second_pipeline_dev, first_pipeline_prod, and second_pipeline_prod kedro environments, plus the equivalent targets.
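
Concretely, decoupling the two would let us cover the full matrix with just two targets and two envs (pipeline names here are hypothetical):

kedro databricks deploy --target dev --env first_pipeline
kedro databricks deploy --target dev --env second_pipeline
kedro databricks deploy --target prod --env first_pipeline
kedro databricks deploy --target prod --env second_pipeline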

JenspederM commented 3 months ago

@npfp I'm guessing you mean the other way around?

I'm all for adding a --target arg to the CLI to allow deploying a single pipeline, but I feel that --env is more commonly used to specify things such as dev, prod, etc.

npfp commented 3 months ago

Just to be sure we're using the same terminology:

By target, I meant an entry under targets in the databricks.yml file: the target Databricks environment (which can live in a different workspace, e.g. development, staging, production).

By env, I meant a kedro conf folder, for example base, local, etc. (https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments)
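
To make that concrete, a minimal sketch of the target side in databricks.yml (the workspace hosts are hypothetical):

    targets:
        dev:
            workspace:
                host: https://dev.cloud.databricks.com
        prod:
            workspace:
                host: https://prod.cloud.databricks.com

while the kedro envs are just folders under conf/ (conf/base, conf/local, conf/first_pipeline, ...), each selectable with --env.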

I see your point; indeed, we might have misused kedro environments by creating one environment per pipeline. If you have any links on how to handle multiple independent pipelines in a single repo, that would be much appreciated!

npfp commented 3 months ago

What I find confusing is that locally I can run:

kedro run --pipeline=my_pipeline --env=my_env

to run with a particular environment, but I can't find a way to do the same within Databricks.

If I look at the definition of the python_wheel_task in databricks.yml, I see no usage of env:

    python_wheel_task:
        package_name: my_package
        entry_point: databricks_run
        parameters:
            - --nodes
            - my_node
            - --conf-source
            - /dbfs/FileStore/my_package/conf
            - --package-name
            - my_package
    libraries:
        - whl: ../dist/*.whl

where I would expect to have something like:

        parameters:
            - --nodes
            - my_node
            - --conf-source
            - /dbfs/FileStore/my_package/conf
            - --package-name
            - my_package
            - --env
            - my_env

Am I missing something?

JenspederM commented 3 months ago

No, you are right! I'll have to think about this.

I use `kedro package` to build the wheel and compress conf, and there the env is used to grab only the relevant configuration. I thought that would be enough, but the configuration still sits in an env folder, so --env is required in the task spec.
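
Roughly, the packaged artifacts look like this (a sketch; file names are illustrative):

    dist/my_package-0.1.0-py3-none-any.whl
    dist/conf-my_package.tar.gz    # only the selected env's config is included,
                                   # but it still sits under its env folder, e.g. my_env/

so the databricks_run entry point still needs --env to find the right folder under --conf-source.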

JenspederM commented 3 months ago

Good catch!

npfp commented 3 months ago

Great, thanks for the confirmation! In case it helps, here is an attempt to fix this:

https://github.com/JenspederM/kedro-databricks/pull/39