dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.73k stars 1.48k forks

Specify which folders to include in the .zip that is created and uploaded by the databricks_pyspark_step_launcher to Databricks #18099

Open weberdavid opened 1 year ago

weberdavid commented 1 year ago

Discussed in https://github.com/dagster-io/dagster/discussions/17948

Originally posted by **weberdavid** November 13, 2023

**Context**

We use the `databricks_pyspark_step_launcher` to execute assets on Databricks. Since we also deal with dbt integrations, plus a lot of other code that is not connected to the ingestion jobs running on Databricks, we want to be careful about what is included in the .zip generated by the `databricks_pyspark_step_launcher`.

**Problem**

In some instances the .zip file gets quite big (due to the number of files in the repos), and upload time to Databricks increases tremendously. We want to avoid waiting several minutes just for the code to arrive on Databricks. An additional problem we are starting to see is that automatically created folders in the dbt target folder (e.g. when you load each dbt model individually) become too long on Windows, exceeding the maximum path length of 256 characters. This results in a `FileNotFoundError: [WinError 2] The system cannot find the file specified` once the `databricks_pyspark_step_launcher` tries to zip the dbt target folder.

**Question / Solution**

There should be a way to specify which folders are included when the `databricks_pyspark_step_launcher` zips the code. I investigated a bit, and [this](https://github.com/dagster-io/dagster/blob/8978f184b8b041e7f37fbdd886a1800d27fc87a2/python_modules/libraries/dagster-databricks/dagster_databricks/databricks_pyspark_step_launcher.py#L492) appears to be the code where the zipping is executed. The referenced function `build_pyspark_zip` ([here](https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-pyspark/dagster_pyspark/utils.py)) takes an `exclude` parameter, to which you can pass a list of folders to exclude from the zip. However, this option is not propagated up to the `databricks_pyspark_step_launcher`.
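To make the requested behavior concrete, here is a minimal sketch of a directory-zipping helper with an exclusion filter, in the spirit of the `exclude` parameter on `build_pyspark_zip`. This is illustrative only: `zip_directory` is a hypothetical function, not the actual dagster implementation.

```python
import os
import zipfile
from fnmatch import fnmatch


def zip_directory(src_dir, zip_path, exclude=()):
    """Zip src_dir into zip_path, skipping anything matching an exclude pattern.

    Hypothetical sketch of the idea behind build_pyspark_zip's `exclude`
    parameter (e.g. exclude=("target",) to skip a dbt target folder);
    not dagster's actual code.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            # Prune excluded directories in place so os.walk never descends
            # into them (avoids both upload bloat and long-path errors).
            dirs[:] = [d for d in dirs if not any(fnmatch(d, p) for p in exclude)]
            for name in files:
                if any(fnmatch(name, p) for p in exclude):
                    continue
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, src_dir))
```

Because excluded directories are pruned before `os.walk` descends into them, their contents are never even stat-ed, which also sidesteps the Windows path-length failure for files inside an excluded folder.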
If there is no other way to achieve this, I am happy to create an official issue and work on a PR that adds this "exclude option" at the `databricks_pyspark_step_launcher` level.
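Until such an option exists, the long-path failure could at least be surfaced early with a pre-flight check before zipping. The helper below is a hypothetical sketch (not part of dagster); note that the commonly cited Windows default limit is MAX_PATH = 260 characters.

```python
import os

MAX_PATH = 260  # Windows default path-length limit without long-path support


def paths_over_limit(src_dir, limit=MAX_PATH):
    """Return absolute file paths under src_dir that exceed the limit.

    Hypothetical pre-flight check to surface the WinError 2 failures
    described above before the zip step runs; not part of dagster.
    """
    offenders = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full = os.path.abspath(os.path.join(root, name))
            if len(full) > limit:
                offenders.append(full)
    return offenders
```

Running this against the project root before an upload would list the over-long paths (typically deep inside the dbt target folder), making clear which directories need excluding.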
weberdavid commented 1 year ago

I created a PR, #18151, that solves this issue.

As the Windows path-length problem is a real blocker for us, we would urgently welcome this PR and the ability it provides to circumvent the issue.